Tau-Bench measures how well an agent can act as a customer service representative — reading a policy wiki, using data-access tools, and satisfying a simulated user request without violating business rules. Tasks are graded by comparing the final database state against a ground-truth action sequence. An agent scores 1.0 only when its tool calls produce exactly the right state transitions and its responses contain all required output values.

QitOS ships TauBenchAdapter and a self-contained runtime (TauRuntimeEnv) so you can run Tau-Bench without installing the upstream tau_bench package. All task data, tools, wiki, and rules are vendored under qitos.benchmark.tau_bench.port.
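The "all required output values" half of the grading criterion can be pictured as a case-insensitive substring scan over the agent's responses. This is an illustrative sketch of the idea, not QitOS's actual checker:

```python
def outputs_satisfied(required_outputs, agent_responses):
    """Return True when every required output string appears
    (case-insensitively) somewhere in the agent's responses.
    Illustrative approximation of the output-value check."""
    haystack = " ".join(agent_responses).lower()
    return all(out.lower() in haystack for out in required_outputs)
```

Both halves must hold: a perfect final database state with a missing output string still scores 0.0.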

Environments

| Environment | Splits available | Terminate tool |
| --- | --- | --- |
| retail | train, dev, test | transfer_to_human_agents |
| airline | test | transfer_to_human_agents |

Setup

1. Install benchmark dependencies:

pip install "qitos[benchmarks]"

2. Set your model API key:

export OPENAI_API_KEY="sk-..."
Tau-Bench task data is vendored inside the QitOS package. You do not need to download any external dataset.

Loading tasks

from qitos.benchmark import TauBenchAdapter

adapter = TauBenchAdapter(env_name="retail", task_split="test")
records = adapter.load_records()
tasks = adapter.to_tasks(records, split="test", limit=5)

print(tasks[0].id)          # "tau_retail_test_00000"
print(tasks[0].objective)   # Customer service instruction
print(tasks[0].inputs["env"])   # "retail"
print(tasks[0].inputs["reference_outputs"])  # Expected response strings
One-line convenience loader:
from qitos.benchmark.tau_bench.adapter import load_tau_bench_tasks

tasks = load_tau_bench_tasks(env_name="retail", split="test", limit=10)

Configuration

TauBenchAdapter accepts the following parameters:
| Parameter | Default | Description |
| --- | --- | --- |
| env_name | "retail" | Environment to load: "retail" or "airline" |
| task_split | "test" | Split to load: "train", "dev", or "test" |
| default_max_steps | 30 | Step budget per task |
| include_raw_record | True | Attach the raw task dict to task.metadata |

Running the evaluation

Start with the official CLI:
qit bench run \
  --benchmark tau-bench \
  --split test \
  --subset retail \
  --limit 50 \
  --output results/tau_retail_test.jsonl \
  --model-name "Qwen/Qwen3-8B"
Then aggregate and inspect:
qit bench eval --input results/tau_retail_test.jsonl --json
qita board --logdir runs
The bundled tau_bench_eval.py script remains available as a benchmark-specific wrapper that produces the same official result shape and trace contract. Run a single task:
python examples/benchmarks/tau_bench_eval.py \
  --tau-env retail \
  --tau-split test \
  --task-index 0 \
  --max-steps 30 \
  --model-name "Qwen/Qwen3-8B" \
  --api-key "$OPENAI_API_KEY"
Run the full benchmark with multiple trials:
python examples/benchmarks/tau_bench_eval.py \
  --run-all \
  --tau-env retail \
  --tau-split test \
  --num-trials 3 \
  --limit 50 \
  --concurrency 4 \
  --output-jsonl results/tau_retail_test.jsonl \
  --trace-logdir runs \
  --model-name "Qwen/Qwen3-8B" \
  --api-key "$OPENAI_API_KEY"
Resume after interruption:
python examples/benchmarks/tau_bench_eval.py \
  --run-all \
  --resume \
  --output-jsonl results/tau_retail_test.jsonl \
  --tau-env retail \
  --api-key "$OPENAI_API_KEY"
Use --num-trials 5 and --shuffle to compute pass^k metrics. Each trial uses a different random seed derived from --seed.
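Under the standard tau-bench definition, pass^k is the probability that k independently sampled trials of a task all succeed, averaged over tasks. Given n trials per task with c successes, it can be computed combinatorially. This is a sketch of the formula, not the script's internal code:

```python
from math import comb

def pass_hat_k(num_trials, num_successes, k):
    """P(all k sampled trials succeed) for one task, given
    num_successes out of num_trials independent attempts."""
    if k > num_trials:
        raise ValueError("k cannot exceed the number of trials")
    return comb(num_successes, k) / comb(num_trials, k)

def aggregate_pass_hat_k(per_task_results, k):
    """per_task_results: list of (num_trials, num_successes) pairs,
    one per task. Returns the mean pass^k across tasks."""
    values = [pass_hat_k(n, c, k) for n, c in per_task_results]
    return sum(values) / len(values)
```

Note that pass^k is non-increasing in k: demanding that more trials all succeed can only lower the score, which is why the k=5 figure is the strictest consistency measure.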

How the runtime works

TauRuntimeEnv is a minimal drop-in for the upstream Tau environment. It exposes a reset / step / calculate_reward interface:
from qitos.benchmark.tau_bench.runtime import TauAction, get_tau_runtime_env

env = get_tau_runtime_env(env_name="retail", task_split="test", task_index=0)
reset_response = env.reset()
print(reset_response.observation)  # Customer instruction

# Each tool call goes through env.step()
response = env.step(TauAction(name="get_order", kwargs={"order_id": "O123"}))
print(response.observation)  # Tool result
print(response.done)         # True when terminated
print(response.reward)       # 1.0 or 0.0 (only set when done=True)
Reward is computed by replaying the ground-truth action sequence on a fresh data state and comparing the resulting hash to the hash of the agent's final data state. A reward of 1.0 requires both a matching state hash and all expected output strings present in the agent's responses.
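The hash comparison can be pictured as hashing a canonical JSON serialization of each database, so that two states match exactly when their data is identical. This is an illustrative sketch; QitOS's actual serialization and hash function may differ:

```python
import hashlib
import json

def state_hash(db):
    """Hash a database dict deterministically: sorted keys and fixed
    separators make the serialization canonical before hashing."""
    canonical = json.dumps(db, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# The state component of the reward is earned only when the agent's
# final database hashes identically to the ground-truth replay.
ground_truth = {"orders": {"O123": {"status": "cancelled"}}}
agent_final = {"orders": {"O123": {"status": "cancelled"}}}
assert state_hash(ground_truth) == state_hash(agent_final)
```

Because the comparison is exact, any extra or missing write (a stray refund, a skipped cancellation) changes the hash and zeroes the reward.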

Task structure

Task(
    id="tau_retail_test_00000",
    objective="I need to cancel order O123 and get a refund.",
    inputs={
        "benchmark": "tau-bench",
        "env": "retail",
        "split": "test",
        "instruction": "I need to cancel order O123 and get a refund.",
        "reference_outputs": ["Your order has been cancelled."],
        "reference_actions": [{"name": "cancel_order", "kwargs": {"order_id": "O123"}}],
        "user_id": "U42",
    },
    env_spec=EnvSpec(
        type="tau_bench",
        capabilities=["tau.step", "tau.reward", "tau.tool_call"],
    ),
    budget=TaskBudget(max_steps=30),
)

Expected output

Each result line in the output JSONL file contains:
{
  "task_id": "tau_retail_test_00000",
  "idx": 0,
  "trial": 0,
  "env": "retail",
  "split": "test",
  "reward": 1.0,
  "success": true,
  "eval_score": 1.0,
  "stop_reason": "final",
  "steps": 12,
  "latency_seconds": 8.4
}
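Because each result line is standalone JSON, the file is easy to post-process yourself. A minimal aggregation over the fields shown above (a hypothetical helper, not part of QitOS):

```python
import json
from collections import Counter

def summarize(jsonl_path):
    """Compute average reward, success rate, and stop-reason
    distribution from a Tau-Bench results JSONL file."""
    records = []
    with open(jsonl_path) as f:
        for line in f:
            if line.strip():
                records.append(json.loads(line))
    n = len(records)
    reasons = Counter(r["stop_reason"] for r in records)
    return {
        "avg_reward": sum(r["reward"] for r in records) / n,
        "success_rate": sum(r["success"] for r in records) / n,
        "stop_reason_distribution": {k: v / n for k, v in reasons.items()},
    }
```

For the official aggregate numbers, prefer qit bench eval; a helper like this is useful for ad-hoc slicing, e.g. filtering by env or trial before averaging.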
After a full run, the script prints aggregate metrics aligned with the upstream tau-bench evaluation:
[Tau-Bench] Metrics (aligned with tau-bench run.py)
- avg_reward: 0.74
- pass^k:
  - k=1: 0.74
  - k=3: 0.81
  - k=5: 0.86
- reward_success_rate: 0.74
- mean_steps: 14.2
- stop_reason_distribution: {"final": 0.9, "max_steps": 0.1}
Inspect individual traces with qita:
qita board --logdir runs
qita replay --run runs/qitos_tau_retail_00000_trial0_20250101_120000
The airline environment only supports the test split. Requesting train or dev will raise a ValueError.