Tau-Bench measures how well an agent can act as a customer service representative — reading a policy wiki, using data-access tools, and satisfying a simulated user request without violating business rules.
Tasks are graded by comparing the agent's final database state against the state produced by replaying a ground-truth action sequence (an action is a normalized tool invocation emitted by the policy). An agent scores 1.0 only when its tool calls produce exactly the right state transitions and its responses contain all required output values.
QitOS ships TauBenchAdapter and a self-contained runtime (TauRuntimeEnv) so you can run Tau-Bench without installing the upstream tau_bench package. All task data, tools, wiki, and rules are vendored under qitos.benchmark.tau_bench.port.
Environments
| Environment | Splits available | Terminate tool |
|---|---|---|
| retail | train, dev, test | transfer_to_human_agents |
| airline | test | transfer_to_human_agents |
Setup
Install benchmark dependencies
pip install "qitos[benchmarks]"
Set your model API key
export OPENAI_API_KEY="sk-..."
Tau-Bench task data is vendored inside the QitOS package. You do not need to download any external dataset.
Loading tasks
from qitos.benchmark import TauBenchAdapter
adapter = TauBenchAdapter(env_name="retail", task_split="test")
records = adapter.load_records()
tasks = adapter.to_tasks(records, split="test", limit=5)
print(tasks[0].id) # "tau_retail_test_00000"
print(tasks[0].objective) # Customer service instruction
print(tasks[0].inputs["env"]) # "retail"
print(tasks[0].inputs["reference_outputs"]) # Expected response strings
One-line convenience loader:
from qitos.benchmark.tau_bench.adapter import load_tau_bench_tasks
tasks = load_tau_bench_tasks(env_name="retail", split="test", limit=10)
Configuration
TauBenchAdapter accepts the following parameters:
| Parameter | Default | Description |
|---|---|---|
| env_name | "retail" | Environment to load: "retail" or "airline" |
| task_split | "test" | Split to load: "train", "dev", or "test" |
| default_max_steps | 30 | Step budget per task |
| include_raw_record | True | Attach raw task dict to task.metadata |
Running the evaluation
Start with the official CLI:
qit bench run \
--benchmark tau-bench \
--split test \
--subset retail \
--limit 50 \
--output results/tau_retail_test.jsonl \
--model-name "Qwen/Qwen3-8B"
Then aggregate and inspect:
qit bench eval --input results/tau_retail_test.jsonl --json
qita board --logdir runs
The bundled tau_bench_eval.py script remains available as a benchmark-specific wrapper. It produces the same result format and follows the same trace contract as the official CLI.
Run a single task:
python examples/benchmarks/tau_bench_eval.py \
--tau-env retail \
--tau-split test \
--task-index 0 \
--max-steps 30 \
--model-name "Qwen/Qwen3-8B" \
--api-key "$OPENAI_API_KEY"
Run the full benchmark with multiple trials:
python examples/benchmarks/tau_bench_eval.py \
--run-all \
--tau-env retail \
--tau-split test \
--num-trials 3 \
--limit 50 \
--concurrency 4 \
--output-jsonl results/tau_retail_test.jsonl \
--trace-logdir runs \
--model-name "Qwen/Qwen3-8B" \
--api-key "$OPENAI_API_KEY"
Resume after interruption:
python examples/benchmarks/tau_bench_eval.py \
--run-all \
--resume \
--output-jsonl results/tau_retail_test.jsonl \
--tau-env retail \
--api-key "$OPENAI_API_KEY"
Use --num-trials 5 and --shuffle to compute pass^k metrics. Each trial uses a different random seed derived from --seed.
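As a rough illustration of how pass^k can be derived from per-trial results, the sketch below assumes the estimator used by the upstream tau-bench run.py: C(c, k) / C(n, k) averaged over tasks, where n is the number of trials per task and c is the number of successful trials. The helper name pass_hat_k and the input rows are hypothetical and not part of QitOS; only the task_id and reward fields are taken from the result format described on this page.

```python
from collections import defaultdict
from math import comb


def pass_hat_k(results, k):
    """Estimate pass^k from result rows with task_id and reward fields.

    Hypothetical helper: assumes every task was run for the same number
    of trials and that a reward of 1.0 marks a successful trial.
    """
    trials = defaultdict(int)     # task_id -> number of trials
    successes = defaultdict(int)  # task_id -> number of successful trials
    for row in results:
        trials[row["task_id"]] += 1
        if row["reward"] >= 1.0:
            successes[row["task_id"]] += 1

    counts = set(trials.values())
    if len(counts) != 1:
        raise ValueError("all tasks must have the same number of trials")
    n = counts.pop()
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of trials")

    # Chance that k trials drawn without replacement all succeeded.
    per_task = [comb(successes[t], k) / comb(n, k) for t in trials]
    return sum(per_task) / len(per_task)


# Made-up rows for illustration: task "a" succeeds in 2 of 3 trials,
# task "b" succeeds in all 3.
rows = [
    {"task_id": "a", "trial": 0, "reward": 1.0},
    {"task_id": "a", "trial": 1, "reward": 1.0},
    {"task_id": "a", "trial": 2, "reward": 0.0},
    {"task_id": "b", "trial": 0, "reward": 1.0},
    {"task_id": "b", "trial": 1, "reward": 1.0},
    {"task_id": "b", "trial": 2, "reward": 1.0},
]
for k in (1, 2, 3):
    print(k, round(pass_hat_k(rows, k), 3))
```

With these rows, task "a" contributes 2/3, 1/3, and 0 for k = 1, 2, 3, while task "b" always contributes 1.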
How the runtime works
TauRuntimeEnv is a minimal drop-in for the upstream Tau environment. It exposes a reset / step / calculate_reward interface:
from qitos.benchmark.tau_bench.runtime import get_tau_runtime_env
env = get_tau_runtime_env(env_name="retail", task_split="test", task_index=0)
reset_response = env.reset()
print(reset_response.observation) # Initial observation returned by reset()
# Each tool call goes through env.step().
# TauAction must be imported separately (its import is omitted here).
response = env.step(TauAction(name="get_order", kwargs={"order_id": "O123"}))
print(response.observation) # Tool result
print(response.done) # True when terminated
print(response.reward) # 1.0 or 0.0 (only set when done=True)
Reward is computed by replaying the ground-truth action sequence on a fresh data state and comparing its hash to the agent’s final data state. A reward of 1.0 requires both the correct state hash and all expected output strings present in agent responses.
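The reward rule described above can be sketched with a toy in-memory database. This is only an illustration under stated assumptions, not the TauRuntimeEnv implementation: the state_hash and calculate_reward helpers, the TOOLS registry, and the toy cancel_order tool are all hypothetical, and output matching is simplified to a plain substring check.

```python
import copy
import hashlib
import json


def state_hash(data):
    """Hash a JSON-serializable data state independently of key order."""
    payload = json.dumps(data, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def cancel_order(data, order_id):
    """Toy tool: mark an order as cancelled (stand-in for a real tool)."""
    data["orders"][order_id]["status"] = "cancelled"


# Hypothetical tool registry mapping action names to functions.
TOOLS = {"cancel_order": cancel_order}


def calculate_reward(initial_data, final_data, reference_actions,
                     reference_outputs, agent_responses):
    """Return 1.0 only if the state hash and all expected outputs match."""
    # Replay the ground-truth actions on a fresh copy of the initial state.
    expected = copy.deepcopy(initial_data)
    for action in reference_actions:
        TOOLS[action["name"]](expected, **action["kwargs"])

    if state_hash(expected) != state_hash(final_data):
        return 0.0

    # Every expected output string must appear in some agent response.
    for output in reference_outputs:
        if not any(output in response for response in agent_responses):
            return 0.0
    return 1.0


initial = {"orders": {"O123": {"status": "pending"}}}
actions = [{"name": "cancel_order", "kwargs": {"order_id": "O123"}}]
responses = ["Your order has been cancelled."]

# Simulate an agent that performed the correct cancellation.
agent_state = copy.deepcopy(initial)
cancel_order(agent_state, "O123")

print(calculate_reward(initial, agent_state, actions, ["cancelled"], responses))
print(calculate_reward(initial, initial, actions, ["cancelled"], responses))
```

The first call matches on both state and output text; the second passes an unmodified state, so the hashes differ and the reward is zero.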
Task structure
Task(
id="tau_retail_test_00000",
objective="I need to cancel order O123 and get a refund.",
inputs={
"benchmark": "tau-bench",
"env": "retail",
"split": "test",
"instruction": "I need to cancel order O123 and get a refund.",
"reference_outputs": ["Your order has been cancelled."],
"reference_actions": [{"name": "cancel_order", "kwargs": {"order_id": "O123"}}],
"user_id": "U42",
},
env_spec=EnvSpec(
type="tau_bench",
capabilities=["tau.step", "tau.reward", "tau.tool_call"],
),
budget=TaskBudget(max_steps=30),
)
Expected output
Each result line in the output JSONL file contains:
{
"task_id": "tau_retail_test_00000",
"idx": 0,
"trial": 0,
"env": "retail",
"split": "test",
"reward": 1.0,
"success": true,
"eval_score": 1.0,
"stop_reason": "final",
"steps": 12,
"latency_seconds": 8.4
}
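For a quick look at a results file without the bundled tooling, result lines of this shape can be aggregated with the standard library. The summarize helper below is a hypothetical sketch, not a QitOS function; it relies only on the reward, steps, and stop_reason fields shown above, and the two sample lines are made up.

```python
import json
from collections import Counter


def summarize(lines):
    """Aggregate result lines (one JSON object per line) into summary metrics."""
    rows = [json.loads(line) for line in lines if line.strip()]
    if not rows:
        raise ValueError("no result rows to summarize")
    stops = Counter(row["stop_reason"] for row in rows)
    return {
        "avg_reward": sum(row["reward"] for row in rows) / len(rows),
        "mean_steps": sum(row["steps"] for row in rows) / len(rows),
        "stop_reason_distribution": {
            reason: count / len(rows) for reason, count in stops.items()
        },
    }


# Made-up sample lines in the result shape described above.
sample = [
    '{"task_id": "tau_retail_test_00000", "trial": 0, "reward": 1.0, '
    '"stop_reason": "final", "steps": 12}',
    '{"task_id": "tau_retail_test_00001", "trial": 0, "reward": 0.0, '
    '"stop_reason": "max_steps", "steps": 30}',
]
print(summarize(sample))
```

Because summarize accepts any iterable of lines, an open file handle for the output JSONL can be passed in directly.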
After a full run, the script prints aggregate metrics aligned with the upstream tau-bench evaluation:
[Tau-Bench] Metrics (aligned with tau-bench run.py)
- avg_reward: 0.74
- pass^k:
- k=1: 0.74
- k=3: 0.81
- k=5: 0.86
- reward_success_rate: 0.74
- mean_steps: 14.2
- stop_reason_distribution: {"final": 0.9, "max_steps": 0.1}
Inspect individual traces with qita:
qita board --logdir runs
qita replay --run runs/qitos_tau_retail_00000_trial0_20250101_120000
The airline environment only supports the test split. Requesting train or dev will raise a ValueError.
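To catch an unsupported combination before starting a long run, a small pre-flight check along these lines may help. It is a sketch only: check_split and SUPPORTED_SPLITS are hypothetical names, not part of QitOS, and the split lists are copied from the Environments table on this page.

```python
# Splits per environment, copied from the Environments table.
SUPPORTED_SPLITS = {
    "retail": ("train", "dev", "test"),
    "airline": ("test",),
}


def check_split(env_name, task_split):
    """Hypothetical pre-flight check; not a QitOS function."""
    if env_name not in SUPPORTED_SPLITS:
        raise ValueError(f"unknown environment: {env_name!r}")
    if task_split not in SUPPORTED_SPLITS[env_name]:
        allowed = ", ".join(SUPPORTED_SPLITS[env_name])
        raise ValueError(
            f"{env_name!r} does not provide split {task_split!r} "
            f"(available: {allowed})"
        )
    return env_name, task_split


print(check_split("retail", "dev"))
try:
    check_split("airline", "train")
except ValueError as err:
    print(f"rejected: {err}")
```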