QitOS treats benchmark work as part of the same runtime story as normal agent runs. That means benchmark execution, replay, export, and result aggregation all sit on top of the same core primitives:
  • Task
  • RunSpec
  • ExperimentSpec
  • TraceWriter
  • BenchmarkRunResult
  • qita

Supported benchmarks

| Benchmark       | Domain                                   | Primary metric             |
| --------------- | ---------------------------------------- | -------------------------- |
| Desktop Starter | Computer use starter baseline            | Success / failure taxonomy |
| OSWorld         | Desktop / computer-use benchmark adapter | OSWorld evaluator score    |
| GAIA            | General AI assistant tasks               | Exact match                |
| Tau-Bench       | Tool-agent-user interaction              | Reward / pass^k            |
| CyBench         | CTF-style security evaluation            | Guided subtask score       |

Canonical benchmark path

The canonical CLI commands are:
qit bench run ...
qit bench eval ...
qit bench replay ...
qit bench export ...
The scripts under examples/benchmarks/ still exist, but in v0.3 they are thin wrappers over the same result and trace contract. QitOS now also keeps benchmark work in three explicit layers:
  • framework: shared runtime, env, harness, and qita capabilities
  • benchmark: dataset/runtime/evaluator/scorer integration under qitos.benchmark.*
  • recipe: reproducible baseline methods under qitos.recipes.*
That split is what lets starter benchmarks, real benchmark adapters, and reusable baselines coexist without leaking into examples/.

Why this matters

Because all benchmark outputs share one shape, you can:
  • compare runs across benchmarks
  • aggregate summary metrics without per-benchmark glue code
  • use the same replay and export surface everywhere
  • reason about benchmark regressions with the same qita workflow as normal runs
  • keep starter benchmarks and real benchmark adapters separate without changing the artifact contract
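As a minimal sketch of that cross-benchmark aggregation, assuming result rows are JSONL objects carrying the documented `benchmark`, `success`, and `steps` fields (the sample data here is synthetic, not real benchmark output):

```python
import json
from collections import defaultdict

# Synthetic rows in the shared result shape; a real file would come from
# `qit bench run --output ...`.
rows_jsonl = """\
{"task_id": "t1", "benchmark": "tau-bench", "success": true, "steps": 12}
{"task_id": "t2", "benchmark": "tau-bench", "success": false, "steps": 30}
{"task_id": "g1", "benchmark": "gaia", "success": true, "steps": 7}
"""

def summarize(lines: str) -> dict:
    """Group result rows by benchmark and compute per-benchmark metrics.

    Because every row shares one shape, no per-benchmark glue code is
    needed here: the same loop handles tau-bench and GAIA rows alike.
    """
    grouped = defaultdict(list)
    for line in lines.splitlines():
        if line.strip():
            row = json.loads(line)
            grouped[row["benchmark"]].append(row)
    return {
        bench: {
            "n": len(rs),
            "success_rate": sum(r["success"] for r in rs) / len(rs),
            "mean_steps": sum(r["steps"] for r in rs) / len(rs),
        }
        for bench, rs in grouped.items()
    }

print(summarize(rows_jsonl))
```

The same loop would work unchanged for any benchmark listed above, which is the point of the shared artifact contract.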

What a benchmark run produces

A benchmark run may produce two artifact layers:
  1. a trace directory with manifest.json, events.jsonl, and steps.jsonl
  2. a JSONL file of normalized BenchmarkRunResult rows
Each result row includes fields such as:
  • task_id
  • benchmark
  • split
  • prediction
  • success
  • stop_reason
  • steps
  • latency_seconds
  • token_usage
  • cost
  • trace_run_dir
  • run_spec_ref
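To make the row shape concrete, here is an illustrative sketch of a local mirror of those fields. The field names come from the list above; the types, the `parse_row` helper, and the sample values are assumptions for illustration, not the real `BenchmarkRunResult` class from qitos:

```python
import json
from dataclasses import dataclass

# Hypothetical local mirror of the documented result-row fields.
# Types are assumed; the actual qitos BenchmarkRunResult may differ.
@dataclass
class ResultRow:
    task_id: str
    benchmark: str
    split: str
    prediction: str
    success: bool
    stop_reason: str
    steps: int
    latency_seconds: float
    token_usage: dict
    cost: float
    trace_run_dir: str
    run_spec_ref: str

def parse_row(line: str) -> ResultRow:
    """Parse one JSONL line into a typed row, dropping unknown keys."""
    data = json.loads(line)
    known = ResultRow.__dataclass_fields__.keys()
    return ResultRow(**{k: v for k, v in data.items() if k in known})

# Synthetic example row (values invented for illustration).
sample = json.dumps({
    "task_id": "retail_001",
    "benchmark": "tau-bench",
    "split": "test",
    "prediction": "order refunded",
    "success": True,
    "stop_reason": "finish",
    "steps": 14,
    "latency_seconds": 42.5,
    "token_usage": {"prompt": 1800, "completion": 350},
    "cost": 0.012,
    "trace_run_dir": "./runs/tau_retail_001",
    "run_spec_ref": "runspec-abc123",
})
row = parse_row(sample)
print(row.benchmark, row.success)
```

A typed mirror like this is handy for downstream analysis scripts, since it fails loudly if a row is missing a required field.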

Example

qit bench run \
  --benchmark tau-bench \
  --split test \
  --subset retail \
  --limit 10 \
  --output ./results/tau_retail_test.jsonl \
  --model-name "Qwen/Qwen3-8B"
Then aggregate and inspect:
qit bench eval --input ./results/tau_retail_test.jsonl --json
qita board --logdir ./runs