QitOS treats benchmark work as part of the same runtime story as normal agent runs. That means benchmark execution, replay, export, and result aggregation all sit on top of the same core primitives:
  • Task
  • RunSpec
  • ExperimentSpec
  • TraceWriter
  • BenchmarkRunResult
  • qita

Supported benchmarks

| Benchmark       | Domain                                   | Primary metric             |
| --------------- | ---------------------------------------- | -------------------------- |
| Desktop Starter | Computer use starter baseline            | Success / failure taxonomy |
| OSWorld         | Desktop / computer-use benchmark adapter | OSWorld evaluator score    |
| GAIA            | General AI assistant tasks               | Exact match                |
| Tau-Bench       | Tool-agent-user interaction              | Reward / pass^k            |
| CyBench         | CTF-style security evaluation            | Guided subtask score       |

Canonical benchmark path

The canonical CLI commands are:
qit bench run ...
qit bench eval ...
qit bench replay ...
qit bench export ...
The scripts under examples/benchmarks/ still exist, but in v0.3 they are thin wrappers over the same result and trace contract. QitOS now also keeps benchmark work in three explicit layers:
  • framework: shared runtime, env, harness, and qita capabilities
  • benchmark: dataset/runtime/evaluator/scorer integration under qitos.benchmark.*
  • recipe: reproducible baseline methods under qitos.recipes.*
That split is what lets starter benchmarks, real benchmark adapters, and reusable baselines coexist without leaking into examples/.

Why this matters

Because all benchmark outputs share one shape, you can:
  • compare runs across benchmarks
  • aggregate summary metrics without per-benchmark glue code
  • use the same replay and export surface everywhere
  • reason about benchmark regressions with the same qita workflow as normal runs
  • keep starter benchmarks and real benchmark adapters separate without changing the artifact contract
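As a minimal sketch of that cross-benchmark aggregation, assuming result rows are JSONL objects carrying the documented `benchmark`, `success`, and `steps` fields (the sample data here is synthetic, not real benchmark output):

```python
import json
from collections import defaultdict

# Synthetic rows in the shared result shape; a real file would come from
# `qit bench run --output ...`.
rows_jsonl = """\
{"task_id": "t1", "benchmark": "tau-bench", "success": true, "steps": 12}
{"task_id": "t2", "benchmark": "tau-bench", "success": false, "steps": 30}
{"task_id": "g1", "benchmark": "gaia", "success": true, "steps": 7}
"""

def summarize(lines: str) -> dict:
    """Group result rows by benchmark and compute per-benchmark metrics.

    Because every row shares one shape, no per-benchmark glue code is
    needed here: the same loop handles tau-bench and GAIA rows alike.
    """
    grouped = defaultdict(list)
    for line in lines.splitlines():
        if line.strip():
            row = json.loads(line)
            grouped[row["benchmark"]].append(row)
    return {
        bench: {
            "n": len(rs),
            "success_rate": sum(r["success"] for r in rs) / len(rs),
            "mean_steps": sum(r["steps"] for r in rs) / len(rs),
        }
        for bench, rs in grouped.items()
    }

print(summarize(rows_jsonl))
```

The same loop would work unchanged for any benchmark listed above, which is the point of the shared artifact contract.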

What a benchmark run produces

A benchmark run may produce two artifact layers:
  1. a trace directory with manifest.json, events.jsonl, and steps.jsonl
  2. a JSONL file of normalized BenchmarkRunResult rows
Each result row includes fields such as:
  • task_id
  • benchmark
  • split
  • prediction
  • success
  • stop_reason
  • steps
  • latency_seconds
  • token_usage
  • cost
  • trace_run_dir
  • run_spec_ref
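To make the row shape concrete, here is an illustrative sketch of a local mirror of those fields. The field names come from the list above; the types, the `parse_row` helper, and the sample values are assumptions for illustration, not the real `BenchmarkRunResult` class from qitos:

```python
import json
from dataclasses import dataclass

# Hypothetical local mirror of the documented result-row fields.
# Types are assumed; the actual qitos BenchmarkRunResult may differ.
@dataclass
class ResultRow:
    task_id: str
    benchmark: str
    split: str
    prediction: str
    success: bool
    stop_reason: str
    steps: int
    latency_seconds: float
    token_usage: dict
    cost: float
    trace_run_dir: str
    run_spec_ref: str

def parse_row(line: str) -> ResultRow:
    """Parse one JSONL line into a typed row, dropping unknown keys."""
    data = json.loads(line)
    known = ResultRow.__dataclass_fields__.keys()
    return ResultRow(**{k: v for k, v in data.items() if k in known})

# Synthetic example row (values invented for illustration).
sample = json.dumps({
    "task_id": "retail_001",
    "benchmark": "tau-bench",
    "split": "test",
    "prediction": "order refunded",
    "success": True,
    "stop_reason": "finish",
    "steps": 14,
    "latency_seconds": 42.5,
    "token_usage": {"prompt": 1800, "completion": 350},
    "cost": 0.012,
    "trace_run_dir": "./runs/tau_retail_001",
    "run_spec_ref": "runspec-abc123",
})
row = parse_row(sample)
print(row.benchmark, row.success)
```

A typed mirror like this is handy for downstream analysis scripts, since it fails loudly if a row is missing a required field.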

Example

qit bench run \
  --benchmark tau-bench \
  --split test \
  --subset retail \
  --limit 10 \
  --output ./results/tau_retail_test.jsonl \
  --model-name "Qwen/Qwen3-8B"
Then aggregate and inspect:
qit bench eval --input ./results/tau_retail_test.jsonl --json
qita board --logdir ./runs