Benchmark work in QitOS runs on the same runtime as normal agent runs. Benchmark execution, replay, export, and result aggregation all sit on top of the same core primitives:
- TaskRunSpec (metadata describing how a single run was configured)
- ExperimentSpec (metadata grouping runs into an experiment)
- TraceWriter (writes structured run events and steps to disk)
- BenchmarkRunResult (a normalized result row for one benchmark task)
- qita
Supported benchmarks
| Benchmark | Domain | Primary metric |
|---|---|---|
| Desktop Starter | Computer use starter baseline | Success / failure taxonomy |
| OSWorld | Desktop / computer-use benchmark adapter | OSWorld evaluator score |
| GAIA | General AI assistant tasks | Exact match |
| Tau-Bench | Tool-agent-user interaction | Reward / pass^k |
| CyBench | CTF-style security evaluation | Guided subtask score |
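The pass^k metric listed for Tau-Bench is commonly described as the probability that k independent trials of the same task all succeed. A minimal sketch of the usual unbiased estimator follows. The function names and the sample data are hypothetical, and whether the QitOS adapter computes the metric exactly this way is an assumption.

```python
from math import comb


def pass_hat_k(num_trials: int, num_successes: int, k: int) -> float:
    """Estimate the chance that k trials drawn from num_trials all succeed."""
    if not 0 < k <= num_trials:
        raise ValueError("k must be between 1 and num_trials")
    # comb() returns 0 when num_successes < k, so the estimate becomes 0.0.
    return comb(num_successes, k) / comb(num_trials, k)


def mean_pass_hat_k(per_task: list[tuple[int, int]], k: int) -> float:
    """Average the per-task estimate over (num_trials, num_successes) pairs."""
    return sum(pass_hat_k(n, c, k) for n, c in per_task) / len(per_task)


# Hypothetical data: two tasks, four trials each.
tasks = [(4, 3), (4, 4)]
score = mean_pass_hat_k(tasks, k=2)
print(score)
```

With k = 1 the estimator reduces to the plain success rate, which makes it easy to sanity-check against the success field of the result rows.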
Canonical benchmark path
The official CLI is:

Examples under examples/benchmarks/ still exist, but in v0.3 they are thin wrappers over the same result and trace contract.
Benchmark work is organized in three explicit layers:
- framework: shared runtime, env, harness (the model-facing wiring layer), and qita capabilities
- benchmark: dataset/runtime/evaluator/scorer integration under qitos.benchmark.*
- recipe: reproducible baseline methods under qitos.recipes.*
examples/.
Why this matters
Because all benchmark outputs share one shape, you can:
- compare runs across benchmarks
- aggregate summary metrics without per-benchmark glue code
- use the same replay and export surface everywhere
- reason about benchmark regressions with the same qita workflow as normal runs
- keep starter benchmarks and real benchmark adapters separate without changing the artifact contract
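As a rough illustration of aggregating without per-benchmark glue code, the sketch below groups normalized result rows by benchmark and computes a success rate and mean step count. It uses only the standard library and the benchmark, success, and steps field names from the result rows. The value types (boolean success, integer steps), the summarize helper, and the sample rows are assumptions made for the example, not part of the QitOS API.

```python
import json
from collections import defaultdict
from io import StringIO


def summarize(lines):
    """Group result rows by benchmark; return run count, success rate, mean steps."""
    totals = defaultdict(lambda: {"runs": 0, "successes": 0, "steps": 0})
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the JSONL input
        row = json.loads(line)
        bucket = totals[row["benchmark"]]
        bucket["runs"] += 1
        bucket["successes"] += 1 if row.get("success") else 0
        bucket["steps"] += int(row.get("steps") or 0)
    return {
        name: {
            "runs": b["runs"],
            "success_rate": b["successes"] / b["runs"],
            "mean_steps": b["steps"] / b["runs"],
        }
        for name, b in totals.items()
    }


# Hypothetical rows; a real file would be produced by a benchmark run.
sample = StringIO(
    '{"task_id": "t1", "benchmark": "gaia", "success": true, "steps": 4}\n'
    '{"task_id": "t2", "benchmark": "gaia", "success": false, "steps": 6}\n'
    '{"task_id": "t3", "benchmark": "tau-bench", "success": true, "steps": 3}\n'
)
summary = summarize(sample)
print(summary)
```

The same function works on an open file handle, since it only iterates over lines.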
What a benchmark run produces
A benchmark run may produce two artifact layers (persistent output files from a run):
- a trace directory with manifest.json, events.jsonl, and steps.jsonl
- a JSONL file of normalized BenchmarkRunResult rows
Each row carries the following fields:
- task_id
- benchmark
- split
- prediction
- success
- stop_reason
- steps
- latency_seconds
- token_usage
- cost
- trace_run_dir
- run_spec_ref
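To make the link between the two artifact layers concrete, here is a minimal sketch that parses one result row and derives the expected trace file paths from trace_run_dir. The ResultRow dataclass is an illustrative stand-in, not the QitOS BenchmarkRunResult class. The field types, the parse_row and trace_paths helpers, and the assumption that the three trace files sit directly inside trace_run_dir are all hypothetical.

```python
import json
from dataclasses import dataclass, fields
from pathlib import Path
from typing import Any, Optional


@dataclass
class ResultRow:
    """Illustrative stand-in for one normalized row (types are assumed)."""
    task_id: str
    benchmark: str
    split: Optional[str] = None
    prediction: Any = None
    success: Optional[bool] = None
    stop_reason: Optional[str] = None
    steps: Optional[int] = None
    latency_seconds: Optional[float] = None
    token_usage: Any = None
    cost: Optional[float] = None
    trace_run_dir: Optional[str] = None
    run_spec_ref: Optional[str] = None


TRACE_FILES = ("manifest.json", "events.jsonl", "steps.jsonl")


def parse_row(line: str) -> ResultRow:
    """Parse one JSONL line, ignoring keys this sketch does not know about."""
    data = json.loads(line)
    known = {f.name for f in fields(ResultRow)}
    return ResultRow(**{k: v for k, v in data.items() if k in known})


def trace_paths(row: ResultRow) -> dict:
    """Map each trace file name to its expected path; empty if no trace dir."""
    if not row.trace_run_dir:
        return {}
    base = Path(row.trace_run_dir)
    return {name: base / name for name in TRACE_FILES}


# Hypothetical row for illustration only.
row = parse_row(
    '{"task_id": "t1", "benchmark": "gaia", "success": true, '
    '"trace_run_dir": "runs/t1", "extra": 1}'
)
paths = trace_paths(row)
print(paths)
```

Keeping the trace location on the row means a failing task can be traced back to its events and steps without a separate index.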
