## Supported benchmarks
| Benchmark | Domain | Primary metric |
|---|---|---|
| Desktop Starter | Computer use starter baseline | Success / failure taxonomy |
| OSWorld | Desktop / computer-use benchmark adapter | OSWorld evaluator score |
| GAIA | General AI assistant tasks | Exact match |
| Tau-Bench | Tool-agent-user interaction | Reward / pass^k |
| CyBench | CTF-style security evaluation | Guided subtask score |
## Canonical benchmark path
The official CLI is the canonical entry point. The scripts under `examples/benchmarks/` still exist, but in v0.3 they are thin wrappers over the same result and trace contract.
QitOS now also keeps benchmark work in three explicit layers:

- framework: shared runtime, env, harness, and qita capabilities
- benchmark: dataset/runtime/evaluator/scorer integration under `qitos.benchmark.*`
- recipe: reproducible baseline methods under `qitos.recipes.*`

Thin wrapper scripts remain under `examples/`.
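A hypothetical source layout for the three layers might look like the sketch below; only `qitos.benchmark.*`, `qitos.recipes.*`, and `examples/` are named in the docs, so the individual module names are assumptions:

```text
qitos/
  runtime/          # framework layer: shared runtime, env, harness, qita
  benchmark/
    gaia/           # dataset / runtime / evaluator / scorer integration
    osworld/
    taubench/
  recipes/
    gaia_baseline/  # reproducible baseline methods
examples/
  benchmarks/       # thin wrappers over the same result and trace contract
```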
## Why this matters
Because all benchmark outputs share one shape, you can:

- compare runs across benchmarks
- aggregate summary metrics without per-benchmark glue code
- use the same replay and export surface everywhere
- reason about benchmark regressions with the same qita workflow as normal runs
- keep starter benchmarks and real benchmark adapters separate without changing the artifact contract
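For example, because every run emits the same normalized row shape, cross-benchmark aggregation needs no per-benchmark glue code. The sketch below assumes the documented field names (`benchmark`, `success`) but invents the row values:

```python
import json
from collections import defaultdict

# Hypothetical BenchmarkRunResult rows, as they might appear in the
# normalized JSONL output. Field names follow the contract; values are made up.
rows_jsonl = """\
{"task_id": "t1", "benchmark": "gaia", "split": "test", "success": true}
{"task_id": "t2", "benchmark": "gaia", "split": "test", "success": false}
{"task_id": "t3", "benchmark": "osworld", "split": "test", "success": true}
"""

# benchmark -> [successes, total runs]; identical logic for every benchmark.
totals = defaultdict(lambda: [0, 0])
for line in rows_jsonl.splitlines():
    row = json.loads(line)
    totals[row["benchmark"]][0] += int(row["success"])
    totals[row["benchmark"]][1] += 1

for name, (ok, n) in sorted(totals.items()):
    print(f"{name}: {ok}/{n} = {ok / n:.2f}")
# → gaia: 1/2 = 0.50
# → osworld: 1/1 = 1.00
```

The same loop works unchanged whether the rows came from GAIA, OSWorld, or a starter benchmark, which is the point of the shared contract.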
## What a benchmark run produces
A benchmark run may produce two artifact layers:

- a trace directory with `manifest.json`, `events.jsonl`, and `steps.jsonl`
- a JSONL file of normalized `BenchmarkRunResult` rows
Each `BenchmarkRunResult` row carries: `task_id`, `benchmark`, `split`, `prediction`, `success`, `stop_reason`, `steps`, `latency_seconds`, `token_usage`, `cost`, `trace_run_dir`, and `run_spec_ref`.
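A minimal sketch of such a row, assuming plausible types for each documented field (the field names are from the contract; the types and sample values are assumptions):

```python
from dataclasses import dataclass, asdict
from typing import Optional
import json

@dataclass
class BenchmarkRunResult:
    """One normalized result row; serialized as one JSONL line per task."""
    task_id: str
    benchmark: str
    split: str
    prediction: str
    success: bool
    stop_reason: str
    steps: int
    latency_seconds: float
    token_usage: dict           # e.g. prompt/completion token counts
    cost: float
    trace_run_dir: Optional[str] = None   # links the row back to its trace
    run_spec_ref: Optional[str] = None    # reference to the originating RunSpec

row = BenchmarkRunResult(
    task_id="gaia-0001", benchmark="gaia", split="validation",
    prediction="42", success=True, stop_reason="final_answer",
    steps=7, latency_seconds=12.4,
    token_usage={"prompt": 1500, "completion": 320}, cost=0.012,
    trace_run_dir="runs/gaia-0001",
)
line = json.dumps(asdict(row))  # one JSONL line of the results file
```

The `trace_run_dir` field is what ties the flat results file back to the richer trace directory, so replay and export tools can start from either layer.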
