Benchmark work in QitOS runs on the same runtime as normal agent runs. Benchmark execution, replay, export, and result aggregation all sit on top of the same core primitives:
  • Task
  • RunSpec (metadata describing how a single run was configured)
  • ExperimentSpec (metadata grouping runs into an experiment)
  • TraceWriter (writes structured run events and steps to disk)
  • BenchmarkRunResult (a normalized result row for one benchmark task)
  • qita

Supported benchmarks

| Benchmark | Domain | Primary metric |
| --- | --- | --- |
| Desktop Starter | Computer use starter baseline | Success / failure taxonomy |
| OSWorld | Desktop / computer-use benchmark adapter | OSWorld evaluator score |
| GAIA | General AI assistant tasks | Exact match |
| Tau-Bench | Tool-agent-user interaction | Reward / pass^k |
| CyBench | CTF-style security evaluation | Guided subtask score |

Canonical benchmark path

The official CLI is:
qit bench run ...
qit bench eval ...
qit bench replay ...
qit bench export ...
The scripts under examples/benchmarks/ still exist, but in v0.3 they are thin wrappers over the same result and trace contract. Benchmark work is organized in three explicit layers:
  • framework: shared runtime, env, harness (the model-facing wiring layer), and qita capabilities
  • benchmark: dataset/runtime/evaluator/scorer integration under qitos.benchmark.*
  • recipe: reproducible baseline methods under qitos.recipes.*
This split lets starter benchmarks, real benchmark adapters, and reusable baselines coexist without leaking into examples/.

Why this matters

Because all benchmark outputs share one shape, you can:
  • compare runs across benchmarks
  • aggregate summary metrics without per-benchmark glue code
  • use the same replay and export surface everywhere
  • reason about benchmark regressions with the same qita workflow as normal runs
  • keep starter benchmarks and real benchmark adapters separate without changing the artifact contract
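
As one illustration of the regression point above, the sketch below compares two sets of result rows by task_id and lists tasks that succeeded in a baseline run but not in a candidate run. It is a minimal sketch, not part of QitOS: the function name find_regressions is hypothetical, and it assumes each row is a dict whose success field is truthy on success.

```python
def find_regressions(baseline_rows, candidate_rows):
    """Return sorted task_ids that passed in the baseline but not in the candidate.

    Hypothetical helper: assumes each row is a dict with a `task_id` key and a
    truthy/falsy `success` value, as suggested by the result-row field list.
    """
    # Map each baseline task to whether it succeeded.
    baseline_ok = {row["task_id"]: bool(row.get("success")) for row in baseline_rows}

    regressions = []
    for row in candidate_rows:
        task_id = row["task_id"]
        # A regression is a task that used to pass and no longer does.
        if baseline_ok.get(task_id, False) and not row.get("success"):
            regressions.append(task_id)
    return sorted(regressions)
```

Because the rows share one shape, the same helper would apply to any benchmark in the table above; tasks missing from the baseline are simply ignored.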

What a benchmark run produces

A benchmark run may produce two artifact layers (persistent output files from a run):
  1. a trace directory with manifest.json, events.jsonl, and steps.jsonl
  2. a JSONL file of normalized BenchmarkRunResult rows
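
A trace directory with the three files named above could be read with nothing more than the standard library. The following is a sketch under assumptions: load_trace and read_jsonl are hypothetical helper names, manifest.json is assumed to hold a single JSON object, and the two .jsonl files are assumed to hold one JSON object per line. The keys inside those objects are not specified here, so the sketch does not interpret them.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Parse one JSON value per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def load_trace(run_dir):
    """Load manifest.json, events.jsonl, and steps.jsonl from a trace directory.

    Hypothetical helper: only the three file names come from the documentation;
    the structure of their contents is left uninterpreted.
    """
    run_dir = Path(run_dir)
    manifest = json.loads((run_dir / "manifest.json").read_text(encoding="utf-8"))
    return {
        "manifest": manifest,
        "events": read_jsonl(run_dir / "events.jsonl"),
        "steps": read_jsonl(run_dir / "steps.jsonl"),
    }
```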
Each result row includes fields such as:
  • task_id
  • benchmark
  • split
  • prediction
  • success
  • stop_reason
  • steps
  • latency_seconds
  • token_usage
  • cost
  • trace_run_dir
  • run_spec_ref
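
For downstream scripts, one line of the results file could be mapped onto a small typed record. This is a sketch, not the actual BenchmarkRunResult class: ResultRow and parse_row are hypothetical names, only the field names come from the list above, and the field types (and the choice to make everything except task_id and benchmark optional) are assumptions.

```python
import json
from dataclasses import dataclass, fields
from typing import Any, Optional


@dataclass
class ResultRow:
    """Illustrative stand-in for one normalized result row (types are assumed)."""
    task_id: str
    benchmark: str
    split: Optional[str] = None
    prediction: Any = None
    success: Optional[bool] = None
    stop_reason: Optional[str] = None
    steps: Any = None
    latency_seconds: Optional[float] = None
    token_usage: Any = None
    cost: Any = None
    trace_run_dir: Optional[str] = None
    run_spec_ref: Any = None


def parse_row(line):
    """Build a ResultRow from one JSONL line, ignoring keys not listed above."""
    data = json.loads(line)
    known = {f.name for f in fields(ResultRow)}
    return ResultRow(**{k: v for k, v in data.items() if k in known})
```

Dropping unknown keys keeps the sketch tolerant of additional fields, since the list above is introduced with "such as" and may not be exhaustive.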

Example

qit bench run \
  --benchmark tau-bench \
  --split test \
  --subset retail \
  --limit 10 \
  --output ./results/tau_retail_test.jsonl \
  --model-name "Qwen/Qwen3-8B"
Then aggregate and inspect:
qit bench eval --input ./results/tau_retail_test.jsonl --json
qita board --logdir ./runs
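
The results file written by the run above is plain JSONL, so it can also be inspected directly. The sketch below groups rows by their benchmark field and reports a task count, success rate, and mean latency. It is not a description of what qit bench eval computes: load_rows and summarize_by_benchmark are hypothetical helpers, and they assume success is truthy on success and latency_seconds is numeric when present.

```python
import json
from collections import defaultdict
from pathlib import Path


def load_rows(path):
    """Read one JSON object per non-empty line from a JSONL results file."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def summarize_by_benchmark(rows):
    """Return {benchmark: {"tasks", "success_rate", "mean_latency_seconds"}}.

    Hypothetical helper: assumes `success` is truthy on success and
    `latency_seconds` is a number when present.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row["benchmark"]].append(row)

    summary = {}
    for name, items in groups.items():
        # Only average over rows that actually report a numeric latency.
        latencies = [
            r["latency_seconds"]
            for r in items
            if isinstance(r.get("latency_seconds"), (int, float))
        ]
        summary[name] = {
            "tasks": len(items),
            "success_rate": sum(1 for r in items if r.get("success")) / len(items),
            "mean_latency_seconds": (
                sum(latencies) / len(latencies) if latencies else None
            ),
        }
    return summary
```

With the example above, this could be called as summarize_by_benchmark(load_rows("./results/tau_retail_test.jsonl")).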