This tutorial shows the shortest path to an official QitOS benchmark run. The goal is not just to get a score. The goal is to produce a run that you can replay, diff, export, and discuss later.
What you will learn
- why `qit bench` is the canonical benchmark entrypoint
- how `RunSpec` (metadata describing how a single run was configured) and `ExperimentSpec` (metadata grouping runs into an experiment) are attached automatically
- what files to expect after a run
- how to evaluate and inspect results with `qita`
Step 1: choose a benchmark
For a dry first pass, use Tau-Bench, because it does not require an external dataset download. Point `--runner` at a benchmark example wrapper or your own runner callback.
Step 2: inspect the result row
Every output line is normalized to `BenchmarkRunResult` (a standardized result row for one benchmark task).
You should expect fields like:
- `task_id`
- `benchmark`
- `split`
- `prediction`
- `success`
- `stop_reason`
- `steps`
- `latency_seconds`
- `token_usage`
- `cost`
- `trace_run_dir`
- `run_spec_ref`
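To make the shape of a result row concrete, here is a minimal Python sketch that parses one output line into a typed record. It assumes each output line is a JSON object with the fields listed above. The `ResultRow` class, the `parse_row` helper, the field types, and the choice to treat `trace_run_dir` and `run_spec_ref` as optional are all assumptions made for illustration; they are not part of the QitOS API.

```python
import json
from dataclasses import dataclass, fields
from typing import Any, Optional


@dataclass
class ResultRow:
    """Hypothetical mirror of one BenchmarkRunResult row; field types are assumed."""
    task_id: str
    benchmark: str
    split: str
    prediction: Any
    success: bool
    stop_reason: str
    steps: int
    latency_seconds: float
    token_usage: Any
    cost: float
    # Assumed optional: a run may not have produced a trace or a spec reference.
    trace_run_dir: Optional[str] = None
    run_spec_ref: Optional[str] = None


OPTIONAL_FIELDS = {"trace_run_dir", "run_spec_ref"}


def parse_row(line: str) -> ResultRow:
    """Parse one JSON line and fail loudly if an expected field is absent."""
    raw = json.loads(line)
    known = {f.name for f in fields(ResultRow)}
    missing = known - set(raw) - OPTIONAL_FIELDS
    if missing:
        raise ValueError(f"row is missing fields: {sorted(missing)}")
    # Ignore any extra keys so the sketch tolerates additional fields.
    return ResultRow(**{k: v for k, v in raw.items() if k in known})
```

Failing on a missing field, rather than silently defaulting it, makes a malformed row visible before it skews any later analysis.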
Step 3: aggregate metrics
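As a rough illustration of what aggregation over result rows involves, the Python sketch below computes a few summary numbers. It assumes the run output is a JSON Lines file whose rows carry the `success`, `steps`, `latency_seconds`, and `cost` fields. The `aggregate` function and the metric names it returns are hypothetical and do not describe the built-in QitOS aggregation.

```python
import json
from statistics import mean
from typing import Iterable


def aggregate(lines: Iterable[str]) -> dict:
    """Summarize result rows given as JSON lines (blank lines are skipped)."""
    rows = [json.loads(line) for line in lines if line.strip()]
    if not rows:
        return {"tasks": 0}
    return {
        "tasks": len(rows),
        # Fraction of tasks whose `success` field is truthy.
        "success_rate": sum(bool(r["success"]) for r in rows) / len(rows),
        "mean_steps": mean(r["steps"] for r in rows),
        "mean_latency_seconds": mean(r["latency_seconds"] for r in rows),
        "total_cost": sum(r["cost"] for r in rows),
    }
```

Because an open text file iterates line by line, the function can be called directly on a file handle, for example `aggregate(open(path))` for whatever output path the run used.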
Step 4: inspect the trace
When the run also produced a trace directory (a structured log of all run events and steps), inspect it with `qita`.
Step 5: verify that it is an official run
Open `manifest.json` or the `qita` run overview and confirm these fields exist:
- `run_spec`
- `experiment_spec`
- `official_run`
- `git_sha`
- `package_version`
- `prompt_protocol`
- `parser_name`
- `tool_manifest`
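The same check can be scripted. The Python sketch below assumes `manifest.json` is a JSON object with these fields as top-level keys; that layout is an assumption, and the helper names `missing_manifest_fields` and `check_manifest` are hypothetical rather than part of QitOS.

```python
import json
from pathlib import Path

# Fields the tutorial says an official run's manifest should contain.
REQUIRED_FIELDS = [
    "run_spec",
    "experiment_spec",
    "official_run",
    "git_sha",
    "package_version",
    "prompt_protocol",
    "parser_name",
    "tool_manifest",
]


def missing_manifest_fields(manifest: dict) -> list:
    """Return the required field names absent from the manifest (assumed top-level)."""
    return [name for name in REQUIRED_FIELDS if name not in manifest]


def check_manifest(path) -> list:
    """Load a manifest file and report which required fields are missing."""
    manifest = json.loads(Path(path).read_text(encoding="utf-8"))
    return missing_manifest_fields(manifest)
```

An empty list means every field is present. The sketch only checks existence, matching the manual step above; it does not validate the values.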
When to still use examples/benchmarks
Use the examples when you want:
- benchmark-specific agent construction
- a reference implementation for a paper-style setup
- a thin runnable wrapper that already plugs into the official result format
