What you will learn
- why `qit bench` is the canonical benchmark entrypoint
- how `RunSpec` and `ExperimentSpec` are attached automatically
- what files to expect after a run
- how to evaluate and inspect results with `qita`
Step 1: choose a benchmark
For a dry first pass, use Tau-Bench because it does not require an external dataset download. Point `--runner` at a benchmark example wrapper or at your own runner callback.
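The exact callback signature `qit` expects is not documented here, so the shape below is an assumption; only the output field names come from the result-row description later in this page. A minimal runner maps one task to one result-row dict:

```python
# Hypothetical runner callback -- the signature and the "run_task" name are
# assumptions, not the library's actual API. The returned keys match the
# BenchmarkRunResult fields described in Step 2.
def run_task(task: dict) -> dict:
    """Map one benchmark task to a normalized result-row dict."""
    prediction = "respond"  # placeholder for your agent's actual output
    return {
        "task_id": task["task_id"],
        "benchmark": "tau-bench",
        "split": task.get("split", "test"),
        "prediction": prediction,
        "success": prediction == task.get("expected"),
        "stop_reason": "completed",
        "steps": 1,
    }

row = run_task({"task_id": "tau-001", "expected": "respond"})
print(row["success"])  # → True
```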
Step 2: inspect the result row
Every output line is normalized to `BenchmarkRunResult`.
You should expect fields like:
`task_id`, `benchmark`, `split`, `prediction`, `success`, `stop_reason`, `steps`, `latency_seconds`, `token_usage`, `cost`, `trace_run_dir`, `run_spec_ref`
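If the run writes result rows as JSON lines (the `results.jsonl` file name here is an assumption, not a documented output path), a few lines of Python are enough to inspect a row:

```python
import json

# Sample row using the field names listed above; values are illustrative.
SAMPLE = {
    "task_id": "tau-001", "benchmark": "tau-bench", "split": "test",
    "prediction": "transfer", "success": True, "stop_reason": "completed",
    "steps": 4, "latency_seconds": 2.1, "token_usage": 812, "cost": 0.004,
    "trace_run_dir": "runs/tau-001", "run_spec_ref": "rs-abc",
}

def load_result_rows(path: str) -> list[dict]:
    """Parse one BenchmarkRunResult row (a JSON object) per non-empty line."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

# Round-trip one sample row through a JSONL file.
with open("results.jsonl", "w") as f:
    f.write(json.dumps(SAMPLE) + "\n")

rows = load_result_rows("results.jsonl")
print(rows[0]["task_id"], rows[0]["success"])  # → tau-001 True
```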
Step 3: aggregate metrics
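This page does not name the aggregation command, but the per-row fields from Step 2 are enough to compute headline metrics yourself. A sketch over `success`, `latency_seconds`, and `cost`:

```python
def aggregate(rows: list[dict]) -> dict:
    """Headline metrics over a list of BenchmarkRunResult rows."""
    n = len(rows)
    return {
        "n": n,
        "success_rate": sum(1 for r in rows if r["success"]) / n,
        "mean_latency_seconds": sum(r["latency_seconds"] for r in rows) / n,
        "total_cost": sum(r["cost"] for r in rows),
    }

# Two illustrative rows: one success, one failure.
rows = [
    {"success": True, "latency_seconds": 2.0, "cost": 0.004},
    {"success": False, "latency_seconds": 4.0, "cost": 0.006},
]
metrics = aggregate(rows)
print(metrics["success_rate"], metrics["mean_latency_seconds"])  # → 0.5 3.0
```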
Step 4: inspect the trace
When the run also produced a trace directory, inspect it with `qita`:
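The `qita` commands themselves are not reproduced here. If you just want a quick look at what landed in `trace_run_dir`, a plain directory walk works; the demo layout below is illustrative, not what `qit` actually writes:

```python
from pathlib import Path

def list_trace_files(trace_run_dir: str) -> list[str]:
    """Relative paths of every file under the trace directory, sorted."""
    root = Path(trace_run_dir)
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )

# Illustrative layout -- real trace contents depend on the run.
demo = Path("demo_trace")
(demo / "steps").mkdir(parents=True, exist_ok=True)
(demo / "manifest.json").write_text("{}")
(demo / "steps" / "000.json").write_text("{}")
print(list_trace_files("demo_trace"))  # → ['manifest.json', 'steps/000.json']
```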
Step 5: verify that it is an official run
Open `manifest.json` or the `qita` run overview and confirm these fields exist:
`run_spec`, `experiment_spec`, `official_run`, `git_sha`, `package_version`, `prompt_protocol`, `parser_name`, `tool_manifest`
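The same check can be scripted. The key names come from the list above; that they sit flat at the top level of `manifest.json` is an assumption:

```python
import json

# Fields an official run's manifest should carry (from the list above).
REQUIRED_FIELDS = [
    "run_spec", "experiment_spec", "official_run", "git_sha",
    "package_version", "prompt_protocol", "parser_name", "tool_manifest",
]

def missing_fields(manifest_path: str) -> list[str]:
    """Return required keys absent from the manifest (assumes a flat top level)."""
    with open(manifest_path) as f:
        manifest = json.load(f)
    return [k for k in REQUIRED_FIELDS if k not in manifest]

# Demo manifest deliberately missing one field.
with open("manifest.json", "w") as f:
    json.dump({k: "x" for k in REQUIRED_FIELDS if k != "git_sha"}, f)
print(missing_fields("manifest.json"))  # → ['git_sha']
```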
When to still use examples/benchmarks
Use the examples when you want:
- benchmark-specific agent construction
- a reference implementation for a paper-style setup
- a thin runnable wrapper that already plugs into the official result format
