# Third-Party Benchmark Integration
QitOS now treats benchmark integration as a first-class SDK surface. If you are adding a new benchmark family, the default shape is:

- `qitos.benchmark.<family>` for dataset loading, benchmark runtime, evaluator, scorer, and benchmark-native artifacts
- `qitos.recipes.*` for reproducible baseline methods
- `examples/*` only for thin user-facing entrypoints

Benchmark-specific logic does not belong in `qitos/core`, `DesktopEnv`, or random example files.
## Layer boundaries

### Framework layer

Keep these concerns in the framework:

- `AgentModule` + `Engine`
- `ActionSpace`
- `EnvironmentAdapter`
- `DesktopEnv`
- provider-neutral tool and action vocabularies
- family presets and harness ownership
- qita replay, export, compare, screenshot timeline, and overlays
### Benchmark layer

Put these concerns in `qitos.benchmark.<family>`:
- dataset loading and split logic
- stable sample identity
- benchmark runtime `prepare`/`finalize` hooks
- benchmark-specific `setup`/`postconfig`
- evaluator bridge
- scorer and failure taxonomy
- benchmark-native artifact payloads
If something touches `test_all.json`, benchmark-native VM/bootstrap inputs, or upstream evaluator semantics, it belongs here.
### Recipe layer

Put baseline methods in `qitos.recipes`:
- canonical starter baselines
- benchmark baselines
- reproducible comparison methods exposed via `qit bench`

Not recipe concerns:

- docs/tutorials
- thin examples
- future report scripts
## Required directory shape

For a new benchmark family, create a `qitos/benchmark/<family>/` package (adapter, runtime hooks, evaluator, scorer) plus matching recipe and example entrypoints.

## Adapter contract
The adapter owns:

- dataset root resolution
- record loading
- split and subset filtering
- stable task/sample identity
- task metadata normalization
- benchmark-native evaluation
- qita inspection
- reproducible result export
Each sample record should carry:

- `benchmark`
- `split`
- stable sample identity such as `task_id`, `example_id`, or equivalent
- raw benchmark metadata needed by runtime/evaluator layers
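The identity and metadata rules above can be sketched as a plain record loader. This is a hypothetical sketch, not the actual QitOS API: the function name `load_samples` and the one-JSON-file-per-split dataset layout are assumptions.

```python
import json
from pathlib import Path
from typing import Any, Iterator


def load_samples(dataset_root: Path, benchmark: str, split: str) -> Iterator[dict[str, Any]]:
    """Yield normalized sample records for one benchmark split.

    Assumed layout: <dataset_root>/<split>.json holding a list of raw
    upstream records.
    """
    raw = json.loads((dataset_root / f"{split}.json").read_text())
    for record in raw:
        # Stable sample identity: prefer task_id, fall back to example_id.
        sample_id = record.get("task_id") or record.get("example_id")
        if sample_id is None:
            raise ValueError(f"record lacks a stable identity: {record!r}")
        yield {
            "task_id": str(sample_id),
            "benchmark": benchmark,
            "split": split,
            # Keep the untouched upstream record for runtime/evaluator layers.
            "metadata": record,
        }
```

The key design point is that the raw record is carried through unmodified under `metadata`, so later layers never re-read the dataset.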
## Runtime hook contract

Use `BenchmarkRuntimeHook` when the benchmark needs:
- environment prepare/finalize
- benchmark-specific setup before the agent acts
- bootstrap metadata
- cleanup policy
- OSWorld qcow2/bootstrap and controller readiness
- benchmark-specific sandbox setup
- service warmup or post-task teardown
Do not push this logic into `DesktopEnv` or global engine code unless it is genuinely reusable across benchmark families.
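As a sketch of the hook shape: the base-class name `BenchmarkRuntimeHook` comes from this guide, but the method names, signatures, and the `OSWorldStyleHook` example below are assumptions.

```python
from typing import Any


class BenchmarkRuntimeHook:
    """Assumed minimal interface; the real base class may differ."""

    def prepare(self, task: dict[str, Any]) -> None: ...
    def finalize(self, task: dict[str, Any]) -> None: ...


class OSWorldStyleHook(BenchmarkRuntimeHook):
    """Illustrative hook: per-task sandbox setup and teardown."""

    def __init__(self) -> None:
        self.events: list[str] = []

    def prepare(self, task: dict[str, Any]) -> None:
        # Benchmark-specific setup before the agent acts, e.g. restoring
        # a VM snapshot and waiting for controller readiness.
        self.events.append(f"prepare:{task['task_id']}")

    def finalize(self, task: dict[str, Any]) -> None:
        # Cleanup policy: tear down services, collect bootstrap metadata.
        self.events.append(f"finalize:{task['task_id']}")
```

The point of the pair is symmetry: whatever `prepare` allocates for a task, `finalize` is responsible for releasing, so the engine never carries benchmark-specific cleanup logic.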
## Evaluator and scorer contract

Use `BenchmarkEvaluator` for benchmark-native payloads.
Examples:
- upstream evaluator bridge output
- benchmark-native score JSON
- postconfig execution results
Use `BenchmarkScorer` to map that evaluation payload onto the normalized public row:

- `success`
- `stop_reason`
- `steps`
- `latency_seconds`
- `token_usage`
- `cost`
- benchmark-specific metadata

The scorer output feeds the shared `BenchmarkRunResult`, even when the benchmark keeps richer native payloads in metadata.
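A scorer along these lines maps the native payload onto the normalized fields. The field names come from the list above; the function name and the assumed payload shape (a numeric `score` plus diagnostics) are illustrative, not the real evaluator bridge contract.

```python
from typing import Any


def score_to_public_row(payload: dict[str, Any]) -> dict[str, Any]:
    """Map a benchmark-native evaluation payload onto the normalized row.

    Assumes the upstream evaluator emits a numeric "score"; adjust the
    success predicate per benchmark.
    """
    score = float(payload.get("score", 0.0))
    return {
        "success": score >= 1.0,  # benchmark-native score -> bool
        "stop_reason": payload.get("stop_reason", "unknown"),
        "steps": int(payload.get("steps", 0)),
        "latency_seconds": float(payload.get("latency_seconds", 0.0)),
        "token_usage": payload.get("token_usage", {}),
        "cost": float(payload.get("cost", 0.0)),
        # Richer native payloads stay under metadata, per the contract.
        "metadata": {"native": payload},
    }
```

Note how the full native payload survives under `metadata["native"]` rather than widening the public row.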
## Normalized result expectations

Every benchmark run should still produce the shared public row contract:

- `task_id`
- `benchmark`
- `split`
- `prediction`
- `success`
- `stop_reason`
- `steps`
- `latency_seconds`
- `token_usage`
- `cost`
- `trace_run_dir`
- `run_spec_ref`

Benchmark-specific extras live in `metadata`, not in a second public result schema.
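The shared row contract can be sketched as a dataclass. The field names are taken from the list above; the types and the class name `PublicRow` are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class PublicRow:
    """Normalized per-task result row; types here are assumed."""

    task_id: str
    benchmark: str
    split: str
    prediction: str
    success: bool
    stop_reason: str
    steps: int
    latency_seconds: float
    token_usage: dict[str, int]
    cost: float
    trace_run_dir: str
    run_spec_ref: str
    # Benchmark-specific extras go here, never into new public fields.
    metadata: dict[str, Any] = field(default_factory=dict)
```

Keeping extras in the trailing `metadata` dict is what lets `qit bench compare` treat rows from different benchmark families uniformly.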
## Trace and qita compatibility

Your benchmark family should preserve:

- `RunSpec`
- `ExperimentSpec`
- trace directory compatibility
- qita replay / export / compare
- `manifest.json`, `events.jsonl`, and `steps.jsonl`
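A quick compatibility check for a trace run directory can look like this. The filenames come from the list above; the helper itself is an illustrative sketch, not a QitOS utility.

```python
from pathlib import Path

# Required trace artifacts, per the compatibility list above.
REQUIRED_TRACE_FILES = ("manifest.json", "events.jsonl", "steps.jsonl")


def missing_trace_files(run_dir: Path) -> list[str]:
    """Return the required trace files absent from a run directory."""
    return [name for name in REQUIRED_TRACE_FILES
            if not (run_dir / name).is_file()]
```

A benchmark family that leaves this check empty after a run stays replayable and exportable through qita.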
## CLI and registration expectations

To make the benchmark official:

- Add the family to `qitos.benchmark`
- Register loading and builtin runner resolution in `qitos.benchmark.runner`
- Make it runnable through `qit bench run`, `qit bench eval`, `qit bench replay`, and `qit bench export`
## Documentation checklist

When adding a benchmark family, update all of:

- benchmark overview docs
- one family page
- CLI reference if benchmark names or strategies changed
- contributor docs if the integration adds a new runtime/evaluator pattern
- `CHANGELOG.md`
- README progress/news when the change is user-visible
## Current reference examples

Use these in the repo as reference implementations:

- `qitos.benchmark.desktop` for a starter benchmark family
- `qitos.benchmark.osworld` for a real benchmark adapter path
- `qitos.benchmark.gaia`
- `qitos.benchmark.tau_bench`
- `qitos.benchmark.cybench`
- `qitos.recipes.desktop.osworld_starter`
- `qitos.recipes.benchmarks.gaia`
- `qitos.recipes.benchmarks.tau_bench`
- `qitos.recipes.benchmarks.cybench`
