
Third-Party Benchmark Integration

QitOS now treats benchmark integration as a first-class SDK surface. If you are adding a new benchmark family, the default shape is:
  • qitos.benchmark.<family> for dataset loading, benchmark runtime, evaluator, scorer, and benchmark-native artifacts
  • qitos.recipes.* for reproducible baseline methods
  • examples/* only for thin user-facing entrypoints
Do not put benchmark-specific setup, scoring, or dataset logic into qitos/core, DesktopEnv, or scattered example files.

Layer boundaries

Framework layer

Keep these concerns in the framework:
  • AgentModule + Engine
  • ActionSpace
  • EnvironmentAdapter
  • DesktopEnv
  • provider-neutral tool and action vocabularies
  • family presets and harness ownership
  • qita replay, export, compare, screenshot timeline, and overlays
Framework code must stay benchmark-agnostic.

Benchmark layer

Put these concerns in qitos.benchmark.<family>:
  • dataset loading and split logic
  • stable sample identity
  • benchmark runtime prepare/finalize hooks
  • benchmark-specific setup / postconfig
  • evaluator bridge
  • scorer and failure taxonomy
  • benchmark-native artifact payloads
If it depends on dataset files like test_all.json, benchmark-native VM/bootstrap inputs, or upstream evaluator semantics, it belongs here.

Recipe layer

Put baseline methods in qitos.recipes:
  • canonical starter baselines
  • benchmark baselines
  • reproducible comparison methods
Recipes should be reusable from:
  • qit bench
  • docs/tutorials
  • thin examples
  • future report scripts

Required directory shape

For a new benchmark family, create:
qitos/benchmark/<family>/
├── __init__.py
├── adapter.py
├── runtime.py
├── evaluator.py
├── scorer.py
└── runner.py
Optional files are fine, but these roles should be visible and easy to review. If the benchmark has a canonical baseline, also add:
qitos/recipes/benchmarks/<family>.py

Adapter contract

The adapter owns:
  • dataset root resolution
  • record loading
  • split and subset filtering
  • stable task/sample identity
  • task metadata normalization
Each task should carry enough metadata for:
  • benchmark-native evaluation
  • qita inspection
  • reproducible result export
Minimum metadata expectations:
  • benchmark
  • split
  • stable sample identity such as task_id, example_id, or equivalent
  • raw benchmark metadata needed by runtime/evaluator layers
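The normalization step of the adapter contract can be sketched as follows. This is a minimal stand-in, not the real qitos adapter base class: the `BenchTask` dataclass and `normalize_records` helper are assumptions that only illustrate the minimum metadata expectations above.

```python
# Illustrative sketch of adapter-side task normalization; the actual
# qitos.benchmark adapter interface may differ.
from dataclasses import dataclass, field
from typing import Any

@dataclass(frozen=True)
class BenchTask:
    benchmark: str
    split: str
    task_id: str  # stable sample identity
    metadata: dict[str, Any] = field(default_factory=dict)  # raw record

def normalize_records(raw: list[dict], *, benchmark: str, split: str) -> list[BenchTask]:
    """Map raw benchmark records onto tasks with stable identities."""
    tasks = []
    for rec in raw:
        # Accept whichever identity key the upstream dataset uses.
        task_id = rec.get("task_id") or rec.get("example_id") or rec["id"]
        tasks.append(BenchTask(benchmark=benchmark, split=split,
                               task_id=str(task_id), metadata=rec))
    return tasks

tasks = normalize_records([{"example_id": 7, "question": "..."}],
                          benchmark="osworld", split="test")
```

Keeping the raw record in `metadata` is what lets the runtime and evaluator layers consume benchmark-native fields without the adapter having to understand them.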

Runtime hook contract

Use BenchmarkRuntimeHook when the benchmark needs:
  • environment prepare/finalize
  • benchmark-specific setup before the agent acts
  • bootstrap metadata
  • cleanup policy
Examples:
  • OSWorld qcow2/bootstrap and controller readiness
  • benchmark-specific sandbox setup
  • service warmup or post-task teardown
Do not move this logic into DesktopEnv or global engine code unless it is genuinely reusable across benchmark families.
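The prepare/finalize shape can be sketched like this. The base class below is a local stand-in for BenchmarkRuntimeHook, and the method names and task structure are assumptions; consult the real hook interface before copying this.

```python
# Stand-in for the real BenchmarkRuntimeHook base class (assumed shape).
class BenchmarkRuntimeHook:
    def prepare(self, task): ...
    def finalize(self, task, result): ...

class OSWorldRuntimeHook(BenchmarkRuntimeHook):
    """Illustrative OSWorld-style hook: VM bootstrap before the agent
    acts, teardown afterwards. All field names are hypothetical."""

    def prepare(self, task):
        # e.g. boot the qcow2 image and wait for controller readiness,
        # then record bootstrap metadata on the task
        task.setdefault("bootstrap", {})["vm_ready"] = True

    def finalize(self, task, result):
        # e.g. cleanup policy / post-task teardown
        task["bootstrap"]["vm_ready"] = False
```

Because the hook owns this lifecycle, DesktopEnv and the engine never need to learn OSWorld-specific bootstrap semantics.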

Evaluator and scorer contract

Use BenchmarkEvaluator to produce benchmark-native evaluation payloads. Examples:
  • upstream evaluator bridge output
  • benchmark-native score JSON
  • postconfig execution results
Use BenchmarkScorer to map that evaluation payload onto the normalized public row:
  • success
  • stop_reason
  • steps
  • latency_seconds
  • token_usage
  • cost
  • benchmark-specific metadata
The public row should always remain a BenchmarkRunResult, even when the benchmark keeps richer native payloads in metadata.
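The scorer's mapping step can be sketched as a plain function from a native payload to the normalized fields above. The native keys (`score`, `wall_time`, and so on) are invented for illustration; only the output field names follow the contract listed here.

```python
def score_to_public_row(native: dict) -> dict:
    """Map a benchmark-native evaluation payload onto the normalized
    public row. Input key names are illustrative assumptions."""
    return {
        "success": native.get("score", 0.0) >= 1.0,
        "stop_reason": native.get("stop_reason", "completed"),
        "steps": native.get("num_steps", 0),
        "latency_seconds": native.get("wall_time", 0.0),
        "token_usage": native.get("tokens", {}),
        "cost": native.get("cost_usd", 0.0),
        # The richer native payload survives, but only inside metadata.
        "metadata": {"native": native},
    }
```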

Normalized result expectations

Every benchmark run should still produce the shared public row contract:
  • task_id
  • benchmark
  • split
  • prediction
  • success
  • stop_reason
  • steps
  • latency_seconds
  • token_usage
  • cost
  • trace_run_dir
  • run_spec_ref
Benchmark-specific extras belong in metadata, not in a second public result schema.
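As a sketch, the shared row contract above corresponds to a record like the following. The real BenchmarkRunResult lives in the SDK and may carry different types or defaults; this stand-in only enumerates the contract fields.

```python
# Illustrative stand-in for the shared public row; field names follow
# the contract above, types are assumptions.
from dataclasses import dataclass, field
from typing import Any

@dataclass
class BenchmarkRunResult:
    task_id: str
    benchmark: str
    split: str
    prediction: str
    success: bool
    stop_reason: str
    steps: int
    latency_seconds: float
    token_usage: dict[str, int]
    cost: float
    trace_run_dir: str
    run_spec_ref: str
    # Benchmark-specific extras go here, never into a second schema.
    metadata: dict[str, Any] = field(default_factory=dict)
```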

Trace and qita compatibility

Your benchmark family should preserve:
  • RunSpec
  • ExperimentSpec
  • trace directory compatibility
  • qita replay / export / compare
If you add benchmark-native artifacts, make sure they can still be understood alongside:
  • manifest.json
  • events.jsonl
  • steps.jsonl
The benchmark should add detail, not break the shared run contract.
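A cheap way to guard the shared run contract is to check that a trace run directory still carries the three shared artifacts before layering benchmark-native ones on top. The helper below is a sketch, not a qita API:

```python
# Hypothetical sanity check for trace directory compatibility.
from pathlib import Path

SHARED_ARTIFACTS = ("manifest.json", "events.jsonl", "steps.jsonl")

def missing_shared_artifacts(run_dir: str) -> list[str]:
    """Return the shared-contract files absent from a trace run dir."""
    root = Path(run_dir)
    return [name for name in SHARED_ARTIFACTS if not (root / name).exists()]
```

A benchmark-family test can assert this returns an empty list for every produced run, so native artifacts only ever add detail.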

CLI and registration expectations

To make the benchmark official:
  1. Add the family to qitos.benchmark
  2. Register loading and builtin runner resolution in qitos.benchmark.runner
  3. Make it runnable through:
    • qit bench run
    • qit bench eval
    • qit bench replay
    • qit bench export
Examples may remain, but they should only be thin wrappers over the recipe and runner layers.
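Step 2 above is typically a registry lookup. The decorator pattern below is only an illustration of that idea; the actual registration mechanism in qitos.benchmark.runner is an assumption here and should be checked against the source.

```python
# Illustrative family-to-runner registry; not the real
# qitos.benchmark.runner API.
RUNNERS: dict[str, type] = {}

def register_family(name: str):
    """Register a builtin runner class under a benchmark family name."""
    def decorator(cls: type) -> type:
        RUNNERS[name] = cls
        return cls
    return decorator

@register_family("osworld")
class OSWorldRunner:
    """Placeholder runner; the real one wires adapter, hooks, and scorer."""
```

With a registry like this, `qit bench run` only needs the family name to resolve the full adapter/runtime/evaluator/scorer stack.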

Documentation checklist

When adding a benchmark family, update all of:
  • benchmark overview docs
  • one family page
  • CLI reference if benchmark names or strategies changed
  • contributor docs if the integration adds a new runtime/evaluator pattern
  • CHANGELOG.md
  • README progress/news when the change is user-visible
That sync work is part of the implementation, not a follow-up.

Current reference examples

Use these reference implementations from the repo:
  • qitos.benchmark.desktop for a starter benchmark family
  • qitos.benchmark.osworld for a real benchmark adapter path
  • qitos.benchmark.gaia
  • qitos.benchmark.tau_bench
  • qitos.benchmark.cybench
  • qitos.recipes.desktop.osworld_starter
  • qitos.recipes.benchmarks.gaia
  • qitos.recipes.benchmarks.tau_bench
  • qitos.recipes.benchmarks.cybench
These are the canonical shapes future integrations should follow.