
Third-Party Benchmark Integration

QitOS now treats benchmark integration as a first-class SDK surface. If you are adding a new benchmark family, the default shape is:
  • qitos.benchmark.<family> for dataset loading, benchmark runtime, evaluator, scorer, and benchmark-native artifacts
  • qitos.recipes.* for reproducible baseline methods
  • examples/* only for thin user-facing entrypoints
Do not put benchmark-specific setup, scoring, or dataset logic into qitos/core, DesktopEnv, or scattered example files.

Layer boundaries

Framework layer

Keep these concerns in the framework:
  • AgentModule + Engine
  • ActionSpace
  • EnvironmentAdapter
  • DesktopEnv
  • provider-neutral tool and action vocabularies
  • family presets and harness ownership
  • qita replay, export, compare, screenshot timeline, and overlays
Framework code must stay benchmark-agnostic.

Benchmark layer

Put these concerns in qitos.benchmark.<family>:
  • dataset loading and split logic
  • stable sample identity
  • benchmark runtime prepare/finalize hooks
  • benchmark-specific setup / postconfig
  • evaluator bridge
  • scorer and failure taxonomy
  • benchmark-native artifact payloads
If it depends on dataset files like test_all.json, benchmark-native VM/bootstrap inputs, or upstream evaluator semantics, it belongs here.

Recipe layer

Put baseline methods in qitos.recipes:
  • canonical starter baselines
  • benchmark baselines
  • reproducible comparison methods
Recipes should be reusable from:
  • qit bench
  • docs/tutorials
  • thin examples
  • future report scripts

Required directory shape

For a new benchmark family, create:
qitos/benchmark/<family>/
├── __init__.py
├── adapter.py
├── runtime.py
├── evaluator.py
├── scorer.py
└── runner.py
Optional files are fine, but these roles should be visible and easy to review. If the benchmark has a canonical baseline, also add:
qitos/recipes/benchmarks/<family>.py

Adapter contract

The adapter owns:
  • dataset root resolution
  • record loading
  • split and subset filtering
  • stable task/sample identity
  • task metadata normalization
Each task should carry enough metadata for:
  • benchmark-native evaluation
  • qita inspection
  • reproducible result export
Minimum metadata expectations:
  • benchmark
  • split
  • stable sample identity such as task_id, example_id, or equivalent
  • raw benchmark metadata needed by runtime/evaluator layers
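The normalization step of the adapter contract can be sketched as follows. This is a minimal stand-in, not the real qitos adapter base class: the `BenchTask` dataclass and `normalize_records` helper are assumptions that only illustrate the minimum metadata expectations above.

```python
# Illustrative sketch of adapter-side task normalization; the actual
# qitos.benchmark adapter interface may differ.
from dataclasses import dataclass, field
from typing import Any

@dataclass(frozen=True)
class BenchTask:
    benchmark: str
    split: str
    task_id: str  # stable sample identity
    metadata: dict[str, Any] = field(default_factory=dict)  # raw record

def normalize_records(raw: list[dict], *, benchmark: str, split: str) -> list[BenchTask]:
    """Map raw benchmark records onto tasks with stable identities."""
    tasks = []
    for rec in raw:
        # Accept whichever identity key the upstream dataset uses.
        task_id = rec.get("task_id") or rec.get("example_id") or rec["id"]
        tasks.append(BenchTask(benchmark=benchmark, split=split,
                               task_id=str(task_id), metadata=rec))
    return tasks

tasks = normalize_records([{"example_id": 7, "question": "..."}],
                          benchmark="osworld", split="test")
```

Keeping the raw record in `metadata` is what lets the runtime and evaluator layers consume benchmark-native fields without the adapter having to understand them.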

Runtime hook contract

Use BenchmarkRuntimeHook when the benchmark needs:
  • environment prepare/finalize
  • benchmark-specific setup before the agent acts
  • bootstrap metadata
  • cleanup policy
Examples:
  • OSWorld qcow2/bootstrap and controller readiness
  • benchmark-specific sandbox setup
  • service warmup or post-task teardown
Do not move this logic into DesktopEnv or global engine code unless it is genuinely reusable across benchmark families.
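The prepare/finalize shape can be sketched like this. The base class below is a local stand-in for BenchmarkRuntimeHook, and the method names and task structure are assumptions; consult the real hook interface before copying this.

```python
# Stand-in for the real BenchmarkRuntimeHook base class (assumed shape).
class BenchmarkRuntimeHook:
    def prepare(self, task): ...
    def finalize(self, task, result): ...

class OSWorldRuntimeHook(BenchmarkRuntimeHook):
    """Illustrative OSWorld-style hook: VM bootstrap before the agent
    acts, teardown afterwards. All field names are hypothetical."""

    def prepare(self, task):
        # e.g. boot the qcow2 image and wait for controller readiness,
        # then record bootstrap metadata on the task
        task.setdefault("bootstrap", {})["vm_ready"] = True

    def finalize(self, task, result):
        # e.g. cleanup policy / post-task teardown
        task["bootstrap"]["vm_ready"] = False
```

Because the hook owns this lifecycle, DesktopEnv and the engine never need to learn OSWorld-specific bootstrap semantics.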

Evaluator and scorer contract

Use BenchmarkEvaluator to produce benchmark-native evaluation payloads. Examples:
  • upstream evaluator bridge output
  • benchmark-native score JSON
  • postconfig execution results
Use BenchmarkScorer to map that evaluation payload onto the normalized public row:
  • success
  • stop_reason
  • steps
  • latency_seconds
  • token_usage
  • cost
  • benchmark-specific metadata
The public row should always remain a BenchmarkRunResult, even when the benchmark keeps richer native payloads in metadata.
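The scorer's mapping step can be sketched as a plain function from a native payload to the normalized fields above. The native keys (`score`, `wall_time`, and so on) are invented for illustration; only the output field names follow the contract listed here.

```python
def score_to_public_row(native: dict) -> dict:
    """Map a benchmark-native evaluation payload onto the normalized
    public row. Input key names are illustrative assumptions."""
    return {
        "success": native.get("score", 0.0) >= 1.0,
        "stop_reason": native.get("stop_reason", "completed"),
        "steps": native.get("num_steps", 0),
        "latency_seconds": native.get("wall_time", 0.0),
        "token_usage": native.get("tokens", {}),
        "cost": native.get("cost_usd", 0.0),
        # The richer native payload survives, but only inside metadata.
        "metadata": {"native": native},
    }
```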

Normalized result expectations

Every benchmark run should still produce the shared public row contract:
  • task_id
  • benchmark
  • split
  • prediction
  • success
  • stop_reason
  • steps
  • latency_seconds
  • token_usage
  • cost
  • trace_run_dir
  • run_spec_ref
Benchmark-specific extras belong in metadata, not in a second public result schema.
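As a sketch, the shared row contract above corresponds to a record like the following. The real BenchmarkRunResult lives in the SDK and may carry different types or defaults; this stand-in only enumerates the contract fields.

```python
# Illustrative stand-in for the shared public row; field names follow
# the contract above, types are assumptions.
from dataclasses import dataclass, field
from typing import Any

@dataclass
class BenchmarkRunResult:
    task_id: str
    benchmark: str
    split: str
    prediction: str
    success: bool
    stop_reason: str
    steps: int
    latency_seconds: float
    token_usage: dict[str, int]
    cost: float
    trace_run_dir: str
    run_spec_ref: str
    # Benchmark-specific extras go here, never into a second schema.
    metadata: dict[str, Any] = field(default_factory=dict)
```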

Trace and qita compatibility

Your benchmark family should preserve:
  • RunSpec
  • ExperimentSpec
  • trace directory compatibility
  • qita replay / export / compare
If you add benchmark-native artifacts, make sure they can still be understood alongside:
  • manifest.json
  • events.jsonl
  • steps.jsonl
The benchmark should add detail, not break the shared run contract.
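A cheap way to guard the shared run contract is to check that a trace run directory still carries the three shared artifacts before layering benchmark-native ones on top. The helper below is a sketch, not a qita API:

```python
# Hypothetical sanity check for trace directory compatibility.
from pathlib import Path

SHARED_ARTIFACTS = ("manifest.json", "events.jsonl", "steps.jsonl")

def missing_shared_artifacts(run_dir: str) -> list[str]:
    """Return the shared-contract files absent from a trace run dir."""
    root = Path(run_dir)
    return [name for name in SHARED_ARTIFACTS if not (root / name).exists()]
```

A benchmark-family test can assert this returns an empty list for every produced run, so native artifacts only ever add detail.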

CLI and registration expectations

To make the benchmark official:
  1. Add the family to qitos.benchmark
  2. Register loading and builtin runner resolution in qitos.benchmark.runner
  3. Make it runnable through:
    • qit bench run
    • qit bench eval
    • qit bench replay
    • qit bench export
Examples may remain, but they should only be thin wrappers over the recipe and runner layers.
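Step 2 above is typically a registry lookup. The decorator pattern below is only an illustration of that idea; the actual registration mechanism in qitos.benchmark.runner is an assumption here and should be checked against the source.

```python
# Illustrative family-to-runner registry; not the real
# qitos.benchmark.runner API.
RUNNERS: dict[str, type] = {}

def register_family(name: str):
    """Register a builtin runner class under a benchmark family name."""
    def decorator(cls: type) -> type:
        RUNNERS[name] = cls
        return cls
    return decorator

@register_family("osworld")
class OSWorldRunner:
    """Placeholder runner; the real one wires adapter, hooks, and scorer."""
```

With a registry like this, `qit bench run` only needs the family name to resolve the full adapter/runtime/evaluator/scorer stack.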

Documentation checklist

When adding a benchmark family, update all of:
  • benchmark overview docs
  • one family page
  • CLI reference if benchmark names or strategies changed
  • contributor docs if the integration adds a new runtime/evaluator pattern
  • CHANGELOG.md
  • README progress/news when the change is user-visible
That sync work is part of the implementation, not a follow-up.

Current reference examples

Use these reference implementations from the repo:
  • qitos.benchmark.desktop for a starter benchmark family
  • qitos.benchmark.osworld for a real benchmark adapter path
  • qitos.benchmark.gaia
  • qitos.benchmark.tau_bench
  • qitos.benchmark.cybench
  • qitos.recipes.desktop.osworld_starter
  • qitos.recipes.benchmarks.gaia
  • qitos.recipes.benchmarks.tau_bench
  • qitos.recipes.benchmarks.cybench
These are the canonical shapes future integrations should follow.