OSWorld Benchmark Adapter

QitOS separates three layers:

framework: DesktopEnv, ActionSpace, EnvironmentAdapter, qita visual debugging
benchmark: qitos.benchmark.osworld
recipe: qitos.recipes.desktop.osworld_starter

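As a result, the desktop-starter benchmark name no longer implies OSWorld: osworld is its own benchmark family, and desktop-starter is only the starter recipe's benchmark.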

What lives in `qitos.benchmark.osworld`

The OSWorld benchmark family owns benchmark-relevant concerns:

dataset loading from test_all.json and domain/example JSON files
sample identity and benchmark metadata normalization
benchmark runtime prepare/finalize hooks
OSWorld-specific setup / postconfig lifecycle
evaluator bridge to upstream reference metrics/getters
scorer output and benchmark-native runtime artifacts

These pieces should not be pushed into qitos/core or the generic desktop environment.
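
The dataset-loading and identity-normalization concerns listed above can be illustrated with a minimal Python sketch. It is not the adapter's actual implementation: the function name `iter_osworld_samples` and the output keys are hypothetical, and it assumes that test_all.json maps each domain name to a list of example ids, with each example stored at `examples/<domain>/<example_id>.json` under the evaluation root.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_osworld_samples(root: Path) -> Iterator[dict]:
    """Yield one normalized record per example listed in test_all.json.

    Assumption: test_all.json maps domain -> list of example ids, and each
    example lives at <root>/examples/<domain>/<example_id>.json.
    """
    index = json.loads((root / "test_all.json").read_text(encoding="utf-8"))
    # Sort domains so the iteration order is stable across runs.
    for domain, example_ids in sorted(index.items()):
        for example_id in example_ids:
            path = root / "examples" / domain / f"{example_id}.json"
            task = json.loads(path.read_text(encoding="utf-8"))
            yield {
                # Hypothetical identity scheme: domain plus example id.
                "sample_id": f"{domain}/{example_id}",
                "domain": domain,
                "instruction": task.get("instruction", ""),
                # Keep the untouched task so setup/evaluator code can read it.
                "raw": task,
            }
```

Keeping the raw task next to the normalized fields leaves the benchmark-specific setup and evaluator configuration inside the benchmark family rather than leaking it into the generic desktop environment.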

What does not live here

These remain framework-level:

provider-neutral GUI action vocabulary
DesktopEnv
multimodal observation contracts
qita screenshot timeline / replay / overlay
family preset ownership of protocol, parser, and native tool calling

Canonical usage

The starter benchmark remains:

qit bench run \
  --benchmark desktop-starter \
  --split starter \
  --strategy desktop_smoke \
  --output ./artifacts/desktop-starter.jsonl

The real benchmark family is now a separate path:

qit bench run \
  --benchmark osworld \
  --split test \
  --root /path/to/OSWorld/evaluation_examples \
  --strategy osworld_baseline \
  --model-family qwen \
  --model-name qwen-plus \
  --output ./artifacts/osworld.jsonl

desktop is still accepted as a compatibility alias for desktop-starter, but desktop-starter is now the canonical benchmark name.
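
One way to picture the alias behavior is a small lookup that maps accepted names to canonical ones. This is only a sketch: `BENCHMARK_ALIASES`, `CANONICAL_BENCHMARKS`, and `resolve_benchmark_name` are hypothetical names, not part of the QitOS API, and only the desktop to desktop-starter mapping comes from the text above.

```python
# Hypothetical alias table; only this single mapping is stated in the docs.
BENCHMARK_ALIASES = {"desktop": "desktop-starter"}

# Canonical benchmark names mentioned on this page.
CANONICAL_BENCHMARKS = {"desktop-starter", "osworld"}


def resolve_benchmark_name(name: str) -> str:
    """Return the canonical benchmark name for a user-supplied name."""
    canonical = BENCHMARK_ALIASES.get(name, name)
    if canonical not in CANONICAL_BENCHMARKS:
        raise ValueError(f"unknown benchmark: {name!r}")
    return canonical
```

Under this sketch, `desktop` resolves to `desktop-starter`, while `osworld` resolves to itself and is never reached through the starter name.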

Current expectation

osworld is now an official benchmark family in QitOS. It should still be read as a benchmark adapter layer, not as a claim that every desktop runtime detail has reached full OSWorld parity.
