
OSWorld Benchmark Adapter

QitOS now separates three layers clearly:
  • framework: DesktopEnv, ActionSpace, EnvironmentAdapter, qita visual debugging
  • benchmark: qitos.benchmark.osworld
  • recipe: qitos.recipes.desktop.osworld_starter
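As a result, the desktop-starter benchmark name no longer implies OSWorld: osworld is its own benchmark family, and desktop-starter refers only to the starter recipe's smoke benchmark.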

What lives in qitos.benchmark.osworld

The OSWorld benchmark family owns benchmark-relevant concerns:
  • dataset loading from test_all.json and domain/example JSON files
  • sample identity and benchmark metadata normalization
  • benchmark runtime prepare/finalize hooks
  • OSWorld-specific setup / postconfig lifecycle
  • evaluator bridge to upstream reference metrics/getters
  • scorer output and benchmark-native runtime artifacts
These pieces should not be pushed into qitos/core or the generic desktop environment.
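To make the dataset-loading and sample-identity bullets concrete, here is a minimal sketch of how such a loader might look. It is not the qitos.benchmark.osworld implementation. It assumes the layout used by the upstream OSWorld evaluation_examples directory, where test_all.json maps each domain to a list of example IDs and each example lives at examples/<domain>/<example_id>.json. The names OSWorldSample and iter_osworld_samples are hypothetical.

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Iterator


@dataclass(frozen=True)
class OSWorldSample:
    """Hypothetical normalized sample record (not the actual QitOS type)."""

    sample_id: str    # stable identity of the form "<domain>/<example_id>"
    domain: str
    example_id: str
    instruction: str  # task instruction, empty if the example omits it
    raw: dict         # untouched example JSON, kept for evaluator hooks


def iter_osworld_samples(
    root: Path, index_name: str = "test_all.json"
) -> Iterator[OSWorldSample]:
    """Yield one sample per (domain, example_id) pair listed in the index.

    Assumed layout: <root>/<index_name> is {"<domain>": ["<example_id>", ...]}
    and each example is stored at <root>/examples/<domain>/<example_id>.json.
    """
    index = json.loads((root / index_name).read_text(encoding="utf-8"))
    for domain, example_ids in index.items():
        for example_id in example_ids:
            path = root / "examples" / domain / f"{example_id}.json"
            raw = json.loads(path.read_text(encoding="utf-8"))
            yield OSWorldSample(
                sample_id=f"{domain}/{example_id}",
                domain=domain,
                example_id=example_id,
                instruction=raw.get("instruction", ""),
                raw=raw,
            )
```

Keeping the raw example JSON alongside the normalized fields means benchmark-specific setup and evaluator data can stay inside the benchmark layer instead of leaking into the generic desktop environment.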

What does not live here

These remain framework-level:
  • provider-neutral GUI action vocabulary
  • DesktopEnv
  • multimodal observation contracts
  • qita screenshot timeline / replay / overlay
  • family preset ownership of protocol/parser/native tool calling

Canonical usage

The starter benchmark remains:
qit bench run \
  --benchmark desktop-starter \
  --split starter \
  --strategy desktop_smoke \
  --output ./artifacts/desktop-starter.jsonl
The real benchmark family is now a separate path:
qit bench run \
  --benchmark osworld \
  --split test \
  --root /path/to/OSWorld/evaluation_examples \
  --strategy osworld_baseline \
  --model-family qwen \
  --model-name qwen-plus \
  --output ./artifacts/osworld.jsonl
desktop is still accepted as a compatibility alias for desktop-starter, but desktop-starter is now the canonical benchmark name.
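The alias rule can be sketched as a small name-normalization step. This is an illustration only, not the QitOS CLI code: BENCHMARK_ALIASES, KNOWN_BENCHMARKS, and canonical_benchmark_name are hypothetical names, and the set of known benchmarks is limited to the two mentioned on this page.

```python
# Hypothetical alias table: legacy name -> canonical benchmark name.
BENCHMARK_ALIASES = {"desktop": "desktop-starter"}

# Only the two benchmark names discussed on this page (assumption).
KNOWN_BENCHMARKS = {"desktop-starter", "osworld"}


def canonical_benchmark_name(name: str) -> str:
    """Map a user-supplied benchmark name to its canonical form."""
    key = name.strip().lower()
    key = BENCHMARK_ALIASES.get(key, key)  # resolve compatibility aliases
    if key not in KNOWN_BENCHMARKS:
        raise ValueError(f"unknown benchmark: {name!r}")
    return key
```

Under this sketch, desktop and desktop-starter resolve to the same benchmark, while osworld is never reachable through the desktop alias.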

Current expectation

osworld is now an official benchmark family in QitOS. Treat it as a benchmark adapter layer, not as a claim that every desktop runtime detail has reached full OSWorld parity.