
Desktop Starter Benchmark

desktop-starter is the first official multimodal starter benchmark family in QitOS. It is intentionally scoped as an OSWorld-compatible starter:
  • desktop / computer-use task structure
  • screenshot, accessibility (a11y), OCR, and UI-candidate observations
  • provider-neutral GUI actions
  • unified BenchmarkRunResult rows
  • qita replay / export / visual inspection
It does not claim full official OSWorld parity yet.

Run the starter benchmark

qit bench run \
  --benchmark desktop-starter \
  --split starter \
  --strategy desktop_baseline \
  --model-name qwen-plus \
  --model-family qwen \
  --base-url https://dashscope.aliyuncs.com/compatible-mode/v1 \
  --output ./artifacts/desktop-starter.jsonl

For a deterministic local smoke run:

qit bench run \
  --benchmark desktop-starter \
  --split starter \
  --strategy desktop_smoke \
  --output ./artifacts/desktop-starter-smoke.jsonl
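
If the two runs are launched from a script, the argument lists can be assembled in one place. The following is a minimal sketch: the helper name build_bench_command is hypothetical, and it only reuses the flags and values shown in the commands above. It builds the argument list and prints it; it does not invoke qit.

```python
import shlex
from typing import List, Optional


def build_bench_command(
    strategy: str,
    output: str,
    model_name: Optional[str] = None,
    model_family: Optional[str] = None,
    base_url: Optional[str] = None,
) -> List[str]:
    """Assemble a `qit bench run` argument list for the desktop-starter benchmark.

    Hypothetical helper: only the flags that appear in the commands above are used.
    """
    argv = [
        "qit", "bench", "run",
        "--benchmark", "desktop-starter",
        "--split", "starter",
        "--strategy", strategy,
    ]
    # Model flags are optional; the smoke command above passes none of them.
    for flag, value in (
        ("--model-name", model_name),
        ("--model-family", model_family),
        ("--base-url", base_url),
    ):
        if value is not None:
            argv += [flag, value]
    argv += ["--output", output]
    return argv


smoke = build_bench_command(
    "desktop_smoke", "./artifacts/desktop-starter-smoke.jsonl"
)
baseline = build_bench_command(
    "desktop_baseline",
    "./artifacts/desktop-starter.jsonl",
    model_name="qwen-plus",
    model_family="qwen",
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

# Print shell-quoted commands for review before running them.
print(shlex.join(smoke))
print(shlex.join(baseline))
```

Either list can then be passed to subprocess.run(argv, check=True). Running the smoke command first is a cheap way to confirm the local setup before the model-backed run.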

What the starter benchmark measures

Each result row includes the standard benchmark fields plus desktop-specific metadata:
  • success / stop reason
  • step count
  • action count
  • critic count
  • token usage
  • latency
  • failure tags
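
Because the output is a JSONL file, rows like these can be aggregated with a few lines of Python. This is a sketch under stated assumptions: the key names "task_id", "success", and "steps" are hypothetical stand-ins for the fields listed above, and the sample rows are made up for illustration. Check a real output row for the actual key names before relying on it.

```python
import json
from typing import Any, Dict, Iterable


def summarize(lines: Iterable[str]) -> Dict[str, Any]:
    """Compute task count, success rate, and mean step count from JSONL lines.

    Assumes each row has a boolean "success" and an integer "steps" key
    (hypothetical names, not confirmed against the real schema).
    """
    rows = [json.loads(line) for line in lines if line.strip()]
    if not rows:
        return {"tasks": 0, "success_rate": 0.0, "mean_steps": 0.0}
    successes = sum(1 for row in rows if row.get("success"))
    total_steps = sum(row.get("steps", 0) for row in rows)
    return {
        "tasks": len(rows),
        "success_rate": successes / len(rows),
        "mean_steps": total_steps / len(rows),
    }


# Illustrative rows only; not real benchmark output.
sample = [
    '{"task_id": "demo-1", "success": true, "steps": 4}',
    '{"task_id": "demo-2", "success": false, "steps": 8}',
]
summary = summarize(sample)
print(summary)
```

An open file handle works as the lines argument as well, for example summarize(open("./artifacts/desktop-starter.jsonl", encoding="utf-8")).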
Current failure taxonomy:
  • perception_failure
  • grounding_failure
  • planning_failure
  • action_selection_failure
  • execution_environment_failure
  • stop_completion_failure
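
A per-tag tally is one way to see where a run fails most often. The sketch below hard-codes the six tags listed above and counts them across rows. The "failure_tags" key and the sample rows are assumptions made for illustration; tags outside the taxonomy are collected separately rather than silently dropped.

```python
from typing import Any, Dict, Iterable, List, Tuple

# The six tags from the failure taxonomy listed above.
FAILURE_TAGS = (
    "perception_failure",
    "grounding_failure",
    "planning_failure",
    "action_selection_failure",
    "execution_environment_failure",
    "stop_completion_failure",
)


def tally_failure_tags(
    rows: Iterable[Dict[str, Any]],
) -> Tuple[Dict[str, int], List[str]]:
    """Count known failure tags and collect any tag outside the taxonomy.

    Assumes each row stores its tags as a list under "failure_tags"
    (hypothetical key name).
    """
    counts = {tag: 0 for tag in FAILURE_TAGS}
    unknown: List[str] = []
    for row in rows:
        for tag in row.get("failure_tags", []):
            if tag in counts:
                counts[tag] += 1
            else:
                unknown.append(tag)
    return counts, unknown


# Illustrative rows only; not real benchmark output.
rows = [
    {"success": False, "failure_tags": ["grounding_failure"]},
    {"success": False, "failure_tags": ["grounding_failure", "planning_failure"]},
    {"success": True, "failure_tags": []},
    {"success": False, "failure_tags": ["not_in_taxonomy"]},
]
counts, unknown = tally_failure_tags(rows)
print(counts)
print(unknown)
```

Keeping unknown tags visible makes it easier to notice when the taxonomy changes between releases.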

What makes this an official v0.5 path

The desktop starter benchmark is the first multimodal path where all of the following pieces are in place:
  • benchmark tasks
  • baseline agent
  • unified runner output
  • trace artifacts
  • qita visual inspection
  • docs/tutorial story
Having all of these in place is the release bar for v0.5. The adapter for the full OSWorld benchmark now lives separately under osworld.