Run Your First Desktop Benchmark

This tutorial shows the full v0.5 desktop path:
  1. run the official desktop-starter benchmark
  2. inspect the normalized result rows
  3. open the run in qita

1. Run the starter benchmark

qit bench run \
  --benchmark desktop-starter \
  --split starter \
  --strategy desktop_smoke \
  --output ./artifacts/desktop-starter.jsonl

To run against a real model instead of the smoke strategy, switch to the baseline strategy and supply the model details:
qit bench run \
  --benchmark desktop-starter \
  --split starter \
  --strategy desktop_baseline \
  --model-family qwen \
  --model-name qwen-plus \
  --base-url https://dashscope.aliyuncs.com/compatible-mode/v1 \
  --output ./artifacts/desktop-starter.jsonl
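The run writes one normalized result row per task attempt, as JSON Lines, to the --output path. A minimal Python sketch for peeking at the first few rows before running the evaluator (the field names in the usage comment are assumptions — check what your rows actually contain):

```python
import json

def peek_rows(lines, limit=3):
    """Parse and return the first few result rows from a JSONL stream."""
    rows = []
    for i, line in enumerate(lines):
        if i >= limit:
            break
        rows.append(json.loads(line))
    return rows

# Usage against a real run (field names here are illustrative):
#   with open("./artifacts/desktop-starter.jsonl") as f:
#       for row in peek_rows(f):
#           print(row.get("task_id"), row.get("success"))
```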

2. Evaluate the result rows

qit bench eval --input ./artifacts/desktop-starter.jsonl --json

In the JSON summary, look for:
  • success rate
  • stop reasons
  • failure tag distribution
  • average step count
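If you want the same aggregates in a notebook, here is a rough sketch of how they fall out of the result rows. The field names (`success`, `stop_reason`, `failure_tags`, `steps`) are assumptions for illustration, not the actual row schema:

```python
import json
from collections import Counter

def summarize(rows):
    """Aggregate success rate, stop reasons, failure tags, and step counts.

    Assumes each row has `success`, `stop_reason`, `failure_tags`, and
    `steps` fields -- adjust to whatever your rows actually contain.
    """
    rows = list(rows)
    n = len(rows) or 1
    return {
        "success_rate": sum(1 for r in rows if r.get("success")) / n,
        "stop_reasons": dict(Counter(r.get("stop_reason", "unknown") for r in rows)),
        "failure_tags": dict(Counter(t for r in rows for t in r.get("failure_tags", []))),
        "avg_steps": sum(r.get("steps", 0) for r in rows) / n,
    }

# Usage:
#   with open("./artifacts/desktop-starter.jsonl") as f:
#       print(summarize(json.loads(line) for line in f))
```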

3. Inspect the run in qita

qita board --logdir ./runs

In qita, inspect:
  • the screenshot timeline
  • the current step screenshot + overlay
  • the chosen desktop action
  • whether grounding metadata existed
  • whether the critic forced retries

That completes the core v0.5 research loop: benchmark -> trace -> visual failure analysis. When you are ready for the real benchmark adapter rather than the starter pack, switch to --benchmark osworld.