Run Your First Desktop Benchmark

This tutorial shows the full v0.5 desktop path:
  1. run the official desktop-starter benchmark
  2. inspect the normalized result rows
  3. open the run in qita

1. Run the starter benchmark

qit bench run \
  --benchmark desktop-starter \
  --split starter \
  --strategy desktop_smoke \
  --output ./artifacts/desktop-starter.jsonl

To run against a real model instead of the smoke strategy, switch to the baseline strategy and supply the model details:
qit bench run \
  --benchmark desktop-starter \
  --split starter \
  --strategy desktop_baseline \
  --model-family qwen \
  --model-name qwen-plus \
  --base-url https://dashscope.aliyuncs.com/compatible-mode/v1 \
  --output ./artifacts/desktop-starter.jsonl
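The run writes one normalized result row per task attempt, as JSON Lines, to the --output path. A minimal Python sketch for peeking at the first few rows before running the evaluator (the field names in the usage comment are assumptions — check what your rows actually contain):

```python
import json

def peek_rows(lines, limit=3):
    """Parse and return the first few result rows from a JSONL stream."""
    rows = []
    for i, line in enumerate(lines):
        if i >= limit:
            break
        rows.append(json.loads(line))
    return rows

# Usage against a real run (field names here are illustrative):
#   with open("./artifacts/desktop-starter.jsonl") as f:
#       for row in peek_rows(f):
#           print(row.get("task_id"), row.get("success"))
```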

2. Evaluate the result rows

qit bench eval --input ./artifacts/desktop-starter.jsonl --json

In the JSON summary, look for:
  • success rate
  • stop reasons
  • failure tag distribution
  • average step count
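If you want the same aggregates in a notebook, here is a rough sketch of how they fall out of the result rows. The field names (`success`, `stop_reason`, `failure_tags`, `steps`) are assumptions for illustration, not the actual row schema:

```python
import json
from collections import Counter

def summarize(rows):
    """Aggregate success rate, stop reasons, failure tags, and step counts.

    Assumes each row has `success`, `stop_reason`, `failure_tags`, and
    `steps` fields -- adjust to whatever your rows actually contain.
    """
    rows = list(rows)
    n = len(rows) or 1
    return {
        "success_rate": sum(1 for r in rows if r.get("success")) / n,
        "stop_reasons": dict(Counter(r.get("stop_reason", "unknown") for r in rows)),
        "failure_tags": dict(Counter(t for r in rows for t in r.get("failure_tags", []))),
        "avg_steps": sum(r.get("steps", 0) for r in rows) / n,
    }

# Usage:
#   with open("./artifacts/desktop-starter.jsonl") as f:
#       print(summarize(json.loads(line) for line in f))
```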

3. Inspect the run in qita

qita board --logdir ./runs

In qita, inspect:
  • the screenshot timeline
  • the current step screenshot + overlay
  • the chosen desktop action
  • whether grounding metadata existed
  • whether the critic forced retries

That completes the core v0.5 research loop: benchmark -> trace -> visual failure analysis. When you are ready for the real benchmark adapter rather than the starter pack, switch to --benchmark osworld.