OSWorld Benchmark Adapter
QitOS now separates three layers clearly:
- framework: DesktopEnv, ActionSpace, EnvironmentAdapter, qita visual debugging
- benchmark: qitos.benchmark.osworld
- recipe: qitos.recipes.desktop.osworld_starter
osworld is no longer implied by the desktop-starter benchmark name.
What lives in qitos.benchmark.osworld
The OSWorld benchmark family owns benchmark-relevant concerns:
- dataset loading from test_all.json and domain/example JSON files
- sample identity and benchmark metadata normalization
- benchmark runtime prepare/finalize hooks
- OSWorld-specific setup / postconfig lifecycle
- evaluator bridge to upstream reference metrics/getters
- scorer output and benchmark-native runtime artifacts
None of these concerns belong in qitos/core or the generic desktop environment.
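To make the dataset-loading and sample-identity concerns concrete, here is a minimal sketch of what such a loader could look like. It is not the qitos.benchmark.osworld implementation. The directory layout (test_all.json mapping each domain to a list of example ids, with one JSON file per example under examples/<domain>/) is an assumption modelled on the upstream OSWorld evaluation examples, and OSWorldSample, load_samples, and the "<domain>/<example_id>" identity format are hypothetical names chosen for illustration.

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Any


@dataclass(frozen=True)
class OSWorldSample:
    """Hypothetical normalized sample record (not a QitOS type)."""

    sample_id: str  # assumed stable identity: "<domain>/<example_id>"
    domain: str
    example_id: str
    instruction: str
    raw: dict[str, Any]  # untouched example JSON, kept for the evaluator bridge


def load_samples(root: Path) -> list[OSWorldSample]:
    """Load every sample listed in test_all.json under `root`.

    Assumed layout (illustrative, verify against your checkout):
      root/test_all.json                       {"<domain>": ["<example_id>", ...]}
      root/examples/<domain>/<example_id>.json
    """
    index = json.loads((root / "test_all.json").read_text(encoding="utf-8"))
    samples: list[OSWorldSample] = []
    for domain in sorted(index):  # sorted for a deterministic sample order
        for example_id in index[domain]:
            path = root / "examples" / domain / f"{example_id}.json"
            raw = json.loads(path.read_text(encoding="utf-8"))
            samples.append(
                OSWorldSample(
                    sample_id=f"{domain}/{example_id}",
                    domain=domain,
                    example_id=example_id,
                    instruction=raw.get("instruction", ""),
                    raw=raw,
                )
            )
    return samples
```

Keeping the raw example JSON on the sample means setup, postconfig, and evaluator code can read benchmark-specific fields without the framework layer needing to know about them.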
What does not live here
These remain framework-level:
- provider-neutral GUI action vocabulary
- DesktopEnv
- multimodal observation contracts
- qita screenshot timeline / replay / overlay
- family preset ownership of protocol/parser/native tool calling
Canonical usage
The starter benchmark remains desktop-starter, which is now the canonical benchmark name. The name desktop is still accepted as a compatibility alias for it.
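The alias rule can be illustrated with a small name-resolution sketch. This is not how QitOS resolves benchmark names internally. canonical_benchmark_name, BENCHMARK_ALIASES, and KNOWN_BENCHMARKS are hypothetical, and treating osworld as a selectable benchmark name is an assumption based on the description of it as a benchmark family.

```python
# Hypothetical registry for illustration only; the real QitOS registry may differ.
BENCHMARK_ALIASES = {"desktop": "desktop-starter"}  # compatibility alias
KNOWN_BENCHMARKS = {"desktop-starter", "osworld"}   # assumed canonical names


def canonical_benchmark_name(name: str) -> str:
    """Map a user-supplied benchmark name to its canonical form."""
    key = name.strip().lower()
    key = BENCHMARK_ALIASES.get(key, key)  # aliases resolve to canonical names
    if key not in KNOWN_BENCHMARKS:
        raise ValueError(f"unknown benchmark: {name!r}")
    return key
```

Under these assumptions, canonical_benchmark_name("desktop") returns "desktop-starter", while "osworld" resolves only to itself and is never implied by either starter name.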
Current expectation
osworld is now an official benchmark family in QitOS, but it should still be understood as a benchmark adapter layer, not as a claim that every desktop runtime detail has reached full OSWorld parity.