- `DesktopEnv` for screenshot + accessibility + terminal observations
- `qitos.kit.tool.gui` for atomic GUI actions
- `ComputerUseToolSet` / `computer_use_tools()` for composition-first authoring
- `desktop_actions_json_v1` and `desktop_actions_xml_v1` for protocol-aware scaffolding
- OpenAI-compatible multimodal image input for the current screenshot turn
## Why this lane exists
The official OpenAI computer-use APIs are useful, but they are also provider-specific. QitOS takes a different default path for research:

- keep the model input on the existing OpenAI-compatible image-input lane,
- keep the output contract in QitOS protocols and parsers,
- keep the desktop runtime in a provider-neutral environment adapter.
This keeps the same lane usable across:

- OpenAI-compatible multimodal APIs,
- open-source models that only understand JSON or XML scaffolding,
- the official `desktop-starter` benchmark starter and the separate `osworld` benchmark adapter path.
## Core pieces
### DesktopEnv
Use DesktopEnv when you want an OSWorld-style desktop environment with QitOS contracts.
- `MockDesktopProvider` for smoke runs and deterministic examples
- `ContainerDesktopProvider` for container-first desktop kernels
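The provider seam is the part worth internalizing: the env asks a provider for observations, so swapping mock for container changes nothing upstream. A minimal sketch of that seam, using stub classes with hypothetical constructor signatures (the real QitOS classes are not shown in this document):

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for the real MockDesktopProvider: it returns a
# canned, deterministic observation, which is what makes smoke runs stable.
@dataclass
class MockDesktopProvider:
    screenshot: bytes = b"\x89PNG-fake"
    accessibility_tree: dict = field(default_factory=lambda: {"role": "desktop", "children": []})

    def observe(self) -> dict:
        # The three observation channels named above: screenshot,
        # accessibility tree, and terminal output.
        return {
            "screenshot": self.screenshot,
            "accessibility": self.accessibility_tree,
            "terminal": "",
        }

# Sketch of the env side of the seam: DesktopEnv delegates observation
# to whichever provider it was constructed with.
class DesktopEnv:
    def __init__(self, provider):
        self.provider = provider

    def observation(self) -> dict:
        return self.provider.observe()

env = DesktopEnv(MockDesktopProvider())
obs = env.observation()
print(sorted(obs))  # ['accessibility', 'screenshot', 'terminal']
```

A `ContainerDesktopProvider` would implement the same `observe()` surface against a running desktop kernel; callers never branch on which provider is active.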
### GUI tools
Atomic GUI tools live under `qitos.kit.tool.gui`.
`move_to`, `click`, `mouse_down`, `mouse_up`, `right_click`, `double_click`, `drag_to`, `scroll`, `type_text`, `press_key`, `key_down`, `key_up`, `hotkey`, `wait`, `done`, `fail`
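A closed action vocabulary like this is easy to validate against before dispatch. The set below mirrors the tool names above; the validation helper itself is illustrative, not a QitOS API:

```python
# The atomic action vocabulary from the list above, as a plain set.
GUI_ACTIONS = {
    "move_to", "click", "mouse_down", "mouse_up", "right_click",
    "double_click", "drag_to", "scroll", "type_text", "press_key",
    "key_down", "key_up", "hotkey", "wait", "done", "fail",
}

def validate_action(decision: dict) -> str:
    """Reject a parsed decision whose action name is outside the vocabulary."""
    name = decision.get("action")
    if name not in GUI_ACTIONS:
        raise ValueError(f"unknown GUI action: {name!r}")
    return name

print(validate_action({"action": "click", "x": 120, "y": 340}))  # click
```

Rejecting unknown names at this point keeps hallucinated actions from ever reaching the desktop runtime.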
### Composition-first toolset
Most users should not register each GUI tool by hand. Start from the preset bundle:
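A sketch of what the composition-first pattern looks like. The real `computer_use_tools()` signature is not shown in this document, so the factory below is a hypothetical stub that only illustrates the pattern: one call returning a ready-made bundle instead of sixteen per-tool registrations:

```python
from typing import Callable, Dict

def computer_use_tools() -> Dict[str, Callable[..., str]]:
    """Hypothetical stub: return the atomic GUI toolset as one bundle."""
    def make_tool(name: str) -> Callable[..., str]:
        def tool(**kwargs) -> str:
            # A real tool would drive the desktop; the stub just echoes.
            return f"{name}({kwargs})"
        tool.__name__ = name
        return tool
    # Abbreviated name list for illustration; the real bundle covers
    # the full vocabulary.
    names = ["move_to", "click", "type_text", "press_key", "hotkey", "done"]
    return {name: make_tool(name) for name in names}

tools = computer_use_tools()  # one call instead of per-tool registration
print(sorted(tools))
```

The design choice is that the bundle, not the individual tool, is the unit of authoring: presets stay consistent as the vocabulary evolves.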
## Protocol choice

QitOS keeps multimodal input and output scaffolding as separate concerns:

- multimodal input answers: “what can the model see?”
- protocol/parser answers: “how should the model respond?”
- `desktop_actions_json_v1`
- `desktop_actions_xml_v1`
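To make the separation concrete, here is an illustrative decision payload in the JSON flavor. The exact field names are an assumption, not the published `desktop_actions_json_v1` schema; the point is that the model's output is plain structured text that a parser turns into a tool call:

```python
import json

# Illustrative model output for a desktop_actions_json_v1-style protocol.
# Field names ("thought", "action", "args") are assumed for the sketch.
raw = """
{
  "thought": "The Save button is at the top-left of the toolbar.",
  "action": "click",
  "args": {"x": 132, "y": 48}
}
"""

decision = json.loads(raw)
print(decision["action"], decision["args"])
```

The XML flavor would carry the same decision in XML scaffolding; only the parser changes, not the multimodal input lane.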
## Example: openai_cua_agent.py
The main reference example is:
`/Users/morinop/coding/yoga_framework/examples/real/openai_cua_agent.py`
The file keeps the name `openai_cua_agent.py` so the lineage is easy to follow, but the implementation is intentionally QitOS-native.
The actual baseline implementation now lives in the recipe layer:
`/Users/morinop/coding/yoga_framework/qitos/recipes/desktop/osworld_starter.py`
The example at `/Users/morinop/coding/yoga_framework/examples/real/openai_cua_agent.py` wires the pieces together:
- current-step screenshot goes into the model via OpenAI-compatible multimodal messages
- GUI actions are returned as QitOS JSON/XML decisions
- GUI execution goes through `ComputerUseToolSet`
- environment state refresh goes through `DesktopEnv`
- the baseline prompt and state implement planner / grounding / action-selector discipline
- critic retries can reject obviously weak actions before the run continues
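The steps above form an observe → decide → act loop. A minimal self-contained sketch, with the model, parser, and toolset replaced by stubs (the real QitOS objects carry the multimodal call and the protocol parsing):

```python
import json

def fake_model(screenshot: bytes) -> str:
    # Stands in for the OpenAI-compatible multimodal call; returns a
    # protocol-shaped JSON decision as plain text.
    return json.dumps({"action": "done", "args": {}})

def fake_execute(action: str, args: dict) -> None:
    # Stands in for dispatch through the GUI toolset.
    print(f"executing {action} with {args}")

def run_episode(max_steps: int = 5) -> int:
    steps = 0
    screenshot = b"fake-png"  # the desktop env would supply a real capture
    for _ in range(max_steps):
        steps += 1
        decision = json.loads(fake_model(screenshot))  # parse protocol output
        fake_execute(decision["action"], decision["args"])
        if decision["action"] in {"done", "fail"}:  # terminal actions end the run
            break
        screenshot = b"refreshed-png"  # env state refresh before the next turn
    return steps

print(run_episode())  # 1: the stub model terminates immediately
```

A critic-retry stage would sit between parsing and execution, re-prompting the model instead of calling `fake_execute` when a decision looks weak.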
## Official benchmark path
The official v0.5 entrypoint is the `desktop-starter` benchmark starter.

## Container-first direction
The desktop lane is designed to be container-first. That keeps the future OSWorld-style adapter direction aligned with:

- reproducible desktop state
- provider isolation
- benchmark-friendly env lifecycles
The same container-first direction also covers the `osworld` benchmark adapter layer and the `qita` visual inspection surface.
It still does not claim full OSWorld parity or full v0.6 replay depth.