QitOS v0.5 now has a first official provider-neutral computer-use lane (a complete execution path for GUI and desktop interaction) for desktop and GUI work. This lane is inspired by the original OSWorld architecture and task loop, but it is implemented in QitOS-native pieces:Documentation Index
Fetch the complete documentation index at: https://qitor.mintlify.app/llms.txt
Use this file to discover all available pages before exploring further.
DesktopEnvfor screenshot + accessibility + terminal observations (the data returned by the environment after each action)qitos.kit.tool.guifor atomic GUI actionsComputerUseToolSet/computer_use_tools()for composition-first authoringdesktop_actions_json_v1anddesktop_actions_xml_v1for protocol-aware (each protocol defines the output format the model must follow) scaffolding- OpenAI-compatible multimodal image input for the current screenshot turn
Why this lane exists
The official OpenAI computer-use APIs are useful, but they are also provider-specific. QitOS takes a different default path for research:- keep the model input on the existing OpenAI-compatible image-input lane,
- keep the output contract in QitOS protocols (output format definitions) and parsers (components that convert raw model output into structured Decisions),
- keep the desktop runtime in a provider-neutral environment adapter.
- OpenAI-compatible multimodal APIs,
- open-source models that only understand JSON or XML scaffolding,
- the official
desktop-starterbenchmark starter and the separateosworldbenchmark adapter path.
Core pieces
DesktopEnv
Use DesktopEnv when you want an OSWorld-style desktop environment with QitOS contracts.
MockDesktopProviderfor smoke runs and deterministic examplesContainerDesktopProviderfor container-first desktop kernels
GUI tools
Atomic GUI tools live underqitos.kit.tool.gui.
move_toclickmouse_downmouse_upright_clickdouble_clickdrag_toscrolltype_textpress_keykey_downkey_uphotkeywaitdonefail
Composition-first toolset
Most users should not register each GUI tool by hand. Start from the preset bundle:Protocol choice
QitOS keeps multimodal input and output scaffolding as separate concerns.- multimodal input answers: what can the model see?
- protocol/parser answers: how should the model respond?
desktop_actions_json_v1desktop_actions_xml_v1
Example: openai_cua_agent.py
The main reference example is:
/Users/morinop/coding/yoga_framework/examples/real/openai_cua_agent.py
openai_cua_agent.py so the lineage is easy to follow, but the implementation is intentionally QitOS-native.
The actual baseline implementation now lives in the recipe layer:
/Users/morinop/coding/yoga_framework/qitos/recipes/desktop/osworld_starter.py
/Users/morinop/coding/yoga_framework/examples/real/openai_cua_agent.py
- current-step screenshot goes into the model via OpenAI-compatible multimodal messages
- GUI actions are returned as QitOS JSON/XML decisions
- GUI execution goes through
ComputerUseToolSet - environment state refresh goes through
DesktopEnv - the baseline prompt and state implement planner / grounding / action-selector discipline
- critic (a step-level validator) retries can reject obviously weak actions before the run continues
Official benchmark path
The official v0.5 entrypoint is:Container-first direction
The desktop lane is designed to be container-first. This keeps the future OSWorld-style adapter direction aligned with:- reproducible desktop state
- provider isolation
- benchmark-friendly env lifecycles
osworld benchmark adapter layer, and the qita visual inspection surface.
It still does not claim full OSWorld parity or full v0.6 replay depth.