QitOS v0.5 now has a first official provider-neutral computer-use lane for desktop and GUI work. This slice is inspired by the original OSWorld architecture and task loop, but it is implemented in QitOS-native pieces:
  • DesktopEnv for screenshot + accessibility + terminal observations
  • qitos.kit.tool.gui for atomic GUI actions
  • ComputerUseToolSet / computer_use_tools() for composition-first authoring
  • desktop_actions_json_v1 and desktop_actions_xml_v1 for protocol-aware scaffolding
  • OpenAI-compatible multimodal image input for the current screenshot turn

Why this lane exists

The official OpenAI computer-use APIs are useful, but they are also provider-specific. QitOS takes a different default path for research:
  • keep the model input on the existing OpenAI-compatible image-input lane,
  • keep the output contract in QitOS protocols and parsers,
  • keep the desktop runtime in a provider-neutral environment adapter.
That means the same computer-use harness can be used with:
  • OpenAI-compatible multimodal APIs,
  • open-source models that only understand JSON or XML scaffolding,
  • the official desktop-starter benchmark and the separate osworld benchmark adapter path.

Core pieces

DesktopEnv

Use DesktopEnv when you want an OSWorld-style desktop environment with QitOS contracts.
from qitos.kit.env import DesktopEnv

env = DesktopEnv.from_mock(
    screenshot_path="/tmp/desktop.png",
    instruction="Click the visible Continue button.",
    accessibility_tree={"role": "window", "name": "Demo"},
)
Current provider layers:
  • MockDesktopProvider for smoke runs and deterministic examples
  • ContainerDesktopProvider for container-first desktop kernels
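The loop those providers back can be sketched with a tiny stand-in environment. Everything below is illustrative: the class, method names (observe/step), and observation keys are assumptions used to show the shape of a turn, not the QitOS API.

```python
# Illustrative stand-in for the DesktopEnv observation/step loop.
# All names and field keys here are assumptions, not the QitOS API.
from dataclasses import dataclass, field


@dataclass
class FakeDesktopEnv:
    """Minimal mock in the spirit of MockDesktopProvider."""
    screenshot_path: str
    instruction: str
    accessibility_tree: dict = field(default_factory=dict)
    step_count: int = 0

    def observe(self) -> dict:
        # One observation per turn: pixels + a11y tree + task text.
        return {
            "screenshot": self.screenshot_path,
            "accessibility_tree": self.accessibility_tree,
            "instruction": self.instruction,
        }

    def step(self, action: dict) -> dict:
        # A real provider would execute the GUI action, then re-capture state.
        self.step_count += 1
        return self.observe()


env = FakeDesktopEnv(
    screenshot_path="/tmp/desktop.png",
    instruction="Click the visible Continue button.",
    accessibility_tree={"role": "window", "name": "Demo"},
)
obs = env.step({"type": "click", "x": 512, "y": 384})
```

The point of the sketch is the cadence: every action is followed by a fresh observation, which is what makes container-backed providers swappable for the mock.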

GUI tools

Atomic GUI tools live under qitos.kit.tool.gui.
from qitos.kit.tool.gui import Click, TypeText, Hotkey
The action vocabulary aligns with OSWorld-style desktop actions:
  • move_to
  • click
  • mouse_down
  • mouse_up
  • right_click
  • double_click
  • drag_to
  • scroll
  • type_text
  • press_key
  • key_down
  • key_up
  • hotkey
  • wait
  • done
  • fail
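As a sketch of what a protocol-level action decision looks like against this vocabulary, here is a minimal stdlib validator. The payload field names (`type`, `x`, `y`, `keys`) are illustrative assumptions; QitOS's own parsers define the real contract.

```python
# Hypothetical validation of a desktop-action dict against the
# OSWorld-style vocabulary. Field names are illustrative assumptions.
DESKTOP_ACTIONS = {
    "move_to", "click", "mouse_down", "mouse_up", "right_click",
    "double_click", "drag_to", "scroll", "type_text", "press_key",
    "key_down", "key_up", "hotkey", "wait", "done", "fail",
}


def validate_action(action: dict) -> dict:
    """Reject any action whose type falls outside the vocabulary."""
    kind = action.get("type")
    if kind not in DESKTOP_ACTIONS:
        raise ValueError(f"unknown desktop action: {kind!r}")
    return action


validate_action({"type": "click", "x": 512, "y": 384})
validate_action({"type": "hotkey", "keys": ["ctrl", "s"]})
```

A fixed closed vocabulary like this is what lets the JSON and XML protocol presets share one downstream execution path.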

Composition-first toolset

Most users should not register each GUI tool by hand. Start from the preset bundle:
from qitos.kit.toolset import computer_use_tools

registry = computer_use_tools()
Or stay on the list-first authoring path:
from qitos.kit import ComputerUseToolSet

agent = MyAgent(
    toolset=[ComputerUseToolSet()],
    llm=model,
    model_protocol="desktop_actions_json_v1",
)

Protocol choice

QitOS keeps multimodal input and output scaffolding as separate concerns.
  • multimodal input answers: “what can the model see?”
  • protocol/parser answers: “how should the model respond?”
For desktop work, QitOS now ships two protocol presets:
  • desktop_actions_json_v1
  • desktop_actions_xml_v1
Use JSON first when the model is comfortable with structured JSON. Use XML when the model tends to follow tag-based contracts more reliably.
agent = MyAgent(
    toolset=[ComputerUseToolSet()],
    llm=model,
    model_protocol="desktop_actions_xml_v1",
)
This is the same design idea used elsewhere in QitOS: adapt the scaffolding to the model, instead of pretending one parser/prompt shape fits every family.
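To make the JSON-versus-XML choice concrete, here is a stdlib-only sketch of recovering the same click decision from both scaffolding shapes. The payload shapes below are illustrative assumptions, not the exact desktop_actions_json_v1 / desktop_actions_xml_v1 grammars.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical model replies: the same click in two scaffolding styles.
json_reply = '{"action": {"type": "click", "x": 512, "y": 384}}'
xml_reply = '<action type="click"><x>512</x><y>384</y></action>'


def parse_json_action(text: str) -> dict:
    # Structured-JSON scaffolding: one decode, then a key lookup.
    return json.loads(text)["action"]


def parse_xml_action(text: str) -> dict:
    # Tag-based scaffolding: attributes and child tags carry the fields.
    root = ET.fromstring(text)
    return {
        "type": root.attrib["type"],
        "x": int(root.findtext("x")),
        "y": int(root.findtext("y")),
    }


assert parse_json_action(json_reply) == parse_xml_action(xml_reply)
```

Because both parsers normalize to the same action dict, swapping `model_protocol` changes only the scaffolding the model sees, not the execution path.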

Example: openai_cua_agent.py

The main reference example is:
  • /Users/morinop/coding/yoga_framework/examples/real/openai_cua_agent.py
It keeps the file name close to OSWorld’s original openai_cua_agent.py so the lineage is easy to follow, but the implementation is intentionally QitOS-native. The public example file is only a thin entrypoint; the actual baseline implementation lives in the recipe layer:
  • /Users/morinop/coding/yoga_framework/qitos/recipes/desktop/osworld_starter.py
Together they form the benchmark-grade starter baseline, not just a thin demo loop:
  • current-step screenshot goes into the model via OpenAI-compatible multimodal messages
  • GUI actions are returned as QitOS JSON/XML decisions
  • GUI execution goes through ComputerUseToolSet
  • environment state refresh goes through DesktopEnv
  • the baseline prompt and state implement planner / grounding / action-selector discipline
  • critic retries can reject obviously weak actions before the run continues
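The “current-step screenshot goes into the model” step uses the standard OpenAI-compatible multimodal message shape, which can be sketched with the stdlib alone. The placeholder bytes below stand in for a real screenshot capture.

```python
import base64

# Placeholder bytes stand in for a real PNG screenshot; a real run would use
# something like open("/tmp/desktop.png", "rb").read().
screenshot_bytes = b"\x89PNG placeholder"
b64 = base64.b64encode(screenshot_bytes).decode("ascii")

# OpenAI-compatible multimodal user turn: instruction text + screenshot image.
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Click the visible Continue button."},
        {
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{b64}"},
        },
    ],
}
```

This is the same image-input lane QitOS already uses elsewhere, which is why the desktop slice needs no provider-specific computer-use API.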
Smoke run:
python examples/real/openai_cua_agent.py
If you want the smallest environment-only loop, use:
python examples/real/desktop_env_smoke.py

Official benchmark path

The official v0.5 entrypoint is:
qit bench run \
  --benchmark desktop-starter \
  --split starter \
  --strategy desktop_smoke \
  --output ./artifacts/desktop-starter.jsonl
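The run writes one JSON object per task to the --output JSONL file. A sketch of tallying results follows; the record fields ("task_id", "success", "steps") are assumptions about the artifact schema, shown here inline rather than read from disk.

```python
import json

# Hypothetical two-record artifact; a real run writes
# ./artifacts/desktop-starter.jsonl with one JSON object per line.
artifact_lines = [
    '{"task_id": "starter-001", "success": true, "steps": 4}',
    '{"task_id": "starter-002", "success": false, "steps": 12}',
]

records = [json.loads(line) for line in artifact_lines]
passed = sum(1 for r in records if r["success"])
print(f"{passed}/{len(records)} tasks passed")
```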

Container-first direction

The desktop lane is designed to be container-first. That keeps the future OSWorld-style adapter direction aligned with:
  • reproducible desktop state
  • provider isolation
  • benchmark-friendly env lifecycles
The current implementation now ships the first official desktop starter benchmark path, the separate osworld benchmark adapter layer, and the qita visual inspection surface. It still does not claim full OSWorld parity or full v0.6 replay depth.