QitOS v0.5 now has a first official provider-neutral computer-use lane for desktop and GUI work. This slice is inspired by the original OSWorld architecture and task loop, but it is implemented in QitOS-native pieces:
  • DesktopEnv for screenshot + accessibility + terminal observations
  • qitos.kit.tool.gui for atomic GUI actions
  • ComputerUseToolSet / computer_use_tools() for composition-first authoring
  • desktop_actions_json_v1 and desktop_actions_xml_v1 for protocol-aware scaffolding
  • OpenAI-compatible multimodal image input for the current screenshot turn

Why this lane exists

The official OpenAI computer-use APIs are useful, but they are also provider-specific. QitOS takes a different default path for research:
  • keep the model input on the existing OpenAI-compatible image-input lane,
  • keep the output contract in QitOS protocols and parsers,
  • keep the desktop runtime in a provider-neutral environment adapter.
That means the same computer-use harness can be used with:
  • OpenAI-compatible multimodal APIs,
  • open-source models that only understand JSON or XML scaffolding,
  • the official desktop-starter benchmark and the separate osworld benchmark adapter path.

Core pieces

DesktopEnv

Use DesktopEnv when you want an OSWorld-style desktop environment with QitOS contracts.
from qitos.kit.env import DesktopEnv

env = DesktopEnv.from_mock(
    screenshot_path="/tmp/desktop.png",
    instruction="Click the visible Continue button.",
    accessibility_tree={"role": "window", "name": "Demo"},
)
Current provider layers:
  • MockDesktopProvider for smoke runs and deterministic examples
  • ContainerDesktopProvider for container-first desktop kernels
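The loop those providers back can be sketched with a tiny stand-in environment. Everything below is illustrative: the class, method names (observe/step), and observation keys are assumptions used to show the shape of a turn, not the QitOS API.

```python
# Illustrative stand-in for the DesktopEnv observation/step loop.
# All names and field keys here are assumptions, not the QitOS API.
from dataclasses import dataclass, field


@dataclass
class FakeDesktopEnv:
    """Minimal mock in the spirit of MockDesktopProvider."""
    screenshot_path: str
    instruction: str
    accessibility_tree: dict = field(default_factory=dict)
    step_count: int = 0

    def observe(self) -> dict:
        # One observation per turn: pixels + a11y tree + task text.
        return {
            "screenshot": self.screenshot_path,
            "accessibility_tree": self.accessibility_tree,
            "instruction": self.instruction,
        }

    def step(self, action: dict) -> dict:
        # A real provider would execute the GUI action, then re-capture state.
        self.step_count += 1
        return self.observe()


env = FakeDesktopEnv(
    screenshot_path="/tmp/desktop.png",
    instruction="Click the visible Continue button.",
    accessibility_tree={"role": "window", "name": "Demo"},
)
obs = env.step({"type": "click", "x": 512, "y": 384})
```

The point of the sketch is the cadence: every action is followed by a fresh observation, which is what makes container-backed providers swappable for the mock.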

GUI tools

Atomic GUI tools live under qitos.kit.tool.gui.
from qitos.kit.tool.gui import Click, TypeText, Hotkey
The action vocabulary aligns with OSWorld-style desktop actions:
  • move_to
  • click
  • mouse_down
  • mouse_up
  • right_click
  • double_click
  • drag_to
  • scroll
  • type_text
  • press_key
  • key_down
  • key_up
  • hotkey
  • wait
  • done
  • fail
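As a sketch of what a protocol-level action decision looks like against this vocabulary, here is a minimal stdlib validator. The payload field names (`type`, `x`, `y`, `keys`) are illustrative assumptions; QitOS's own parsers define the real contract.

```python
# Hypothetical validation of a desktop-action dict against the
# OSWorld-style vocabulary. Field names are illustrative assumptions.
DESKTOP_ACTIONS = {
    "move_to", "click", "mouse_down", "mouse_up", "right_click",
    "double_click", "drag_to", "scroll", "type_text", "press_key",
    "key_down", "key_up", "hotkey", "wait", "done", "fail",
}


def validate_action(action: dict) -> dict:
    """Reject any action whose type falls outside the vocabulary."""
    kind = action.get("type")
    if kind not in DESKTOP_ACTIONS:
        raise ValueError(f"unknown desktop action: {kind!r}")
    return action


validate_action({"type": "click", "x": 512, "y": 384})
validate_action({"type": "hotkey", "keys": ["ctrl", "s"]})
```

A fixed closed vocabulary like this is what lets the JSON and XML protocol presets share one downstream execution path.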

Composition-first toolset

Most users should not register each GUI tool by hand. Start from the preset bundle:
from qitos.kit.toolset import computer_use_tools

registry = computer_use_tools()
Or stay on the list-first authoring path:
from qitos.kit import ComputerUseToolSet

agent = MyAgent(
    toolset=[ComputerUseToolSet()],
    llm=model,
    model_protocol="desktop_actions_json_v1",
)

Protocol choice

QitOS keeps multimodal input and output scaffolding as separate concerns.
  • multimodal input answers: “what can the model see?”
  • protocol/parser answers: “how should the model respond?”
For desktop work, QitOS now ships two protocol presets:
  • desktop_actions_json_v1
  • desktop_actions_xml_v1
Use JSON first when the model is comfortable with structured JSON. Use XML when the model tends to follow tag-based contracts more reliably.
agent = MyAgent(
    toolset=[ComputerUseToolSet()],
    llm=model,
    model_protocol="desktop_actions_xml_v1",
)
This is the same design idea used elsewhere in QitOS: adapt the scaffolding to the model, instead of pretending one parser/prompt shape fits every family.
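To make the JSON-versus-XML choice concrete, here is a stdlib-only sketch of recovering the same click decision from both scaffolding shapes. The payload shapes below are illustrative assumptions, not the exact desktop_actions_json_v1 / desktop_actions_xml_v1 grammars.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical model replies: the same click in two scaffolding styles.
json_reply = '{"action": {"type": "click", "x": 512, "y": 384}}'
xml_reply = '<action type="click"><x>512</x><y>384</y></action>'


def parse_json_action(text: str) -> dict:
    # Structured-JSON scaffolding: one decode, then a key lookup.
    return json.loads(text)["action"]


def parse_xml_action(text: str) -> dict:
    # Tag-based scaffolding: attributes and child tags carry the fields.
    root = ET.fromstring(text)
    return {
        "type": root.attrib["type"],
        "x": int(root.findtext("x")),
        "y": int(root.findtext("y")),
    }


assert parse_json_action(json_reply) == parse_xml_action(xml_reply)
```

Because both parsers normalize to the same action dict, swapping `model_protocol` changes only the scaffolding the model sees, not the execution path.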

Example: openai_cua_agent.py

The main reference example is:
  • /Users/morinop/coding/yoga_framework/examples/real/openai_cua_agent.py
It keeps the file name close to OSWorld’s original openai_cua_agent.py so the lineage is easy to follow, but the implementation is intentionally QitOS-native. The public example file is only a thin entrypoint; the actual baseline implementation lives in the recipe layer:
  • /Users/morinop/coding/yoga_framework/qitos/recipes/desktop/osworld_starter.py
Together they form the benchmark-grade starter baseline, not just a thin demo loop:
  • current-step screenshot goes into the model via OpenAI-compatible multimodal messages
  • GUI actions are returned as QitOS JSON/XML decisions
  • GUI execution goes through ComputerUseToolSet
  • environment state refresh goes through DesktopEnv
  • the baseline prompt and state implement planner / grounding / action-selector discipline
  • critic retries can reject obviously weak actions before the run continues
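The “current-step screenshot goes into the model” step uses the standard OpenAI-compatible multimodal message shape, which can be sketched with the stdlib alone. The placeholder bytes below stand in for a real screenshot capture.

```python
import base64

# Placeholder bytes stand in for a real PNG screenshot; a real run would use
# something like open("/tmp/desktop.png", "rb").read().
screenshot_bytes = b"\x89PNG placeholder"
b64 = base64.b64encode(screenshot_bytes).decode("ascii")

# OpenAI-compatible multimodal user turn: instruction text + screenshot image.
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Click the visible Continue button."},
        {
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{b64}"},
        },
    ],
}
```

This is the same image-input lane QitOS already uses elsewhere, which is why the desktop slice needs no provider-specific computer-use API.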
Smoke run:
python examples/real/openai_cua_agent.py
If you want the smallest environment-only loop, use:
python examples/real/desktop_env_smoke.py

Official benchmark path

The official v0.5 entrypoint is:
qit bench run \
  --benchmark desktop-starter \
  --split starter \
  --strategy desktop_smoke \
  --output ./artifacts/desktop-starter.jsonl
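The run writes one JSON object per task to the --output JSONL file. A sketch of tallying results follows; the record fields ("task_id", "success", "steps") are assumptions about the artifact schema, shown here inline rather than read from disk.

```python
import json

# Hypothetical two-record artifact; a real run writes
# ./artifacts/desktop-starter.jsonl with one JSON object per line.
artifact_lines = [
    '{"task_id": "starter-001", "success": true, "steps": 4}',
    '{"task_id": "starter-002", "success": false, "steps": 12}',
]

records = [json.loads(line) for line in artifact_lines]
passed = sum(1 for r in records if r["success"])
print(f"{passed}/{len(records)} tasks passed")
```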

Container-first direction

The desktop lane is designed to be container-first. That keeps the future OSWorld-style adapter direction aligned with:
  • reproducible desktop state
  • provider isolation
  • benchmark-friendly env lifecycles
The current implementation now ships the first official desktop starter benchmark path, the separate osworld benchmark adapter layer, and the qita visual inspection surface. It still does not claim full OSWorld parity or full v0.6 replay depth.