Multimodal Core and Desktop Starter
QitOS v0.5 now has one complete multimodal release path:
- screenshot-first multimodal input
- DesktopEnv, the official desktop benchmark family
- the openai_cua_agent.py baseline
- qita visual inspection
The first wire target: OpenAI-compatible chat completions
The first multimodal request shape QitOS supports is the OpenAI-compatible chat.completions content-array format, which accepts:
- local image files
- base64 / data URLs
- multiple images in one user turn
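A minimal sketch of this wire format, built with only the standard library. The content-array shape (`type: "image_url"` blocks with data URLs) is the OpenAI-compatible format named above; the helper function name is ours:

```python
import base64

def image_block_from_file(path: str, mime: str = "image/png") -> dict:
    """Encode a local image file as an OpenAI-style image_url content block."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}}

# A single user turn mixing text with multiple images (data URLs inlined here).
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe what is on screen."},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBORw0..."}},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBORw1..."}},
    ],
}
```

The same turn can therefore carry any mix of the three input kinds above: local files go through `image_block_from_file`, and base64/data URLs are passed straight into `image_url` blocks.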
The new core vocabulary
Instead of teaching each model adapter and environment its own image schema, QitOS now normalizes multimodal input and observations through a small set of shared types:
- ContentBlock
- MessageEnvelope
- ObservationPack
- GroundingMetadata
- VisualTraceAsset

Together, these types define:
- what counts as an image input
- how screenshot observations are represented
- what qita should record and display
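As an illustrative sketch only, two of these shared types might look roughly like the following; the field names are assumptions for illustration, not the QitOS source:

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative shapes only; field names are assumptions, not the QitOS API.

@dataclass
class ContentBlock:
    kind: str                        # "text" | "image"
    text: Optional[str] = None
    image_url: Optional[str] = None  # data URL or file reference

@dataclass
class ObservationPack:
    blocks: list = field(default_factory=list)    # ContentBlocks the model saw
    modalities: set = field(default_factory=set)  # e.g. {"text", "image"}

obs = ObservationPack(
    blocks=[ContentBlock(kind="image", image_url="data:image/png;base64,...")],
    modalities={"image"},
)
```

The point of the shared vocabulary is that adapters and environments exchange these types instead of provider-specific image schemas.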
Multimodal input does not replace parsers or protocols
This is the most important design point. In QitOS:
- multimodal input answers: what did the model see?
- protocol + parser answer: how must the model respond?
- tool schema answers: what can the model call?
The existing protocols and parsers are unchanged:
- react_text_v1
- json_decision_v1
- xml_decision_v1
- terminus_*
- minimax_tool_call_v1

v0.5 simply combines:
- screenshot input
- OpenAI-compatible multimodal wire format
- existing JSON/XML/native-tool-call parsers
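The combination can be sketched in a few lines: the model's input is multimodal, but its reply stays on the text-only JSON decision lane and is handled by a json_decision_v1-style parser. This stand-in parser is a simplification, not the QitOS implementation:

```python
import json

def parse_json_decision(model_text: str) -> dict:
    """Minimal stand-in for a json_decision_v1-style parser: the model replies
    with a plain JSON object regardless of whether its input contained images."""
    decision = json.loads(model_text)
    if "action" not in decision:
        raise ValueError("decision must name an action")
    return decision

# Input side: screenshot + text (multimodal). Output side: plain JSON.
reply = '{"action": "click", "target": "Save button"}'
decision = parse_json_decision(reply)
```

Because the response format is untouched, no existing parser needs to know the input contained images.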
Screenshot-first environment support
The first built-in multimodal environment is ScreenshotEnv.
It provides a minimal, benchmark-agnostic path for:
- a screenshot observation
- optional DOM / OCR / accessibility hints
- mock GUI control hooks
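A hypothetical driver loop against a ScreenshotEnv-like object might look as follows; the method names and observation fields here are assumptions for illustration, not the ScreenshotEnv contract:

```python
# Hypothetical stand-in for a ScreenshotEnv-style environment: each observation
# carries raw screenshot bytes plus optional DOM/OCR/accessibility hints.
class FakeScreenshotEnv:
    def reset(self) -> dict:
        return {"screenshot": b"\x89PNG...", "hints": {"ocr": ["Save", "Cancel"]}}

    def step(self, action: dict):
        # Mock GUI control hook: accept the action, return the next observation
        # and a done flag.
        return {"screenshot": b"\x89PNG...", "hints": {}}, False

env = FakeScreenshotEnv()
obs = env.reset()
obs, done = env.step({"action": "click", "target": "Save"})
```

The hints stay optional so the same loop works whether or not DOM/OCR extraction is available for a given benchmark.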
Example: visual_inspect_agent.py
The first baseline example is examples/real/visual_inspect_agent.py. It demonstrates how to:
- create a screenshot-backed task
- expose a screenshot-first env
- call an OpenAI-compatible multimodal model
- keep the response on the existing JSON decision lane
- inspect the run in qita with visual assets visible
qita support in v0.5
qita now records and shows:
- whether a step had screenshot-backed observations
- visual asset metadata
- the current step’s observation modalities
- whether the model input included images
- screenshot timeline cards
- replay screenshot preview
- basic action overlays
- grounding metadata visibility
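A per-step trace record carrying this information might look roughly like the following; the keys are illustrative assumptions, not qita's on-disk schema:

```python
# Hypothetical per-step record of the kind qita could surface; key names
# are assumptions for illustration only.
step_record = {
    "step": 3,
    "observation_modalities": ["text", "image"],
    "input_had_images": True,
    "visual_assets": [
        {"kind": "screenshot", "path": "assets/step_003.png",
         "width": 1920, "height": 1080},
    ],
    "grounding": None,  # filled when the action targets a screen region
}

def has_screenshot(record: dict) -> bool:
    """True if any recorded visual asset for this step is a screenshot."""
    return any(a["kind"] == "screenshot" for a in record.get("visual_assets", []))
```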
Why this matters for OSWorld-style work
OSWorld and similar GUI benchmarks need more than image input:
- screenshot observations
- structured actions
- grounding metadata
- traceable visual state
v0.5 lays that groundwork:
- multimodal messages are now first-class
- screenshot observations now fit the Engine/trace/qita model
- GUI capability hooks now exist in the environment contract
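Grounding metadata is what links a structured action back to a screen region so replay can draw an overlay. A sketch under assumed field names (not the GroundingMetadata schema):

```python
# Sketch of grounding metadata for a click: a normalized bounding box for the
# target plus the point actually clicked, in fractions of the screen.
# Field names are assumptions for illustration.
grounding = {
    "action": "click",
    "bbox": {"x": 0.62, "y": 0.18, "w": 0.10, "h": 0.04},
    "point": {"x": 0.67, "y": 0.20},
}

def point_in_bbox(point: dict, bbox: dict) -> bool:
    """Check that the clicked point falls inside the grounded target box."""
    return (bbox["x"] <= point["x"] <= bbox["x"] + bbox["w"]
            and bbox["y"] <= point["y"] <= bbox["y"] + bbox["h"])
```

Normalized coordinates keep the metadata resolution-independent, which matters when replaying a trace at a different screenshot size.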
Current boundary
v0.5 still does not promise:
- full official OSWorld parity
- full visual replay depth planned for v0.6
- broad provider parity across every multimodal runtime
