Multimodal Core and Desktop Starter
QitOS v0.5 now has one complete multimodal release path:
- screenshot-first multimodal input
- DesktopEnv, the official desktop benchmark family
- the openai_cua_agent.py baseline
- qita visual inspection
The first wire target: OpenAI-compatible chat completions
The first multimodal request shape QitOS supports is the OpenAI-compatible chat.completions content-array format, which accepts:
- local image files
- base64 / data URLs
- multiple images in one user turn
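A minimal sketch of this wire format, built with only the standard library. The content-array shape (`type: "image_url"` blocks with data URLs) is the OpenAI-compatible format named above; the helper function name is ours:

```python
import base64

def image_block_from_file(path: str, mime: str = "image/png") -> dict:
    """Encode a local image file as an OpenAI-style image_url content block."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}}

# A single user turn mixing text with multiple images (data URLs inlined here).
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe what is on screen."},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBORw0..."}},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBORw1..."}},
    ],
}
```

The same turn can therefore carry any mix of the three input kinds above: local files go through `image_block_from_file`, and base64/data URLs are passed straight into `image_url` blocks.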
The new core vocabulary
Instead of teaching each model adapter and environment its own image schema, QitOS now normalizes multimodal input and observations through a small set of shared types:
- ContentBlock
- MessageEnvelope
- ObservationPack
- GroundingMetadata
- VisualTraceAsset

Together, these types define:
- what counts as an image input
- how screenshot observations are represented
- what qita should record and display
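As an illustrative sketch only, two of these shared types might look roughly like the following; the field names are assumptions for illustration, not the QitOS source:

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative shapes only; field names are assumptions, not the QitOS API.

@dataclass
class ContentBlock:
    kind: str                        # "text" | "image"
    text: Optional[str] = None
    image_url: Optional[str] = None  # data URL or file reference

@dataclass
class ObservationPack:
    blocks: list = field(default_factory=list)    # ContentBlocks the model saw
    modalities: set = field(default_factory=set)  # e.g. {"text", "image"}

obs = ObservationPack(
    blocks=[ContentBlock(kind="image", image_url="data:image/png;base64,...")],
    modalities={"image"},
)
```

The point of the shared vocabulary is that adapters and environments exchange these types instead of provider-specific image schemas.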
Multimodal input does not replace parsers or protocols
This is the most important design point. In QitOS:
- multimodal input answers: what did the model see?
- protocol + parser answer: how must the model respond?
- tool schema answers: what can the model call?
The existing protocols and parsers are unchanged:
- react_text_v1
- json_decision_v1
- xml_decision_v1
- terminus_*
- minimax_tool_call_v1

v0.5 simply combines:
- screenshot input
- OpenAI-compatible multimodal wire format
- existing JSON/XML/native-tool-call parsers
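The combination can be sketched in a few lines: the model's input is multimodal, but its reply stays on the text-only JSON decision lane and is handled by a json_decision_v1-style parser. This stand-in parser is a simplification, not the QitOS implementation:

```python
import json

def parse_json_decision(model_text: str) -> dict:
    """Minimal stand-in for a json_decision_v1-style parser: the model replies
    with a plain JSON object regardless of whether its input contained images."""
    decision = json.loads(model_text)
    if "action" not in decision:
        raise ValueError("decision must name an action")
    return decision

# Input side: screenshot + text (multimodal). Output side: plain JSON.
reply = '{"action": "click", "target": "Save button"}'
decision = parse_json_decision(reply)
```

Because the response format is untouched, no existing parser needs to know the input contained images.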
Screenshot-first environment support
The first built-in multimodal environment is ScreenshotEnv.
It provides a minimal, benchmark-agnostic path for:
- a screenshot observation
- optional DOM / OCR / accessibility hints
- mock GUI control hooks
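A hypothetical driver loop against a ScreenshotEnv-like object might look as follows; the method names and observation fields here are assumptions for illustration, not the ScreenshotEnv contract:

```python
# Hypothetical stand-in for a ScreenshotEnv-style environment: each observation
# carries raw screenshot bytes plus optional DOM/OCR/accessibility hints.
class FakeScreenshotEnv:
    def reset(self) -> dict:
        return {"screenshot": b"\x89PNG...", "hints": {"ocr": ["Save", "Cancel"]}}

    def step(self, action: dict):
        # Mock GUI control hook: accept the action, return the next observation
        # and a done flag.
        return {"screenshot": b"\x89PNG...", "hints": {}}, False

env = FakeScreenshotEnv()
obs = env.reset()
obs, done = env.step({"action": "click", "target": "Save"})
```

The hints stay optional so the same loop works whether or not DOM/OCR extraction is available for a given benchmark.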
Example: visual_inspect_agent.py
The first baseline example is examples/real/visual_inspect_agent.py. It demonstrates how to:
- create a screenshot-backed task
- expose a screenshot-first env
- call an OpenAI-compatible multimodal model
- keep the response on the existing JSON decision lane
- inspect the run in qita with visual assets visible
qita support in v0.5
qita now records and shows:
- whether a step had screenshot-backed observations
- visual asset metadata
- the current step’s observation modalities
- whether the model input included images
- screenshot timeline cards
- replay screenshot preview
- basic action overlays
- grounding metadata visibility
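A per-step trace record carrying this information might look roughly like the following; the keys are illustrative assumptions, not qita's on-disk schema:

```python
# Hypothetical per-step record of the kind qita could surface; key names
# are assumptions for illustration only.
step_record = {
    "step": 3,
    "observation_modalities": ["text", "image"],
    "input_had_images": True,
    "visual_assets": [
        {"kind": "screenshot", "path": "assets/step_003.png",
         "width": 1920, "height": 1080},
    ],
    "grounding": None,  # filled when the action targets a screen region
}

def has_screenshot(record: dict) -> bool:
    """True if any recorded visual asset for this step is a screenshot."""
    return any(a["kind"] == "screenshot" for a in record.get("visual_assets", []))
```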
Why this matters for OSWorld-style work
OSWorld and similar GUI benchmarks need more than image input:
- screenshot observations
- structured actions
- grounding metadata
- traceable visual state
v0.5 lays that groundwork:
- multimodal messages are now first-class
- screenshot observations now fit the Engine/trace/qita model
- GUI capability hooks now exist in the environment contract
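Grounding metadata is what links a structured action back to a screen region so replay can draw an overlay. A sketch under assumed field names (not the GroundingMetadata schema):

```python
# Sketch of grounding metadata for a click: a normalized bounding box for the
# target plus the point actually clicked, in fractions of the screen.
# Field names are assumptions for illustration.
grounding = {
    "action": "click",
    "bbox": {"x": 0.62, "y": 0.18, "w": 0.10, "h": 0.04},
    "point": {"x": 0.67, "y": 0.20},
}

def point_in_bbox(point: dict, bbox: dict) -> bool:
    """Check that the clicked point falls inside the grounded target box."""
    return (bbox["x"] <= point["x"] <= bbox["x"] + bbox["w"]
            and bbox["y"] <= point["y"] <= bbox["y"] + bbox["h"])
```

Normalized coordinates keep the metadata resolution-independent, which matters when replaying a trace at a different screenshot size.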
Current boundary
v0.5 still does not promise:
- full official OSWorld parity
- full visual replay depth planned for v0.6
- broad provider parity across every multimodal runtime
