Multimodal Core and Desktop Starter

QitOS v0.5 now has one complete multimodal release path:
  • screenshot-first multimodal input
  • DesktopEnv
  • the official desktop benchmark family
  • the openai_cua_agent.py baseline
  • qita visual inspection
The goal is no longer just to lay a multimodal foundation. It is that one desktop / computer-use story now holds together end to end.

The first wire target: OpenAI-compatible chat completions

The first multimodal request shape QitOS now supports is the OpenAI-compatible chat.completions content-array format:
{
  "role": "user",
  "content": [
    {"type": "text", "text": "Inspect this screenshot."},
    {"type": "image_url", "image_url": {"url": "https://..."}}
  ]
}
QitOS also supports the same shape with:
  • local image files
  • base64 / data URLs
  • multiple images in one user turn
Pure-text requests still keep the previous text-only path.
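As a minimal sketch of the request shape above, the following standard-library helpers build one user turn from a mix of remote URLs, data URLs, and local image files. The names `image_block` and `user_message` are hypothetical and are not part of the QitOS API; they only illustrate the OpenAI-compatible content-array format.

```python
import base64
import mimetypes
from pathlib import Path


def image_block(source: str) -> dict:
    """Build one image_url content block from a URL, a data URL, or a local file path.

    Hypothetical helper for illustration only; not a QitOS function.
    """
    if source.startswith(("http://", "https://", "data:")):
        # Remote URLs and ready-made data URLs pass through unchanged.
        url = source
    else:
        # Local files are inlined as a base64 data URL.
        path = Path(source)
        mime = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
        encoded = base64.b64encode(path.read_bytes()).decode("ascii")
        url = f"data:{mime};base64,{encoded}"
    return {"type": "image_url", "image_url": {"url": url}}


def user_message(text: str, images: list[str]) -> dict:
    """Build a user turn with one text block followed by zero or more image blocks."""
    content = [{"type": "text", "text": text}]
    content.extend(image_block(src) for src in images)
    return {"role": "user", "content": content}
```

Calling `user_message("Inspect these screenshots.", ["screen.png", "https://..."])` would yield one text block followed by two image blocks in a single user turn. Note that this sketch always returns a content array, even with no images; it does not model the separate text-only path.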

The new core vocabulary

Instead of teaching each model adapter and environment its own image schema, QitOS now normalizes multimodal input and observations through a small set of shared types:
  • ContentBlock
  • MessageEnvelope
  • ObservationPack
  • GroundingMetadata
  • VisualTraceAsset
These shared types provide one place to define:
  • what counts as an image input
  • how screenshot observations are represented
  • what qita should record and display

Multimodal input does not replace parsers or protocols

The key design point is that each layer answers a different question. In QitOS:
  • multimodal input answers: what did the model see?
  • protocol (output format definition) + parser (raw-output-to-Decision converter) answer: how must the model respond?
  • tool schema answers: what can the model call?
Visual input works alongside the protocols you already know:
  • react_text_v1
  • json_decision_v1
  • xml_decision_v1
  • terminus_*
  • minimax_tool_call_v1
This means we can combine:
  • screenshot input
  • OpenAI-compatible multimodal wire format
  • existing JSON/XML/native-tool-call parsers
without inventing a separate vision-only runtime.
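To illustrate why the output side is independent of the input modality, here is a minimal sketch of a JSON decision parser. It only ever sees the model's raw text, so it behaves the same whether or not the request carried a screenshot. The function `parse_json_decision` and the `action` / `args` keys are assumptions made for this example; this is not the QitOS json_decision_v1 parser.

```python
import json


def parse_json_decision(raw: str) -> dict:
    """Extract the first JSON object embedded in raw model output.

    Hypothetical stand-in for a protocol parser; the real QitOS parsers may differ.
    """
    start = raw.find("{")
    if start == -1:
        raise ValueError("no JSON object found in model output")
    # raw_decode parses one JSON value and ignores any trailing text.
    decision, _ = json.JSONDecoder().raw_decode(raw[start:])
    return decision


# The raw text could come from a text-only or a screenshot-backed request;
# the parser cannot tell the difference and does not need to.
raw_output = 'I see a login page.\n{"action": "click", "args": {"x": 10, "y": 20}}'
decision = parse_json_decision(raw_output)
```

Because the parser has no image-specific branch, adding screenshot input changes only the request side and leaves the response contract as it was.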

Screenshot-first environment support

The first built-in multimodal environment is ScreenshotEnv. It provides a minimal, benchmark-agnostic path for:
  • a screenshot observation
  • optional DOM / OCR / accessibility hints
  • mock GUI control hooks
The result is a stable SDK surface before heavier GUI environments are plugged in.
from qitos.kit import ScreenshotEnv

env = ScreenshotEnv(
    screenshot_path="screen.png",
    text="This screenshot shows a login page.",
    dom={"title": "Login"},
    accessibility_tree={"role": "window"},
)

Example: visual_inspect_agent.py

The first baseline example is:
  • examples/real/visual_inspect_agent.py
It demonstrates a very small but complete path:
  1. create a screenshot-backed task
  2. expose a screenshot-first env
  3. call an OpenAI-compatible multimodal model
  4. keep the response on the existing JSON decision lane
  5. inspect the run in qita with visual assets visible
This example is a better starting point for visual-web / GUI research than jumping straight into a benchmark runner.

qita support in v0.5

qita now records and shows:
  • whether a step had screenshot-backed observations
  • visual asset metadata
  • the current step’s observation modalities
  • whether the model input included images
  • screenshot timeline cards
  • replay screenshot preview
  • basic action overlays
  • grounding metadata visibility

Why this matters for OSWorld-style work

OSWorld and similar GUI benchmarks need more than image input:
  • screenshot observations
  • structured actions
  • grounding metadata
  • traceable visual state
The current phase gives QitOS a clean foundation for that work:
  • multimodal messages are now first-class
  • screenshot observations now fit the Engine/trace (structured run log)/qita model
  • GUI capability hooks now exist in the environment contract
So the next OSWorld adapter can be built on top of the kernel (the core AgentModule + Engine loop) instead of beside it.

Current boundary

v0.5 still does not promise:
  • full official OSWorld parity
  • the full visual replay depth planned for v0.6
  • broad provider parity across every multimodal runtime
The release bar is narrower and clearer: one strong desktop starter path.