This final lesson answers the hardest question in the course: can the same kernel power a serious domain agent without collapsing into custom orchestration? The answer in QitOS is yes, but only if you put domain logic in the right places. You will study examples/real/code_security_audit_agent.py.

What changes from lesson 3

| Branch | Claude Code-style lesson | Security audit lesson |
| --- | --- | --- |
| Goal | Modify code and verify a patch | Inspect code, collect evidence, and rank findings |
| Tool surface | General coding preset | Security audit tools + codebase tools + task board |
| Prompt policy | Coding workflow discipline | Audit protocol and evidence discipline |
| State | Todos and mode | Scratchpad and ranked findings |
| Success condition | Passing verification command | High-signal final audit report |
| qita usage | Debugging long-running behavior | Producing a review artifact |

The system prompt now teaches an audit protocol

The lesson uses SECURITY_AUDIT_SYSTEM_PROMPT, which says things like:
Primary objective:
- Audit the repository for meaningful security risk, not just keyword matches.
- Use tools to collect evidence before making strong claims.

Judgment rules:
- Treat tool output as evidence, not proof.
- Separate results into:
  1. confirmed issue
  2. high-value lead
  3. human review needed
- Prefer a small number of high-signal findings over a long noisy list.
This is the final step in the course’s prompt design ladder:
  • lesson 1: parser contract
  • lesson 2: planner versus executor contracts
  • lesson 3: workflow discipline
  • lesson 4: domain judgment protocol
The runtime still has not changed.

The parser and harness stay stable

The audit agent still uses:
model_parser=ReActTextParser()
and a text-first OpenAI-compatible model harness. That stability is important. It shows that domain specialization does not require a new protocol by default. You should only move to a model-specific harness if the domain actually benefits from it. For example:
  • use JSON/XML contracts when you require stricter machine-readable outputs
  • use a native tool-call parser when your provider produces structured tool calls more reliably than text
  • use Terminus-style protocols when the agent is controlling an interactive terminal rather than calling repository tools directly
QitOS supports those options, but the default research path remains provider-agnostic text ReAct.
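To make the text-first contract concrete, here is a standalone sketch of the kind of parsing a ReAct text parser performs. This is illustrative only: `parse_react_text` is a hypothetical helper, not QitOS's actual `ReActTextParser`, but it shows why the contract stays provider-agnostic, since any model that can emit `Action:` / `Action Input:` / `Final Answer:` lines can drive it.

```python
import json
import re

def parse_react_text(text: str) -> dict:
    """Parse a minimal ReAct-style reply into a decision dict.

    Illustrative sketch: QitOS's ReActTextParser has its own contract
    and repair path; this only shows the shape of a text-first protocol.
    """
    final = re.search(r"Final Answer:\s*(.*)", text, re.DOTALL)
    if final:
        return {"mode": "final", "final_answer": final.group(1).strip()}
    action = re.search(r"Action:\s*(\S+)", text)
    args = re.search(r"Action Input:\s*(\{.*?\})", text, re.DOTALL)
    if not action:
        return {"mode": "error", "diagnostic": "no Action or Final Answer found"}
    return {
        "mode": "tool",
        "tool": action.group(1),
        "arguments": json.loads(args.group(1)) if args else {},
    }

reply = """Thought: I should list the repository first.
Action: list_files
Action Input: {"path": "."}"""
decision = parse_react_text(reply)
```

Because the contract is plain text, switching providers changes nothing in the parser, which is exactly the stability the lesson is pointing at.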

1. Compose the tool surface by domain

The lesson combines three tool families:
super().__init__(
    toolset=[
        SecurityAuditToolSet(
            workspace_root=workspace_root,
            include_external=False,
            max_matches=80,
        ),
        CodingToolSet(
            workspace_root=workspace_root,
            include_notebook=False,
            enable_lsp=False,
            enable_tasks=False,
            enable_web=False,
            expose_legacy_aliases=True,
            expose_modern_names=False,
            profile="codebase",
        ),
        TaskToolSet(workspace_root=workspace_root),
    ],
    llm=llm,
    model_parser=ReActTextParser(),
)
This is the capstone lesson in tool composition. The tool surface now has layers:
  • domain reasoning tools from SecurityAuditToolSet
  • low-level repository inspection from the codebase profile of CodingToolSet
  • explicit progress tracking from TaskToolSet
This is how QitOS wants you to specialize agents: by composing the right environment, not by writing a new loop.

2. Encode the audit method in prompt plus prepare

The prompt provides the audit discipline, and prepare() makes the run stage explicit:
lines = [
    f"Audit task: {state.task}",
    f"Workspace: {WORKSPACE}",
    f"Step: {state.current_step}/{state.max_steps}",
    "Suggested flow: inventory -> entrypoints -> sinks/secrets/config/dependencies -> hotspots -> final ranked findings.",
]
This is an important pattern:
  • the system prompt defines the global audit standard
  • prepare() defines the local current-step framing
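The split can be sketched as a plain function. This is not QitOS's `prepare()` signature; `AuditStep` and `WORKSPACE` are stand-ins for the real state schema and workspace constant, and the point is only that the per-step framing is derived from state, not hard-coded into the system prompt.

```python
from dataclasses import dataclass

@dataclass
class AuditStep:
    """Stand-in for the slice of agent state that prepare() reads."""
    task: str
    current_step: int
    max_steps: int

WORKSPACE = "/tmp/audit-target"  # placeholder path for this sketch

def frame_step(state: AuditStep) -> str:
    """Build the local, per-step framing that prepare() would inject.

    The global audit standard stays in the system prompt; this only
    restates where the current run is and what the next phase should be.
    """
    lines = [
        f"Audit task: {state.task}",
        f"Workspace: {WORKSPACE}",
        f"Step: {state.current_step}/{state.max_steps}",
        "Suggested flow: inventory -> entrypoints -> "
        "sinks/secrets/config/dependencies -> hotspots -> final ranked findings.",
    ]
    return "\n".join(lines)

framing = frame_step(AuditStep(task="audit auth module", current_step=3, max_steps=20))
```

Because the framing is recomputed every step, the model always sees fresh budget information without the system prompt ever changing.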

3. Track findings as first-class state

The audit state is intentionally lean:
@dataclass
class SecurityAuditState(StateSchema):
    scratchpad: list[str] = field(default_factory=list)
    findings: list[str] = field(default_factory=list)
That is the right design for this domain. The next model step does not need:
  • every grep result
  • every file listing
  • every intermediate tool payload
It needs:
  • the recent audit trajectory
  • the strongest candidate findings so far

4. Use reduce to rank and compress evidence

The example extracts only the highest-signal findings from tool output:
if isinstance(first, dict):
    data = (
        first.get("data", {})
        if isinstance(first.get("data", {}), dict)
        else {}
    )
    for item in list(data.get("findings", []) or [])[:3]:
        title = str(item.get("title", "finding"))
        location = f"{item.get('file', '?')}:{item.get('line', '?')}"
        state.findings.append(f"{title} @ {location}")
    if decision.mode == "final":
        state.final_result = str(decision.final_answer or "")
This is the domain-specialized version of the same core lesson: traces keep the raw evidence; state keeps the compact working memory.
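The excerpt above can be read as one standalone helper. This sketch reproduces its logic outside the agent class so it can be tested in isolation; the payload shape (`data.findings` with `title`/`file`/`line` keys) is taken from the excerpt, and the cap of three promoted findings per tool call mirrors the `[:3]` slice.

```python
def promote_findings(observations: list, findings: list[str], cap: int = 3) -> None:
    """Promote at most `cap` tool findings into compact working memory.

    Mirrors the reduce() excerpt: raw tool payloads stay in the trace,
    and only short "title @ file:line" summaries enter state.findings.
    """
    first = observations[0] if observations else None
    if not isinstance(first, dict):
        return
    data = first.get("data", {})
    if not isinstance(data, dict):
        return
    for item in list(data.get("findings", []) or [])[:cap]:
        title = str(item.get("title", "finding"))
        location = f"{item.get('file', '?')}:{item.get('line', '?')}"
        findings.append(f"{title} @ {location}")

state_findings: list[str] = []
promote_findings(
    [{"data": {"findings": [
        {"title": "hardcoded secret", "file": "app/config.py", "line": 12},
        {"title": "eval on user input", "file": "app/views.py", "line": 88},
    ]}}],
    state_findings,
)
```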

5. Use bounded history, not unlimited accumulation

The example runs with:
history_policy=HistoryPolicy(max_messages=14)
That is a strong default for this audit:
  • enough room for recent reasoning and evidence
  • not enough room for the model to keep re-reading every old search result verbatim
If you extend this into a much larger audit, the next upgrade is usually CompactHistory, not unbounded history.

6. Choose memory only if it changes the audit outcome

The example does not attach a separate memory adapter. That is correct for a short tutorial audit because:
  • findings already acts as compact state memory
  • qita preserves the full trace for later review
  • a separate retrieval layer would add complexity without improving the lesson
Add memory when the agent needs one of these:
  • durable cross-run findings
  • semantic retrieval over previous audits
  • long-lived notes that should not stay in the immediate prompt
In that case, you would choose among:
  • WindowMemory for recent rolling records
  • SummaryMemory for compressed rolling recall
  • VectorMemory for semantic retrieval
  • MarkdownFileMemory for durable, inspectable storage
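To show what the simplest of these options buys you, here is a rolling-window memory sketch in the spirit of WindowMemory. The class name and methods are illustrative, not QitOS's API; the design point is that a bounded record store gives recent recall without growing the prompt.

```python
from collections import deque

class WindowMemorySketch:
    """Illustrative rolling-window memory, in the spirit of WindowMemory.

    Not the QitOS implementation: just a deque that keeps the last N
    records and renders them for prompt injection.
    """

    def __init__(self, max_records: int = 5) -> None:
        self.records: deque[str] = deque(maxlen=max_records)

    def add(self, record: str) -> None:
        self.records.append(record)

    def render(self) -> str:
        return "\n".join(self.records)

memory = WindowMemorySketch(max_records=3)
for note in ["ran inventory", "found config secret", "checked deps", "ranked findings"]:
    memory.add(note)
```

Swapping this shape for summary, vector, or file-backed storage changes the recall characteristics without touching the agent loop, which is why memory stays a pluggable choice.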

7. Use qita as a review artifact, not just a debugger

Run:
python examples/real/code_security_audit_agent.py
Inspect:
qita board --logdir runs
In this lesson, qita is doing more than debugging. Use it to inspect:
  • whether the audit started with inventory before jumping to conclusions
  • which findings were promoted into state.findings
  • whether parser diagnostics stayed clean
  • whether context pressure changed the quality of the audit
  • whether the final answer reads like a ranked review, not a dump of raw matches

The final design rule of the course

By the end of lesson 4, the course should make one rule feel obvious: domain logic belongs in:
  • state design
  • prompt policy
  • tool composition
  • reduce() semantics
It does not belong in a separate hidden runtime.

Whitzard and model-native scaffolding

The tutorial example above keeps the most portable path:
  • text-first prompt contract
  • prompt-injected tool schema
  • ReActTextParser
That is still the right place to start. But QitOS is not limited to that pairing. If you open examples/real/whitzard_agent.py, you can see the next design idea in the course: model and scaffolding should sometimes be designed together. Whitzard makes this concrete because, in practice, not all models share the same native tool-call format. MiniMax is a good example: depending on provider and training prior, it often emits native XML-like tool calls such as:
<minimax:tool_call>
  <invoke name="send_terminal_keys">
    <parameter name="keystrokes">pwd</parameter>
    <parameter name="submit">true</parameter>
  </invoke>
</minimax:tool_call>
If you keep forcing a pure JSON contract in that setting, the model may spend extra effort fighting its own native habits. The agent can still work, but the fit is worse. That is why QitOS treats these choices as a coordinated protocol decision rather than as isolated knobs:
  • parser
  • tool schema style
  • output contract
  • repair path
With Whitzard, users can keep the default model-native path and let QitOS choose a MiniMax-oriented protocol, or they can explicitly choose a different scaffolding shape when they want stricter control. For example:
  • keep the model-native protocol when MiniMax tool calls are the most reliable output
  • switch to a Terminus XML-style contract when you want a more explicit XML parser and XML-oriented tool schema
  • switch to a Terminus JSON-style contract when your model follows JSON contracts reliably enough to justify the stricter shape
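To ground the model-native option, here is a standalone sketch of extracting a MiniMax-style tool call like the one shown above. QitOS's protocol layer also handles schema rendering, repair, and diagnostics; this uses a regex rather than an XML parser because the `minimax:` prefix is not a declared XML namespace, so strict parsers would reject the payload as-is.

```python
import re

def parse_minimax_tool_call(text: str) -> dict:
    """Extract the tool name and parameters from a MiniMax-style tool call.

    Illustrative only: the real protocol layer coordinates parser,
    schema style, contract, and repair path together.
    """
    invoke = re.search(r'<invoke name="([^"]+)">(.*?)</invoke>', text, re.DOTALL)
    if not invoke:
        return {"mode": "error", "diagnostic": "no <invoke> block found"}
    params = dict(
        re.findall(r'<parameter name="([^"]+)">(.*?)</parameter>',
                   invoke.group(2), re.DOTALL)
    )
    return {"mode": "tool", "tool": invoke.group(1), "arguments": params}

call = """<minimax:tool_call>
  <invoke name="send_terminal_keys">
    <parameter name="keystrokes">pwd</parameter>
    <parameter name="submit">true</parameter>
  </invoke>
</minimax:tool_call>"""
decision = parse_minimax_tool_call(call)
```

When the model's native output already matches this shape, accepting it directly is cheaper and more reliable than forcing a round-trip through a JSON contract.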
In the example, this is exposed through the protocol layer rather than through a custom runtime rewrite. Whitzard still uses the same kernel ideas you have learned in the course:
  • state
  • prepare()
  • reduce()
  • tool composition
  • qita traces
What changes is the interaction protocol. That is the important design lesson. QitOS makes this easy because protocol-aware scaffolding is built into the framework:
  • the model profile can select a default protocol automatically
  • the tool schema renderer can match that protocol
  • the parser can match that schema
  • parser diagnostics and repair flow still go through the same observability stack
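The coordination idea can be sketched as a simple lookup. Every name here (`Protocol`, the profile keys, the parser and schema labels) is hypothetical, not QitOS's actual API; the sketch only shows the design shape the bullets describe, where a model profile selects one coherent bundle instead of three independent knobs.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Protocol:
    """One coordinated bundle: parser, schema style, and output contract."""
    parser: str
    schema_style: str
    contract: str

# Hypothetical defaults keyed by model profile.
DEFAULT_PROTOCOLS = {
    "minimax": Protocol("minimax_xml_parser", "xml_schema",
                        "model-native XML tool calls"),
    "generic-text": Protocol("react_text_parser", "prompt_injected",
                             "text ReAct"),
}

def select_protocol(model_profile: str) -> Protocol:
    """Pick the coordinated bundle for a profile, falling back to text ReAct."""
    return DEFAULT_PROTOCOLS.get(model_profile, DEFAULT_PROTOCOLS["generic-text"])

proto = select_protocol("minimax")
```

Keeping the three choices in one record is what prevents the mismatch the lesson warns about, such as an XML-native model paired with a JSON schema renderer.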
So the point of Whitzard is not only that it is a stronger audit agent. It also teaches a broader idea: when a model has a strong native tool-calling prior, you often get a better agent by adapting the scaffolding to the model instead of forcing every model through the same contract. That is one of the reasons QitOS keeps parser choice, tool schema choice, and prompt/protocol choice explicit and composable.

Full example

The full runnable lesson lives at examples/real/code_security_audit_agent.py.

Where to go next

Build your own agent

Use the full design worksheet from the course to design your own AgentModule.

Kit reference

Look up parsers, prompts, toolsets, memory, and history helpers used across the course.

Observability

Deepen your qita workflow for replay, export, and research-grade sharing.

Benchmarks overview

Apply the same kernel to GAIA, Tau-Bench, and CyBench.