This final lesson answers the hardest question in the course: can the same kernel power a serious domain agent without collapsing into custom orchestration? The answer in QitOS is yes, but only if you put domain logic in the right places. You will study examples/real/code_security_audit_agent.py.

What changes from lesson 3

| Branch | Claude Code-style lesson | Security audit lesson |
| --- | --- | --- |
| Goal | Modify code and verify a patch | Inspect code, collect evidence, and rank findings |
| Tool surface | General coding preset | Security audit tools + codebase tools + task board |
| Prompt policy | Coding workflow discipline | Audit protocol and evidence discipline |
| State | Todos and mode | Scratchpad and ranked findings |
| Success condition | Passing verification command | High-signal final audit report |
| qita usage | Debugging long-running behavior | Producing a review artifact |

The system prompt now teaches an audit protocol

The lesson uses SECURITY_AUDIT_SYSTEM_PROMPT, which says things like:
Primary objective:
- Audit the repository for meaningful security risk, not just keyword matches.
- Use tools to collect evidence before making strong claims.

Judgment rules:
- Treat tool output as evidence, not proof.
- Separate results into:
  1. confirmed issue
  2. high-value lead
  3. human review needed
- Prefer a small number of high-signal findings over a long noisy list.
This is the final step in the course’s prompt design ladder:
  • lesson 1: parser contract
  • lesson 2: planner versus executor contracts
  • lesson 3: workflow discipline
  • lesson 4: domain judgment protocol
The runtime still has not changed.

The parser and harness stay stable

The audit agent still uses:
model_parser=ReActTextParser()
and a text-first OpenAI-compatible model harness. That stability is important. It shows that domain specialization does not require a new protocol by default. You should only move to a model-specific harness if the domain actually benefits from it. For example:
  • use JSON/XML contracts when you require stricter machine-readable outputs
  • use a native tool-call parser when your provider produces structured tool calls more reliably than text
  • use Terminus-style protocols when the agent is controlling an interactive terminal rather than calling repository tools directly
QitOS supports those options, but the default research path remains provider-agnostic text ReAct.
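To make the text-first contract concrete, here is a standalone sketch of the kind of parsing a ReAct text parser performs. This is illustrative only: `parse_react_text` is a hypothetical helper, not QitOS's actual `ReActTextParser`, but it shows why the contract stays provider-agnostic, since any model that can emit `Action:` / `Action Input:` / `Final Answer:` lines can drive it.

```python
import json
import re

def parse_react_text(text: str) -> dict:
    """Parse a minimal ReAct-style reply into a decision dict.

    Illustrative sketch: QitOS's ReActTextParser has its own contract
    and repair path; this only shows the shape of a text-first protocol.
    """
    final = re.search(r"Final Answer:\s*(.*)", text, re.DOTALL)
    if final:
        return {"mode": "final", "final_answer": final.group(1).strip()}
    action = re.search(r"Action:\s*(\S+)", text)
    args = re.search(r"Action Input:\s*(\{.*?\})", text, re.DOTALL)
    if not action:
        return {"mode": "error", "diagnostic": "no Action or Final Answer found"}
    return {
        "mode": "tool",
        "tool": action.group(1),
        "arguments": json.loads(args.group(1)) if args else {},
    }

reply = """Thought: I should list the repository first.
Action: list_files
Action Input: {"path": "."}"""
decision = parse_react_text(reply)
```

Because the contract is plain text, switching providers changes nothing in the parser, which is exactly the stability the lesson is pointing at.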

1. Compose the tool surface by domain

The lesson combines three tool families:
super().__init__(
    toolset=[
        SecurityAuditToolSet(
            workspace_root=workspace_root,
            include_external=False,
            max_matches=80,
        ),
        CodingToolSet(
            workspace_root=workspace_root,
            include_notebook=False,
            enable_lsp=False,
            enable_tasks=False,
            enable_web=False,
            expose_legacy_aliases=True,
            expose_modern_names=False,
            profile="codebase",
        ),
        TaskToolSet(workspace_root=workspace_root),
    ],
    llm=llm,
    model_parser=ReActTextParser(),
)
This is the capstone lesson in tool composition. The tool surface now has layers:
  • domain reasoning tools from SecurityAuditToolSet
  • low-level repository inspection from the codebase profile of CodingToolSet
  • explicit progress tracking from TaskToolSet
This is how QitOS wants you to specialize agents: by composing the right environment, not by writing a new loop.

2. Encode the audit method in prompt plus prepare

The prompt provides the audit discipline, and prepare() makes the run stage explicit:
lines = [
    f"Audit task: {state.task}",
    f"Workspace: {WORKSPACE}",
    f"Step: {state.current_step}/{state.max_steps}",
    "Suggested flow: inventory -> entrypoints -> sinks/secrets/config/dependencies -> hotspots -> final ranked findings.",
]
This is an important pattern:
  • the system prompt defines the global audit standard
  • prepare() defines the local current-step framing
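The split can be sketched as a plain function. This is not QitOS's `prepare()` signature; `AuditStep` and `WORKSPACE` are stand-ins for the real state schema and workspace constant, and the point is only that the per-step framing is derived from state, not hard-coded into the system prompt.

```python
from dataclasses import dataclass

@dataclass
class AuditStep:
    """Stand-in for the slice of agent state that prepare() reads."""
    task: str
    current_step: int
    max_steps: int

WORKSPACE = "/tmp/audit-target"  # placeholder path for this sketch

def frame_step(state: AuditStep) -> str:
    """Build the local, per-step framing that prepare() would inject.

    The global audit standard stays in the system prompt; this only
    restates where the current run is and what the next phase should be.
    """
    lines = [
        f"Audit task: {state.task}",
        f"Workspace: {WORKSPACE}",
        f"Step: {state.current_step}/{state.max_steps}",
        "Suggested flow: inventory -> entrypoints -> "
        "sinks/secrets/config/dependencies -> hotspots -> final ranked findings.",
    ]
    return "\n".join(lines)

framing = frame_step(AuditStep(task="audit auth module", current_step=3, max_steps=20))
```

Because the framing is recomputed every step, the model always sees fresh budget information without the system prompt ever changing.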

3. Track findings as first-class state

The audit state is intentionally lean:
@dataclass
class SecurityAuditState(StateSchema):
    scratchpad: list[str] = field(default_factory=list)
    findings: list[str] = field(default_factory=list)
That is the right design for this domain. The next model step does not need:
  • every grep result
  • every file listing
  • every intermediate tool payload
It needs:
  • the recent audit trajectory
  • the strongest candidate findings so far

4. Use reduce to rank and compress evidence

The example extracts only the highest-signal findings from tool output:
if isinstance(first, dict):
    data = (
        first.get("data", {})
        if isinstance(first.get("data", {}), dict)
        else {}
    )
    for item in list(data.get("findings", []) or [])[:3]:
        title = str(item.get("title", "finding"))
        location = f"{item.get('file', '?')}:{item.get('line', '?')}"
        state.findings.append(f"{title} @ {location}")
    if decision.mode == "final":
        state.final_result = str(decision.final_answer or "")
This is the domain-specialized version of the same core lesson: traces keep the raw evidence; state keeps the compact working memory.
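The excerpt above can be read as one standalone helper. This sketch reproduces its logic outside the agent class so it can be tested in isolation; the payload shape (`data.findings` with `title`/`file`/`line` keys) is taken from the excerpt, and the cap of three promoted findings per tool call mirrors the `[:3]` slice.

```python
def promote_findings(observations: list, findings: list[str], cap: int = 3) -> None:
    """Promote at most `cap` tool findings into compact working memory.

    Mirrors the reduce() excerpt: raw tool payloads stay in the trace,
    and only short "title @ file:line" summaries enter state.findings.
    """
    first = observations[0] if observations else None
    if not isinstance(first, dict):
        return
    data = first.get("data", {})
    if not isinstance(data, dict):
        return
    for item in list(data.get("findings", []) or [])[:cap]:
        title = str(item.get("title", "finding"))
        location = f"{item.get('file', '?')}:{item.get('line', '?')}"
        findings.append(f"{title} @ {location}")

state_findings: list[str] = []
promote_findings(
    [{"data": {"findings": [
        {"title": "hardcoded secret", "file": "app/config.py", "line": 12},
        {"title": "eval on user input", "file": "app/views.py", "line": 88},
    ]}}],
    state_findings,
)
```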

5. Use bounded history, not unlimited accumulation

The example runs with:
history_policy=HistoryPolicy(max_messages=14)
That is a strong default for this audit:
  • enough room for recent reasoning and evidence
  • not enough room for the model to keep re-reading every old search result verbatim
If you extend this into a much larger audit, the next upgrade is usually CompactHistory, not unbounded history.

6. Choose memory only if it changes the audit outcome

The example does not attach a separate memory adapter. That is correct for a short tutorial audit because:
  • findings already acts as compact state memory
  • qita preserves the full trace for later review
  • a separate retrieval layer would add complexity without improving the lesson
Add memory when the agent needs one of these:
  • durable cross-run findings
  • semantic retrieval over previous audits
  • long-lived notes that should not stay in the immediate prompt
In that case, you would choose among:
  • WindowMemory for recent rolling records
  • SummaryMemory for compressed rolling recall
  • VectorMemory for semantic retrieval
  • MarkdownFileMemory for durable, inspectable storage
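To show what the simplest of these options buys you, here is a rolling-window memory sketch in the spirit of WindowMemory. The class name and methods are illustrative, not QitOS's API; the design point is that a bounded record store gives recent recall without growing the prompt.

```python
from collections import deque

class WindowMemorySketch:
    """Illustrative rolling-window memory, in the spirit of WindowMemory.

    Not the QitOS implementation: just a deque that keeps the last N
    records and renders them for prompt injection.
    """

    def __init__(self, max_records: int = 5) -> None:
        self.records: deque[str] = deque(maxlen=max_records)

    def add(self, record: str) -> None:
        self.records.append(record)

    def render(self) -> str:
        return "\n".join(self.records)

memory = WindowMemorySketch(max_records=3)
for note in ["ran inventory", "found config secret", "checked deps", "ranked findings"]:
    memory.add(note)
```

Swapping this shape for summary, vector, or file-backed storage changes the recall characteristics without touching the agent loop, which is why memory stays a pluggable choice.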

7. Use qita as a review artifact, not just a debugger

Run:
python examples/real/code_security_audit_agent.py
Inspect:
qita board --logdir runs
In this lesson, qita is doing more than debugging. Use it to inspect:
  • whether the audit started with inventory before jumping to conclusions
  • which findings were promoted into state.findings
  • whether parser diagnostics stayed clean
  • whether context pressure changed the quality of the audit
  • whether the final answer reads like a ranked review, not a dump of raw matches

The final design rule of the course

By the end of lesson 4, the course should make one rule feel obvious: domain logic belongs in:
  • state design
  • prompt policy
  • tool composition
  • reduce() semantics
It does not belong in a separate hidden runtime.

Whitzard and model-native scaffolding

The tutorial example above keeps the most portable path:
  • text-first prompt contract
  • prompt-injected tool schema
  • ReActTextParser
That is still the right place to start. But QitOS is not limited to that pairing. If you open examples/real/whitzard_agent.py, you can see the next design idea in the course: model and scaffolding should sometimes be designed together. Whitzard makes this concrete because, in practice, not all models share the same native tool-call format. MiniMax is a good example: depending on provider and training prior, it often emits native XML-like tool calls such as:
<minimax:tool_call>
  <invoke name="send_terminal_keys">
    <parameter name="keystrokes">pwd</parameter>
    <parameter name="submit">true</parameter>
  </invoke>
</minimax:tool_call>
If you keep forcing a pure JSON contract in that setting, the model may spend extra effort fighting its own native habits. The agent can still work, but the fit is worse. That is why QitOS treats these choices as a coordinated protocol decision rather than as isolated knobs:
  • parser
  • tool schema style
  • output contract
  • repair path
With Whitzard, users can keep the default model-native path and let QitOS choose a MiniMax-oriented protocol, or they can explicitly choose a different scaffolding shape when they want stricter control. For example:
  • keep the model-native protocol when MiniMax tool calls are the most reliable output
  • switch to a Terminus XML-style contract when you want a more explicit XML parser and XML-oriented tool schema
  • switch to a Terminus JSON-style contract when your model follows JSON contracts reliably enough to justify the stricter shape
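To ground the model-native option, here is a standalone sketch of extracting a MiniMax-style tool call like the one shown above. QitOS's protocol layer also handles schema rendering, repair, and diagnostics; this uses a regex rather than an XML parser because the `minimax:` prefix is not a declared XML namespace, so strict parsers would reject the payload as-is.

```python
import re

def parse_minimax_tool_call(text: str) -> dict:
    """Extract the tool name and parameters from a MiniMax-style tool call.

    Illustrative only: the real protocol layer coordinates parser,
    schema style, contract, and repair path together.
    """
    invoke = re.search(r'<invoke name="([^"]+)">(.*?)</invoke>', text, re.DOTALL)
    if not invoke:
        return {"mode": "error", "diagnostic": "no <invoke> block found"}
    params = dict(
        re.findall(r'<parameter name="([^"]+)">(.*?)</parameter>',
                   invoke.group(2), re.DOTALL)
    )
    return {"mode": "tool", "tool": invoke.group(1), "arguments": params}

call = """<minimax:tool_call>
  <invoke name="send_terminal_keys">
    <parameter name="keystrokes">pwd</parameter>
    <parameter name="submit">true</parameter>
  </invoke>
</minimax:tool_call>"""
decision = parse_minimax_tool_call(call)
```

When the model's native output already matches this shape, accepting it directly is cheaper and more reliable than forcing a round-trip through a JSON contract.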
In the example, this is exposed through the protocol layer rather than through a custom runtime rewrite. Whitzard still uses the same kernel ideas you have learned in the course:
  • state
  • prepare()
  • reduce()
  • tool composition
  • qita traces
What changes is the interaction protocol. That is the important design lesson. QitOS makes this easy because protocol-aware scaffolding is built into the framework:
  • the model profile can select a default protocol automatically
  • the tool schema renderer can match that protocol
  • the parser can match that schema
  • parser diagnostics and repair flow still go through the same observability stack
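The coordination idea can be sketched as a simple lookup. Every name here (`Protocol`, the profile keys, the parser and schema labels) is hypothetical, not QitOS's actual API; the sketch only shows the design shape the bullets describe, where a model profile selects one coherent bundle instead of three independent knobs.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Protocol:
    """One coordinated bundle: parser, schema style, and output contract."""
    parser: str
    schema_style: str
    contract: str

# Hypothetical defaults keyed by model profile.
DEFAULT_PROTOCOLS = {
    "minimax": Protocol("minimax_xml_parser", "xml_schema",
                        "model-native XML tool calls"),
    "generic-text": Protocol("react_text_parser", "prompt_injected",
                             "text ReAct"),
}

def select_protocol(model_profile: str) -> Protocol:
    """Pick the coordinated bundle for a profile, falling back to text ReAct."""
    return DEFAULT_PROTOCOLS.get(model_profile, DEFAULT_PROTOCOLS["generic-text"])

proto = select_protocol("minimax")
```

Keeping the three choices in one record is what prevents the mismatch the lesson warns about, such as an XML-native model paired with a JSON schema renderer.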
So the point of Whitzard is not only that it is a stronger audit agent. It also teaches a broader idea: when a model has a strong native tool-calling prior, you often get a better agent by adapting the scaffolding to the model instead of forcing every model through the same contract. That is one of the reasons QitOS keeps parser choice, tool schema choice, and prompt/protocol choice explicit and composable.

Full example

The full runnable lesson lives at examples/real/code_security_audit_agent.py.

Where to go next

Build your own agent

Use the full design worksheet from the course to design your own AgentModule.

Kit reference

Look up parsers, prompts, toolsets, memory, and history helpers used across the course.

Observability

Deepen your qita workflow for replay, export, and research-grade sharing.

Benchmarks overview

Apply the same kernel to GAIA, Tau-Bench, and CyBench.