`examples/real/code_security_audit_agent.py`
What changes from lesson 3
| Branch | Claude Code-style lesson | Security audit lesson |
|---|---|---|
| Goal | Modify code and verify a patch | Inspect code, collect evidence, and rank findings |
| Tool surface | General coding preset | Security audit tools + codebase tools + task board |
| Prompt policy | Coding workflow discipline | Audit protocol and evidence discipline |
| State | Todos and mode | Scratchpad and ranked findings |
| Success condition | Passing verification command | High-signal final audit report |
| qita usage | Debugging long-running behavior | Producing a review artifact |
The system prompt now teaches an audit protocol
The lesson uses `SECURITY_AUDIT_SYSTEM_PROMPT`, which encodes the audit protocol directly in the system prompt. That is the next step in the prompt-policy progression the course has built:
- lesson 1: parser contract
- lesson 2: planner versus executor contracts
- lesson 3: workflow discipline
- lesson 4: domain judgment protocol
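The exact prompt text lives in the example file; the sketch below is illustrative only (the constant name is real, but this wording is invented) and shows the shape of a domain judgment protocol, with a small hypothetical helper for appending run-specific rules beneath the global standard:

```python
# Illustrative sketch: the real SECURITY_AUDIT_SYSTEM_PROMPT lives in
# examples/real/code_security_audit_agent.py; this wording is invented.
SECURITY_AUDIT_SYSTEM_PROMPT = """\
You are a security audit agent.
Protocol:
1. Inventory the codebase before forming hypotheses.
2. Collect concrete evidence (file, line, snippet) for every suspicion.
3. Rank findings by severity and confidence.
4. Report only findings backed by evidence; never guess line numbers.
"""

def render_system_prompt(extra_rules=()):
    """Hypothetical helper: append run-specific rules under the global standard."""
    rules = "\n".join(f"- {r}" for r in extra_rules)
    return SECURITY_AUDIT_SYSTEM_PROMPT + ("\nRun rules:\n" + rules if rules else "")
```

The design point is that the protocol is global and stable, while run-specific constraints stay additive rather than rewriting the standard.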
The parser and harness stay stable
The audit agent still uses the text-first parser and harness from the earlier lessons. Swap the contract only when the situation calls for it:
- use JSON/XML contracts when you require stricter machine-readable outputs
- use a native tool-call parser when your provider produces structured tool calls more reliably than text
- use Terminus-style protocols when the agent is controlling an interactive terminal rather than calling repository tools directly
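To make the "stricter machine-readable contract" option concrete, here is a minimal sketch of a JSON tool-call contract with a repair path. The function name and shape are hypothetical, not a QitOS API:

```python
import json
import re

def parse_tool_call(raw: str):
    """Parse a strict JSON tool-call contract; fall back to a loose repair path.

    Hypothetical helper for illustration -- not a QitOS API.
    """
    try:
        call = json.loads(raw)
        if isinstance(call, dict) and "tool" in call:
            return call
    except json.JSONDecodeError:
        pass
    # Repair path: pull the first {...} block out of surrounding prose.
    m = re.search(r"\{.*\}", raw, re.DOTALL)
    if m:
        try:
            return json.loads(m.group(0))
        except json.JSONDecodeError:
            return None
    return None
```

The repair path is the part that matters: a strict contract without a repair path turns every formatting slip into a failed step.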
Compose the tool surface by domain
The lesson combines three tool families. This is the capstone lesson in tool composition, and the tool surface now has layers:
- domain reasoning tools from `SecurityAuditToolSet`
- low-level repository inspection from the codebase profile of `CodingToolSet`
- explicit progress tracking from `TaskToolSet`
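The layering above can be sketched as a merge with collision checking. The dictionaries here are stand-ins for the real toolset classes (which expose richer tool objects); `compose_tools` is a hypothetical helper:

```python
def compose_tools(*families: dict) -> dict:
    """Merge tool families into one surface; duplicate names are an error."""
    surface = {}
    for family in families:
        overlap = surface.keys() & family.keys()
        if overlap:
            raise ValueError(f"duplicate tool names: {sorted(overlap)}")
        surface.update(family)
    return surface

# Dictionary stand-ins for SecurityAuditToolSet / CodingToolSet / TaskToolSet.
security_tools = {"taint_trace": lambda *_: "...", "dependency_scan": lambda *_: "..."}
codebase_tools = {"grep": lambda *_: "...", "read_file": lambda *_: "..."}
task_tools = {"update_board": lambda *_: "..."}

tool_surface = compose_tools(security_tools, codebase_tools, task_tools)
```

Failing loudly on name collisions keeps the composed surface predictable: the model should never see two tools with the same name and different behavior.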
Encode the audit method in prompt plus prepare
The prompt provides the audit discipline, and `prepare()` makes the run stage explicit. This is an important pattern:
- the system prompt defines the global audit standard
- `prepare()` defines the local current-step framing
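A minimal sketch of the split, assuming a plain-dict state and message list (the real `prepare()` signature in QitOS may differ):

```python
def prepare(state: dict, system_prompt: str) -> list[dict]:
    """Assemble the per-step messages: global standard + local framing.

    Sketch of the pattern only, not the real QitOS signature.
    """
    stage = state.get("stage", "inventory")
    framing = f"Current stage: {stage}. Findings so far: {len(state.get('findings', []))}."
    return [
        {"role": "system", "content": system_prompt},  # global audit standard
        {"role": "user", "content": framing},          # local current-step framing
    ]
```

The global standard never changes mid-run; only the framing line moves as the audit advances through stages.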
Track findings as first-class state
The audit state is intentionally lean. That is the right design for this domain. The next model step does not need:
- every grep result
- every file listing
- every intermediate tool payload
It does need:
- the recent audit trajectory
- the strongest candidate findings so far
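A lean state like the one described can be sketched with two fields and nothing else. The class and field names here are hypothetical stand-ins for the example's real state:

```python
from dataclasses import dataclass, field

@dataclass
class Finding:
    severity: str   # e.g. "high" | "medium" | "low"
    file: str
    line: int
    evidence: str   # the concrete snippet backing the finding

@dataclass
class AuditState:
    """Lean working memory: recent trajectory + strongest findings, nothing else."""
    scratchpad: list = field(default_factory=list)  # recent audit trajectory
    findings: list = field(default_factory=list)    # strongest candidates so far
```

Everything else (raw grep output, file listings, intermediate payloads) stays in the trace, not in the state.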
Use reduce to rank and compress evidence
The example extracts only the highest-signal findings from tool output. This is the domain-specialized version of the same core lesson:
- traces keep the raw evidence
- state keeps the compact working memory
Use bounded history, not unlimited accumulation
The example runs with a bounded `CompactHistory`. That is a strong default for this audit:
- enough room for recent reasoning and evidence
- not enough room for the model to keep re-reading every old search result verbatim
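The idea behind a bounded history can be sketched with a fixed-size window. This is a hedged sketch of the concept only; the real `CompactHistory` may compress older records rather than drop them:

```python
from collections import deque

class BoundedHistory:
    """Sketch of the CompactHistory idea: keep the last N records verbatim,
    drop older ones. (The real class may summarize instead of dropping.)"""

    def __init__(self, max_records: int = 8):
        self.records = deque(maxlen=max_records)  # old entries fall off the left

    def append(self, record: str) -> None:
        self.records.append(record)

    def render(self) -> str:
        return "\n".join(self.records)
```

With `maxlen` set, the model always sees recent reasoning and evidence but can never re-read every old search result verbatim.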
Use `CompactHistory`, not unbounded history.

Choose memory only if it changes the audit outcome
The example does not attach a separate memory adapter. That is correct for a short tutorial audit because:
- `findings` already acts as compact state memory
- `qita` preserves the full trace for later review
- a separate retrieval layer would add complexity without improving the lesson
In a longer-lived audit, reach for a memory adapter when you need:
- durable cross-run findings
- semantic retrieval over previous audits
- long-lived notes that should not stay in the immediate prompt

Then match the adapter to the need:
- `WindowMemory` for recent rolling records
- `SummaryMemory` for compressed rolling recall
- `VectorMemory` for semantic retrieval
- `MarkdownFileMemory` for durable, inspectable storage
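To make "compressed rolling recall" concrete, here is a hedged sketch of the idea behind a summary-style adapter. The class name echoes `SummaryMemory` but the implementation is invented (the real adapter may summarize with a model rather than truncating):

```python
class RollingSummaryMemory:
    """Sketch of compressed rolling recall: keep one short line per record
    instead of the full text. Invented implementation, not the QitOS class."""

    def __init__(self, max_chars: int = 60):
        self.max_chars = max_chars
        self.summaries: list[str] = []

    def add(self, record: str) -> None:
        # Naive compression: first line, truncated. A real adapter would
        # produce an actual summary here.
        first_line = record.splitlines()[0]
        self.summaries.append(first_line[: self.max_chars])

    def recall(self) -> str:
        return "; ".join(self.summaries)
```

The shape is what matters: `add()` pays the compression cost once, so `recall()` stays cheap no matter how long the run gets.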
Use qita as a review artifact, not just a debugger
Run the example, then inspect the resulting trace. In this lesson, `qita` is doing more than debugging. Use it to inspect:
- whether the audit started with inventory before jumping to conclusions
- which findings were promoted into `state.findings`
- whether parser diagnostics stayed clean
- whether context pressure changed the quality of the audit
- whether the final answer reads like a ranked review, not a dump of raw matches
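The first check above can even be automated over a simplified trace. The event shape below is invented for illustration; the real qita trace schema is richer:

```python
def audit_started_with_inventory(trace: list) -> bool:
    """Did an inventory-style tool call precede the first finding?

    Operates on a simplified, hypothetical event shape -- not the qita schema.
    """
    inventory_tools = {"ls", "list_dir", "read_file"}
    for event in trace:
        if event["type"] == "tool_call" and event["tool"] in inventory_tools:
            return True
        if event["type"] == "finding":
            return False  # a finding appeared before any inventory work
    return False
```

Checks like this turn a trace from a debugging aid into a reviewable artifact: the audit's discipline becomes something you can assert, not just eyeball.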
The final design rule of the course
By the end of lesson 4, the course should make one rule feel obvious: domain logic belongs in:
- state design
- prompt policy
- tool composition
- `reduce()` semantics
Whitzard and model-native scaffolding
The tutorial example above keeps the most portable path:
- text-first prompt contract
- prompt-injected tool schema
- `ReActTextParser`
In `examples/real/whitzard_agent.py`, you can see the next design idea in the course:
model and scaffolding should sometimes be designed together.
Whitzard is useful because it makes this concrete. In practice, some models do not naturally prefer the same tool-call format. MiniMax is a good example: depending on provider and training prior, it often emits native XML-like tool calls rather than plain ReAct text. Matching such a model well means aligning the whole stack:
- parser
- tool schema style
- output contract
- repair path
With Whitzard, users can keep the default model-native path and let QitOS choose a MiniMax-oriented protocol, or they can explicitly choose a different scaffolding shape when they want stricter control. For example:
- keep the model-native protocol when MiniMax tool calls are the most reliable output
- switch to a Terminus XML-style contract when you want a more explicit XML parser and XML-oriented tool schema
- switch to a Terminus JSON-style contract when your model follows JSON contracts reliably enough to justify the stricter shape
Whitzard still uses the same kernel ideas you have learned in the course:
- state
- `prepare()` and `reduce()`
- tool composition
- `qita` traces
What changes is the protocol layer:
- the model profile can select a default protocol automatically
- the tool schema renderer can match that protocol
- the parser can match that schema
- parser diagnostics and repair flow still go through the same observability stack
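The automatic-selection-with-override idea can be sketched in a few lines. The mapping and function are hypothetical; the real selection logic lives in QitOS model profiles:

```python
# Hypothetical defaults: the real logic lives in QitOS model profiles.
DEFAULT_PROTOCOLS = {
    "minimax": "native-xml",   # models with a native tool-call prior
    "default": "react-text",   # the portable text-first path
}

def select_protocol(model_name: str, override: str = None) -> str:
    """Pick a tool-call protocol from the model's prior, unless overridden."""
    if override:
        return override  # explicit user choice always wins
    for prefix, protocol in DEFAULT_PROTOCOLS.items():
        if model_name.lower().startswith(prefix):
            return protocol
    return DEFAULT_PROTOCOLS["default"]
```

The override parameter is the "stricter control" path: the default follows the model's prior, but the user can always force a Terminus-style contract instead.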
The point of Whitzard is not only that it is a stronger audit agent. It also teaches a broader idea:
when a model has a strong native tool-calling prior, you often get a better agent by adapting the scaffolding to the model instead of forcing every model through the same contract.
That is one of the reasons QitOS keeps parser choice, tool schema choice, and prompt/protocol choice explicit and composable.
Full example
The full runnable lesson lives at `examples/real/code_security_audit_agent.py`.

Where to go next
Build your own agent
Use the full design worksheet from the course to design your own `AgentModule`.
Kit reference
Look up parsers, prompts, toolsets, memory, and history helpers used across the course.
Observability
Deepen your qita workflow for replay, export, and research-grade sharing.
Benchmarks overview
Apply the same kernel to GAIA, Tau-Bench, and CyBench.
