Tutorial: Replay and Inspect a Failed Run

This tutorial starts after a run already exists. The question is no longer whether it finished. The question is why it behaved this way, and what changed between two runs.

Step 1: open the board

qita board --logdir ./runs

The board is the fastest way to see:

stop reason
step count
event count
token usage
parser warnings
official-run and replay metadata

Step 2: open one failed run

Pick a run that stopped with stop_reason=max_steps, ended in an exception, or shows obvious parser trouble. Then open it:

qita replay --run ./runs/<run_id>

In the run overview, check these first:

official run
replay mode
git SHA
package
seed
prompt protocol
parser

These fields tell you whether the run is comparable with other runs before you read any step content.
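The comparability check can be sketched in a few lines of Python. This is an illustration only, not QitOS code: the dictionary keys are hypothetical stand-ins for the overview fields listed above, and how you load them from a run is left out.

```python
# Hypothetical field names mirroring the run overview; not a QitOS schema.
COMPARABILITY_FIELDS = (
    "official_run", "replay_mode", "git_sha", "package",
    "seed", "prompt_protocol", "parser",
)


def comparability_mismatches(left: dict, right: dict) -> list[str]:
    """Return the overview fields whose values differ between two runs."""
    return [
        field
        for field in COMPARABILITY_FIELDS
        if left.get(field) != right.get(field)
    ]


# Made-up metadata for two runs.
run_a = {"git_sha": "abc123", "seed": 7, "parser": "json"}
run_b = {"git_sha": "abc123", "seed": 11, "parser": "json"}
print(comparability_mismatches(run_a, run_b))
```

An empty result suggests the two runs were produced under the same conditions, so differences in behavior are more likely to come from the model or the environment than from setup.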

Step 3: inspect parser and context telemetry

In the run page, look for:

parser diagnostics
context occupancy timeline
compaction markers
model response summaries

This usually tells you whether the failure came from:

a protocol mismatch
poor tool choice
context saturation
benchmark setup failure
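As one illustration of reading a context occupancy timeline, the sketch below finds the first step at which occupancy crosses a threshold. The data shape (a list of per-step token counts), the context limit, and the 0.9 threshold are all assumptions made for this example; QitOS may expose this telemetry differently.

```python
from typing import Optional


def first_saturated_step(
    tokens_per_step: list[int],
    context_limit: int,
    threshold: float = 0.9,
) -> Optional[int]:
    """Return the 0-based index of the first step whose occupancy reaches
    `threshold` of `context_limit`, or None if it never does."""
    if context_limit <= 0:
        raise ValueError("context_limit must be positive")
    for step, used in enumerate(tokens_per_step):
        if used / context_limit >= threshold:
            return step
    return None


# Made-up occupancy values for a run with an assumed 8000-token window.
timeline = [1200, 3400, 6100, 7300, 7900]
print(first_saturated_step(timeline, context_limit=8000))
```

If the first saturated step comes shortly before the first failure step, context saturation is a plausible cause worth checking against the compaction markers.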

Step 4: compare two runs

Use the board compare controls or open the diff route directly:

/compare?left=RUN_A&right=RUN_B
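If you build compare links in a script, URL-encoding the run IDs avoids broken links when an ID contains special characters. A minimal sketch, assuming the board is served at a hypothetical local address; only the /compare route and its left and right parameters come from this tutorial.

```python
from urllib.parse import urlencode

# Assumed base address for illustration; use wherever your board is served.
BOARD_BASE_URL = "http://localhost:8000"


def compare_url(left_run: str, right_run: str, base: str = BOARD_BASE_URL) -> str:
    """Build a link to the diff route for two run IDs."""
    query = urlencode({"left": left_run, "right": right_run})
    return f"{base.rstrip('/')}/compare?{query}"


print(compare_url("RUN_A", "RUN_B"))
```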

The v0.3 diff view focuses on the highest-signal fields:

stop reason
final result
step count
event count
token usage
latency
cost
parser diagnostics
first failure step
run config diff

This is the fastest way to answer “what actually changed?”
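To make the idea of a run config diff concrete, here is a small sketch that flattens two nested config dictionaries and reports the keys whose values differ. It is not the diff view's implementation, and the config keys are invented for the example.

```python
_MISSING = object()  # sentinel so a missing key differs from an explicit None


def flatten(config: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into dotted keys, e.g. {"a": {"b": 1}} -> {"a.b": 1}."""
    flat = {}
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def config_diff(left: dict, right: dict) -> dict:
    """Map each differing dotted key to its (left, right) values.
    A key absent on one side is reported as None on that side."""
    flat_left, flat_right = flatten(left), flatten(right)
    return {
        key: (flat_left.get(key), flat_right.get(key))
        for key in sorted(flat_left.keys() | flat_right.keys())
        if flat_left.get(key, _MISSING) != flat_right.get(key, _MISSING)
    }


# Hypothetical configs for two runs.
config_a = {"model": {"name": "m1", "temperature": 0.0}, "max_steps": 20}
config_b = {"model": {"name": "m1", "temperature": 0.7}, "max_steps": 30}
print(config_diff(config_a, config_b))
```

Flattening to dotted keys keeps the output short: unchanged settings drop out and only the differing paths remain.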

Step 5: export what matters

When you need to share a failure with a collaborator:

qit bench export --run ./runs/<run_id> --html ./reports/failed_run.html

This keeps the investigation tied to the same trace artifact (a persistent output file produced by a run) instead of screenshots or hand-written notes.

Best-effort replay reminder

Replay in QitOS is currently best effort. It is strong enough for:

research debugging
benchmark review
prompt/parser regression analysis
artifact sharing

It does not guarantee that a remote provider or external environment will reproduce identical tokens on every replay.
