- what model and parser were used?
- what tool surface was exposed?
- what benchmark split was this?
- can I replay the trajectory?
- can I diff this run against the previous one?
These are the questions that come up on any other review path, too.
The point is not bureaucracy. The point is that prompt work, parser work, tool work, and benchmark work become much easier to trust once every run is exportable, replayable, and comparable.
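One way to make runs exportable and comparable is to record each one in a small manifest. A minimal sketch of that idea follows; the names here (`RunManifest`, `diff_runs`, the example field values) are illustrative assumptions, not an API from the source.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class RunManifest:
    """Everything needed to answer the review questions for one run."""
    model: str             # which model produced the trajectory
    parser: str            # which output parser interpreted it
    tools: tuple           # the tool surface exposed to the agent
    benchmark_split: str   # which benchmark split was evaluated
    trajectory_path: str   # where the replayable trajectory is stored

    def export(self) -> str:
        # Serialize to JSON so the run can be archived and replayed later.
        return json.dumps(asdict(self), sort_keys=True)


def diff_runs(a: RunManifest, b: RunManifest) -> dict:
    """Return only the fields that changed between two runs."""
    da, db = asdict(a), asdict(b)
    return {k: (da[k], db[k]) for k in da if da[k] != db[k]}


# Hypothetical usage: two runs that differ only in the parser.
run1 = RunManifest("model-a", "json", ("search",), "dev", "runs/1.json")
run2 = RunManifest("model-a", "xml", ("search",), "dev", "runs/2.json")
changed = diff_runs(run1, run2)
```

With a manifest like this, "can I diff this run against the previous one?" becomes a one-line function call rather than an archaeology exercise.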
If your team wants to move fast on agent research, reproducible runs are not a side feature. They are the memory of the project.