How to Document a Production Incident While It's Happening

Learn how to capture structured incident documentation in real-time without slowing response. A dedicated scribe role and live logging prevent fuzzy timelines and enable faster, more accurate postmortems.
May 31, 2026
The Doc Holiday Team
The Slack channel is on fire. Alerts are stacking up. The engineer who built the payment service three years ago is the only person who knows where the logs live, and she is currently elbow-deep in the database trying to figure out why transactions stopped processing at 2:47 a.m.

Someone in the thread asks: "Can someone keep a log of what we're trying?"

Nobody answers. Everyone is busy. The log doesn't get kept.

[Image: A person surrounded by chaos asks who is documenting, next to a sticky-note reminder that will never be checked.]
Caption: The log that nobody writes is the one everyone needs three days later.

This is how most incident documentation actually works. Not because teams are negligent, but because the person best positioned to document what's happening is the person needed most to fix it. The cognitive load during an active incident is already at its ceiling. Stress measurably impairs memory encoding, which means the engineers who will later be asked to reconstruct a precise timeline are the same people whose brains were least equipped to form one. The documentation problem isn't a culture problem. It's a physics problem.

The answer isn't to document less. It's to document differently, in a way that doesn't compete with the response.

The Person Who Should Be Documenting Is Also the Person Fixing the Problem

Most incident response frameworks focus almost entirely on resolution. The Google SRE Book's managing incidents chapter is a useful exception: it describes a dedicated "communications" role whose job is to keep the incident document accurate and up to date while the operations team focuses on the system. PagerDuty's incident response documentation calls this role the scribe.

The scribe doesn't fix anything. They narrate.

For major incidents, this separation of responsibilities is the most important structural decision you can make. The incident commander holds the high-level state. The ops lead applies the tools. The scribe keeps the record. When everyone is doing all three at once, all three get done badly.

For smaller incidents where you don't have the headcount to assign a dedicated scribe, the principle still applies: whoever is documenting should be doing only that. Even if it's the same person rotating between roles, the documentation window and the debugging window should not overlap.

Once someone owns the log, the next question is what actually goes in it.

What Actually Needs to Go in the Log

The goal during an active incident isn't literary quality. It's structured capture. Rough notes beat no notes by a wide margin.

The minimum viable record includes:

  • Timestamps for every significant state change: when the incident was detected, when the first engineer joined, when each mitigation was attempted, when service was restored
  • The exact sequence of diagnostic steps and what each one revealed
  • Configuration changes, rollbacks, or deployments made during the response
  • External communications: what was sent to customers or leadership, and when
  • Hypotheses that turned out to be wrong (these matter as much as the ones that were right)

That last item is the one teams consistently skip. When the postmortem happens three days later, everyone remembers the fix. Nobody remembers the four things they tried before the fix, or why they ruled them out. Those dead ends are often where the most useful systemic information lives.
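As a rough illustration, the minimum viable record described above could be captured in a structure like the one below. This is a minimal sketch, not a prescribed schema: the names `EntryKind`, `LogEntry`, `IncidentLog`, and `dead_ends` are all hypothetical, and the example notes are invented for the demonstration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum


class EntryKind(Enum):
    """Hypothetical categories mirroring the five bullets above."""
    STATE_CHANGE = "state_change"    # detected, engineer joined, mitigation, restored
    DIAGNOSTIC = "diagnostic"        # a diagnostic step and what it revealed
    CHANGE = "change"                # config change, rollback, or deployment
    COMMUNICATION = "communication"  # message sent to customers or leadership
    HYPOTHESIS = "hypothesis"        # a theory, including ones later ruled out


@dataclass(frozen=True)
class LogEntry:
    at: datetime
    kind: EntryKind
    author: str
    note: str
    ruled_out: bool = False  # only meaningful for HYPOTHESIS entries


class IncidentLog:
    """Append-only log: entries are added as they happen, never edited."""

    def __init__(self):
        self._entries = []

    def record(self, kind, author, note, ruled_out=False, at=None):
        # Timestamp at write time (UTC) unless the scribe supplies one.
        entry = LogEntry(at or datetime.now(timezone.utc), kind, author, note, ruled_out)
        self._entries.append(entry)
        return entry

    def dead_ends(self):
        # The hypotheses that were tried and ruled out.
        return [e for e in self._entries
                if e.kind is EntryKind.HYPOTHESIS and e.ruled_out]


log = IncidentLog()
log.record(EntryKind.STATE_CHANGE, "scribe", "Incident declared")
log.record(EntryKind.HYPOTHESIS, "scribe", "Connection pool exhausted", ruled_out=True)
log.record(EntryKind.CHANGE, "scribe", "Rolled back latest deploy")
```

Giving wrong hypotheses their own entry type, rather than burying them in free text, is one way to make sure the dead ends survive into the postmortem.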

Rootly's analysis of incident timelines makes the point clearly: sequence confusion is one of the most common causes of flawed root cause analysis. Teams misremember the order of actions. They attribute cause to an event that happened after the real root issue began. A precise, timestamped log prevents this. An 11-minute gap between when an alert fired and when it was acknowledged is invisible in memory but obvious in a log, and it might be the most actionable finding in the entire postmortem.
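To show why a timestamped log makes this kind of gap obvious, here is a small sketch that scans consecutive entries for long silences. The function name `find_gaps`, the 10-minute threshold, and the sample timestamps are all assumptions made up for the illustration.

```python
from datetime import datetime, timedelta


def find_gaps(events, threshold=timedelta(minutes=10)):
    """Return (earlier_label, later_label, gap) for each pair of
    consecutive events that are further apart than the threshold."""
    ordered = sorted(events, key=lambda event: event[0])
    gaps = []
    for (t1, label1), (t2, label2) in zip(ordered, ordered[1:]):
        if t2 - t1 > threshold:
            gaps.append((label1, label2, t2 - t1))
    return gaps


# Hypothetical log entries as (timestamp, label) pairs.
events = [
    (datetime(2024, 1, 1, 2, 47), "alert fired"),
    (datetime(2024, 1, 1, 2, 58), "alert acknowledged"),
    (datetime(2024, 1, 1, 3, 2), "first engineer joined"),
    (datetime(2024, 1, 1, 3, 9), "rollback started"),
]

for earlier, later, gap in find_gaps(events):
    print(f"{gap} between '{earlier}' and '{later}'")
```

With these sample entries, the only pair that exceeds the threshold is the 11 minutes between the alert firing and its acknowledgment.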

The format matters less than the habit. A shared Google Doc, a dedicated Slack channel where engineers narrate their actions, a structured incident template in whatever tool your team already uses: any of these work. The key is that the log is open and live from the moment the incident is declared, not assembled afterward from fragments.

[Image: Vertical timeline showing five incident documentation categories, in the style of a 1970s poster.]
Caption: The four things you tried before the fix are often where the real learning lives.

NIST SP 800-61r3, the federal guidance on incident response, explicitly recommends maintaining a detailed log of all actions taken during incident response. GDPR Article 33 requires controllers to document any personal data breach "comprising the facts relating to the personal data breach, its effects and the remedial action taken." SOC 2 auditors expect to see evidence that incidents were tracked and reviewed. The compliance case for real-time documentation is straightforward. But the operational case is stronger: teams that capture clean timelines during the incident avoid most of the reconstruction work that postmortems otherwise require.

Incident.io's analysis found that manual timeline reconstruction consumes 60 to 90 minutes per incident as engineers search through chat history, monitoring dashboards, and call recordings trying to piece together what happened. At 15 incidents a month, that's 180 to 270 hours a year spent on administrative reconstruction. The postmortem becomes archaeology, and archaeology is expensive.
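The arithmetic behind that yearly estimate is easy to check, and easy to rerun with your own incident rate. The helper name `yearly_reconstruction_hours` is hypothetical; the inputs are the per-incident range and monthly incident count quoted above.

```python
def yearly_reconstruction_hours(incidents_per_month, minutes_per_incident):
    """Hours per year spent rebuilding timelines after the fact."""
    incidents_per_year = incidents_per_month * 12
    return incidents_per_year * minutes_per_incident / 60


low = yearly_reconstruction_hours(15, 60)   # 60-minute end of the range
high = yearly_reconstruction_hours(15, 90)  # 90-minute end of the range
print(f"{low:.0f} to {high:.0f} hours per year")
```

At 15 incidents a month, the 60-minute end of the range works out to 180 hours a year and the 90-minute end to 270.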

When the Notes Become the Record

Once the incident is resolved, the real-time log becomes the source material for everything that follows.

Customer-facing incident reports need accurate timelines. Internal postmortems need the sequence of diagnostic steps. Compliance documentation needs evidence of what was done and when. Engineering retrospectives need the hypotheses that were wrong, not just the fix that was right. Trend analysis across incidents needs consistent structure to be useful at all.
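One way to picture this: if every log entry carries a timestamp and a category, each downstream document becomes a filtered view of the same record. The sketch below assumes a simple dictionary-based entry format; the `AUDIENCE_KINDS` mapping, the `render_timeline` function, and the sample entries are all hypothetical, and which categories each audience should see is a judgment call for your team.

```python
from datetime import datetime, timezone

# Hypothetical mapping from each audience to the entry kinds it needs.
AUDIENCE_KINDS = {
    "customer_report": {"state_change", "communication"},
    "postmortem": {"state_change", "diagnostic", "change", "hypothesis"},
    "compliance": {"state_change", "change", "communication"},
}


def render_timeline(entries, audience):
    """Return a plain-text timeline with only the entries this audience needs."""
    kinds = AUDIENCE_KINDS[audience]
    selected = sorted(
        (e for e in entries if e["kind"] in kinds),
        key=lambda e: e["at"],
    )
    return "\n".join(
        f'{e["at"]:%H:%M} UTC  [{e["kind"]}] {e["note"]}' for e in selected
    )


def utc(hour, minute):
    # Sample timestamps on an arbitrary date, for illustration only.
    return datetime(2024, 1, 1, hour, minute, tzinfo=timezone.utc)


entries = [
    {"at": utc(2, 47), "kind": "state_change", "note": "Incident detected"},
    {"at": utc(2, 55), "kind": "hypothesis", "note": "Suspected pool exhaustion (ruled out)"},
    {"at": utc(3, 10), "kind": "communication", "note": "Status page updated"},
    {"at": utc(3, 30), "kind": "state_change", "note": "Service restored"},
]

print(render_timeline(entries, "customer_report"))
```

The customer-facing view drops the ruled-out hypothesis, while the postmortem view keeps it. Neither requires anyone to rewrite the log.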

Atlassian's blameless postmortem guidance makes the point that building a shared timeline before the postmortem meeting is essential. A misunderstanding of the core issue can derail the entire review. The Pragmatic Engineer's survey of incident practices found that most companies target completing postmortem analysis within 36 to 48 hours of resolution. That window is tight. If the timeline has to be reconstructed from memory inside that window, quality degrades fast.

When the log is solid, the postmortem is mostly editing. When the log is missing, the postmortem is guesswork dressed up as analysis.

The Google SRE Incident Management Guide describes the postmortem as a tool for learning from outages and improving systems to prevent similar incidents. That learning only happens if the data is there. "Left unchecked," the guide notes, "outages tend to regularly resurface and accumulate over time." Real-time documentation is how you break that cycle.

For teams that have built the habit of capturing structured data during incidents, Doc Holiday provides the infrastructure to turn those notes into consistent, auditable artifacts automatically, pulling from pull requests, deployment logs, Slack threads, and monitoring alerts without requiring someone to write it up from scratch afterward.

The scribe role is still valuable. The log still needs to be kept. But when the raw material is there, the finished document mostly takes care of itself.
