What Should Go in an Incident Timeline Document


It's 3:00 AM on a Tuesday, and your primary database cluster just started rejecting connections.
By 3:15 AM, the customer support queue is overflowing. By 3:45 AM, you've mitigated the issue by failing over to a read replica and restarting the primary. You're exhausted, your coffee is cold, and you just want to go back to sleep.
But before you can close your laptop, you have to write the incident timeline.
Incident timelines are one of the few pieces of engineering documentation that have real teeth. They are read by executives during post-mortems, scrutinized by auditors during compliance reviews, and pored over by engineers trying to understand what actually broke.

A weak timeline wastes everyone's time. A good one prevents the next fire.
The timeline is the factual spine of your incident response. It is not the post-mortem. The post-mortem is the analysis of why things happened and what you're changing — a distinction worth preserving. The timeline is the raw material: what happened, in order.
Conflating the two makes both weaker.
We all know we should write them, yet teams often struggle to decide what actually belongs in them.
What needs to go in that document is a sequence of factual events with timestamps, quantified impact metrics, specific actions taken, communication events, the root cause if known, and the remediation steps.
The Spine of the Document
The core of your timeline is a chronological list of events. What happened, and exactly when did it happen?
Always log events in UTC, or provide extremely clear timezone context. When your distributed systems span multiple regions and your on-call engineers are spread across three continents, local time zones are a recipe for confusion.
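As a minimal sketch of that normalization step, the snippet below converts timestamps that carry an explicit offset into UTC and refuses ones that carry no timezone context at all. The function name `to_utc` and the sample timestamps are hypothetical, chosen only for illustration.

```python
from datetime import datetime, timezone


def to_utc(stamp: str) -> str:
    """Convert an ISO 8601 timestamp with an explicit offset to a UTC string."""
    parsed = datetime.fromisoformat(stamp)
    if parsed.tzinfo is None:
        # A naive timestamp is ambiguous; reject it instead of guessing a zone.
        raise ValueError(f"timestamp has no timezone context: {stamp!r}")
    return parsed.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# Two engineers report the same moment using their own local offsets.
print(to_utc("2024-03-05T03:00:00-05:00"))
print(to_utc("2024-03-05T09:00:00+01:00"))
```

Both calls produce the same UTC string, which is the point: once every entry is in UTC, two reports of the same moment collapse into one line instead of looking like two events six hours apart.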
Your timeline must include the critical state changes of the incident: detection (when did you first know something was wrong?), escalation (when was the incident commander paged?), mitigation attempts (what did you try, and did it make things worse?), and resolution (when did the system return to normal operating parameters?).
Be precise. "The cache layer began failing health checks at 14:02 UTC" is a useful entry. "Things started looking weird around 2 PM" is not.
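One way to keep those state changes honest is to record each as a small structured entry. The sketch below is an assumption about how a team might model this, not a standard schema: `TimelineEvent`, the `kind` labels, the helper names, and the sample times are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class TimelineEvent:
    at: datetime      # timezone-aware, always UTC
    kind: str         # hypothetical labels: detection, escalation, mitigation_attempt, resolution
    description: str


def minutes_between(events, start_kind, end_kind):
    """Minutes from the first event of start_kind to the first event of end_kind."""
    ordered = sorted(events, key=lambda e: e.at)
    start = next(e.at for e in ordered if e.kind == start_kind)
    end = next(e.at for e in ordered if e.kind == end_kind)
    return (end - start).total_seconds() / 60


def at(hour, minute):
    # Illustrative date only; not taken from a real incident.
    return datetime(2024, 3, 5, hour, minute, tzinfo=timezone.utc)


events = [
    TimelineEvent(at(8, 0), "detection", "Alert: primary database rejecting connections"),
    TimelineEvent(at(8, 5), "escalation", "Incident commander paged"),
    TimelineEvent(at(8, 20), "mitigation_attempt", "Failover to replica started"),
    TimelineEvent(at(8, 45), "resolution", "Connection error rate back to baseline"),
]

for event in events:
    print(f"{event.at:%H:%M} UTC  {event.kind:<19} {event.description}")

print("minutes to escalate:", minutes_between(events, "detection", "escalation"))
print("minutes to resolve:", minutes_between(events, "detection", "resolution"))
```

Keeping the entries structured means the durations people ask about in the post-mortem fall out of the data rather than being estimated from memory.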
The Reality of the Impact
If you can't quantify the impact, you don't really know what happened.
Your timeline needs to capture the user-facing reality of the incident. How many customers were affected? Was revenue at risk? What percentage of requests were failing, and which specific services were degraded?
Include both the peak impact and how that impact changed over time as you attempted mitigations.
If hard numbers aren't immediately available, describe the observable symptoms that mattered to users. A timeline that notes "Checkout page load times exceeded 30 seconds, causing timeouts for an estimated 40% of users" provides necessary context for the decisions that follow.
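When request counts are available, peak impact and its trend over time can be derived directly from them. The following is a rough sketch under that assumption; `ImpactSample` and the sample figures are invented for illustration and do not come from a real incident or a specific monitoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImpactSample:
    minute_utc: str        # start of the sampling window, e.g. "08:15"
    total_requests: int
    failed_requests: int

    @property
    def failure_pct(self) -> float:
        if self.total_requests == 0:
            return 0.0
        return 100.0 * self.failed_requests / self.total_requests


# Invented numbers for illustration only.
samples = [
    ImpactSample("08:00", 1000, 50),
    ImpactSample("08:15", 1000, 400),
    ImpactSample("08:30", 800, 160),
    ImpactSample("08:45", 1000, 10),
]

for s in samples:
    print(f"{s.minute_utc} UTC  {s.failure_pct:.1f}% of requests failing")

peak = max(samples, key=lambda s: s.failure_pct)
print(f"peak impact: {peak.failure_pct:.1f}% at {peak.minute_utc} UTC")
```

Listing the samples next to the mitigation entries lets a reader see whether each action actually moved the failure rate.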
Decision-Making Under Pressure
An incident timeline is a record of decision-making under pressure. You need to document who did what to investigate, mitigate, or resolve the issue.
Don't just write "we restarted the service." Write who made the call, what data informed that decision, and what happened as a result. Did you restart the service because memory usage was spiking, or because it's just the first thing you always try?
Capturing this context is essential for the later post-mortem analysis.
This is where you often have to reconstruct events from Slack threads, PagerDuty logs, and engineer memory. That's fine. Just be explicit about your sources and acknowledge the gaps.
A timeline that admits "we don't have exact timestamps for the failover command" is more credible than one that guesses.
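One possible way to make sources and gaps explicit is to carry them on every entry. This is a sketch, not a prescribed format: `SourcedEntry`, its fields, and the rendered layout are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SourcedEntry:
    description: str
    source: str                      # where the fact came from
    time_utc: Optional[str] = None   # None means no timestamp was recovered
    approximate: bool = False        # True when the time is reconstructed


def render(entry: SourcedEntry) -> str:
    """Format an entry so that missing or approximate times stay visible."""
    if entry.time_utc is None:
        when = "time unknown"
    elif entry.approximate:
        when = f"~{entry.time_utc} UTC"
    else:
        when = f"{entry.time_utc} UTC"
    return f"[{when}] {entry.description} (source: {entry.source})"


entries = [
    SourcedEntry("On-call engineer paged", "PagerDuty log", "08:02"),
    SourcedEntry("Team agreed to fail over", "Slack thread", "08:15", approximate=True),
    SourcedEntry("Failover command run", "engineer recollection"),
]

for entry in entries:
    print(render(entry))
```

The design choice here is that an unknown time is a first-class value rather than a blank to be filled in later, so the gap survives into the published document.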

The Flow of Information
Incidents live or die on information flow. The timeline must show where communication succeeded or failed.
When were stakeholders notified? When did the status page update? When did customer support get briefed on what to tell angry users?
The heartbeat rule of incident communication dictates regular updates, even if the update is just "we are still investigating." Your timeline should reflect this cadence.
If there was a 45-minute gap between the database failing and the status page turning red, the timeline needs to show that gap so the post-mortem can address it.
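A gap like that is easy to surface mechanically once communication events have timestamps. The sketch below flags any stretch between consecutive checkpoints that exceeds a chosen cadence; the function name `find_silences`, the 30-minute threshold, and the sample times are assumptions for illustration only.

```python
from datetime import datetime, timedelta, timezone


def find_silences(update_times, max_gap):
    """Return (start, end) pairs where consecutive updates were more than max_gap apart."""
    ordered = sorted(update_times)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > max_gap]


def at(hour, minute):
    # Illustrative date only; not taken from a real incident.
    return datetime(2024, 3, 5, hour, minute, tzinfo=timezone.utc)


# Incident start followed by each status page update.
checkpoints = [at(8, 0), at(8, 45), at(9, 0), at(9, 15)]

silences = find_silences(checkpoints, max_gap=timedelta(minutes=30))
for start, end in silences:
    minutes = int((end - start).total_seconds() // 60)
    print(f"{minutes}-minute silence from {start:%H:%M} to {end:%H:%M} UTC")
```

Including the incident start as the first checkpoint is what lets the initial delay before the first public update show up as a silence.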
The Actual Failure
If the root cause is confirmed by the time the timeline is published, include it.
If it's a cascading failure due to a resource leak triggered by an unprecedented surge in traffic — the kind of multi-factor collapse the Google SRE Book documents in detail — state that clearly.
If the root cause is not yet confirmed, say so explicitly. Note when follow-up analysis is expected.
Never speculate in the timeline itself. The timeline is a factual document. Theories belong in the post-mortem discussion.
What Happens Next
What happened after the immediate resolution? This is what turns the incident from a stressful story into a learning artifact.
Document the configuration changes made, the patches deployed, the new monitoring added, and the process changes committed to. This record shows that the incident was not just fixed, but that concrete steps were taken to keep it from recurring.
What to Leave Out
Just as important as what goes in is what stays out.
Do not use vague language like "investigated the issue." Do not include blame or speculation about individual performance. The goal is to understand systemic failures, not to punish the engineer who ran the wrong command — a principle Etsy's engineering team articulated clearly and that has since become a standard in the industry.
Leave out unverified theories presented as fact. Leave out extraneous technical detail that doesn't explain decision-making.
You don't need to paste a 400-line stack trace into the timeline if a one-sentence summary of the error suffices. Clarity and accuracy matter more than narrative drama.
The Reality of the Work
Most incident timelines are high-stakes documentation produced under time pressure, after exhausting incidents, with incomplete information.
The difference between a timeline that holds up under scrutiny and one that doesn't is structure and discipline: knowing what to capture, how to format it, and when to stop.
Many teams now rely on automation to pull structured data directly from observability tools and communication platforms. This reduces the manual archaeology required to build the timeline. But it still requires a human to decide what is signal and what is noise, and to add the interpretive context that explains why actions were taken.
Doc Holiday generates incident timelines directly from engineering workflows and communication channels, capturing timestamped events, impact data, and actions taken as they happen. That structure gives teams a clean foundation to validate and publish, rather than reconstructing the timeline from memory and scattered logs after the fact.

