If you’ve ever been in the situation where “something is down” and everyone is asking for updates while you’re still trying to understand what’s happening, you already know the core challenge of a major incident: you need coordination as much as you need technical fixes.
This playbook is designed for small and mid-sized businesses (SMBs) that don’t have an enterprise incident response team, but still need a professional, repeatable way to handle high-impact outages.
What counts as a “major incident” (MI)?
A major incident is any incident that materially impacts customers, revenue, safety, regulatory obligations, or critical internal operations and requires focused coordination across multiple people/teams.
You don’t need a perfect definition. You need a fast and shared one. Two good rules of thumb:
- If you need more than one engineer to resolve it, promote to MI.
- If you’re not sure, promote to MI (you can downgrade later).
A simple severity model (good enough for most SMBs)
Define severities in terms of impact and urgency, not feelings:
- SEV1: Widespread outage or critical business process unavailable; major customer impact; significant financial/reputational risk.
- SEV2: Partial outage or severe degradation; significant subset of users affected; workaround may exist.
- SEV3: Limited impact; minor degradation; no urgent comms required.
Pick one owner for the model (e.g., Head of Engineering / CTO) and write it down. Consistency matters more than perfect wording.
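The severity model and promotion rule above can be encoded as data so everyone works from the same definition. This is an illustrative sketch only: the field names and thresholds are assumptions you should adapt to your own model.

```python
# Illustrative severity table; names match the model above, fields are assumptions.
SEVERITIES = {
    "SEV1": {"impact": "widespread outage / critical process unavailable",
             "external_update_minutes": 15},
    "SEV2": {"impact": "partial outage or severe degradation",
             "external_update_minutes": 60},
    "SEV3": {"impact": "limited impact, minor degradation",
             "external_update_minutes": None},  # no scheduled external updates
}

def should_promote_to_mi(engineers_needed: int, unsure: bool = False) -> bool:
    """The promotion rule above: more than one engineer, or any doubt."""
    return engineers_needed > 1 or unsure
```

Keeping the model in one place (a wiki page, a config file, a runbook) is what makes it consistent; the exact representation matters much less.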
The minimum viable incident organization
The biggest SMB mistake is having everyone debug while nobody runs the incident. You only need three roles:
- Incident Commander (IC): Owns coordination, not the fix. Runs the call, sets priorities, keeps time, assigns tasks.
- Tech Lead (TL): Owns technical diagnosis and mitigation. Coordinates engineers working on the fix.
- Communications Lead (Comms): Owns internal/external updates and stakeholder management.
In a small company, one person may briefly cover two roles (often IC plus Comms), but separate IC and TL as soon as possible.
Role clarity (what each role does and does not do)
IC
- Does: starts the incident, creates the channel, sets cadence, assigns owners, tracks decisions, declares mitigation/resolution.
- Does not: deep-dive into logs for 30 minutes while everyone waits.
TL
- Does: guides triage, delegates investigation, proposes mitigations, manages risk during changes.
- Does not: write stakeholder updates or argue about priorities in public channels.
Comms
- Does: sends updates on a schedule, manages customer-facing messaging, aligns leadership, documents timelines.
- Does not: speculate or debate root cause mid-incident.
The first 15 minutes (the part that determines everything)
Print this. Put it in your on-call runbook. The goal is to create order quickly.
Minute 0–2: Declare and stabilize
- Declare MI (SEV1/SEV2) and name it:
MI-YYYYMMDD-ShortDescription
- Open one incident channel (Slack/Teams) and one video bridge
- Assign IC and TL
- Set an update cadence (e.g., every 15 minutes for SEV1)
Minute 2–5: Scope impact fast
Ask (and write answers in the incident channel):
- What is broken? (customer-visible symptoms, error rates, latency)
- Who is affected? (all users / region / plan / internal only)
- When did it start? (first alert, first customer report, first anomaly)
- What changed recently? (deployments, config changes, vendor incidents)
Minute 5–10: Start parallel workstreams
The TL assigns owners to workstreams:
- Mitigation: quickest way to reduce impact (rollback, disable feature, failover)
- Diagnosis: identify root cause candidates with evidence
- Comms: draft initial statement + internal briefing
- Vendor/Infra (if relevant): contact provider support, check status pages
Minute 10–15: Decide the first move
The IC drives a decision based on risk and impact:
- Roll back?
- Disable a feature flag?
- Scale capacity?
- Failover?
- Reduce blast radius (rate limiting, circuit breakers, read-only mode)?
Decision rule: prefer reversible mitigations first.
Communication: cadence beats brilliance
Stakeholders don’t need perfect technical explanations during an outage. They need:
- Acknowledgement
- Current impact
- What you’re doing next
- When the next update will be
Recommended cadences
- SEV1: every 15 minutes externally; every 10–15 minutes internally (leadership)
- SEV2: every 30–60 minutes; internal as needed
- SEV3: no scheduled updates unless escalated
External update template (status page / email)
Investigating: We are aware of an issue impacting {service/function}. Users may experience {symptoms}.
Next update: {time}.
Workaround (if any): {workaround}.
Internal update template (leadership / company-wide)
MI Update ({time})
Severity: SEV{1/2}
Impact: {who/what is impacted}
Current status: {investigating/mitigating/monitoring}
What we’ve tried: {top 2–3 facts}
Next actions: {top 1–2 actions}
Next update: {time}
Asks/decisions needed: {if any}
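Templating the internal update removes one more decision from the Comms Lead's plate. A hedged sketch that fills the template above (the function signature and defaults are assumptions, the output format mirrors the template):

```python
def internal_update(time: str, severity: int, impact: str, status: str,
                    tried: list, next_actions: list,
                    next_update: str, asks: str = "none") -> str:
    """Fill the internal MI update template; field names mirror the template above."""
    return "\n".join([
        f"MI Update ({time})",
        f"Severity: SEV{severity}",
        f"Impact: {impact}",
        f"Current status: {status}",
        "What we've tried: " + "; ".join(tried),
        "Next actions: " + "; ".join(next_actions),
        f"Next update: {next_update}",
        f"Asks/decisions needed: {asks}",
    ])
```

Even pasting this by hand beats composing each update from scratch, because the format stays stable across updates and readers learn where to look.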
What not to do in comms
- Don’t speculate on root cause (“likely a database issue”) without evidence.
- Don’t promise ETAs unless you have concrete, low-risk steps to get there.
- Don’t go silent—if nothing changed, say “No change, still investigating; next update at X.”
A lightweight toolkit (works without enterprise platforms)
You can run excellent incidents with simple tools:
- Chat: one incident channel, plus threads for specific workstreams
- Video bridge: one shared meeting link
- Timeline doc: a shared document for key events/decisions
- Status page (optional but recommended): even a simple hosted page is better than ad-hoc emails
Minimum data to capture during the incident:
- Time incident declared
- Symptoms and scope
- Key decisions and who approved them
- Mitigations applied and outcomes
- Time impact reduced and time resolved
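The minimum data above can be captured with a trivial in-memory timeline; anything that timestamps entries and records who approved a decision is enough. A minimal sketch (the entry fields are assumptions based on the list above):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class TimelineEntry:
    kind: str       # e.g. "declared", "symptom", "decision", "mitigation"
    text: str
    approver: str = ""  # who approved, for decisions
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

timeline: list = []

def log_event(kind: str, text: str, approver: str = "") -> TimelineEntry:
    """Append a timestamped entry to the shared incident timeline."""
    entry = TimelineEntry(kind, text, approver)
    timeline.append(entry)
    return entry
```

In practice a shared doc or a pinned chat thread serves the same purpose; the point is that every key event gets a timestamp and an owner while memory is fresh.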
Mitigation strategies that work well for SMBs
Most incidents end faster when you treat them as impact management first, diagnosis second.
Roll back quickly (make it boring)
If you deploy frequently, rollbacks should be:
- Documented (one page)
- Fast (single command or button)
- Low-risk (tested regularly)
If rollbacks are slow or scary, you will “push through” and incidents will last longer than necessary.
Feature flags and kill switches
If you can turn off risky behavior without redeploying, you win.
At minimum:
- A “disable new feature” flag for any major launch
- A “read-only mode” or “degraded mode” plan for critical services
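A kill switch only works if the flag is read at request time, not baked in at deploy time. A minimal sketch of that pattern, assuming flags live in a local JSON file with an environment-variable override for speed (both are illustrative choices, not a specific product's API):

```python
import json
import os

def flag_enabled(name: str, flags_path: str = "flags.json", default: bool = True) -> bool:
    """Read a feature flag at request time so it can be flipped without a deploy.
    An environment variable like FLAG_NEW_CHECKOUT=off wins over the file."""
    env = os.environ.get(f"FLAG_{name.upper()}")
    if env is not None:
        return env.lower() not in ("0", "false", "off")
    try:
        with open(flags_path) as f:
            return bool(json.load(f).get(name, default))
    except (OSError, json.JSONDecodeError):
        return default  # fail to the default; never crash the request path
```

The design choice that matters is the failure mode: if the flag store is unreachable, fall back to a known-safe default instead of taking the feature down with it.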
Reduce blast radius
Sometimes the best mitigation is to contain:
- Temporarily disable a region/tenant
- Rate limit heavy endpoints
- Shed load (graceful degradation)
- Serve cached data
This is not “giving up.” It’s protecting the core.
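Rate limiting as a blast-radius control is often implemented as a token bucket. A minimal in-process sketch (real deployments usually enforce this at the load balancer or API gateway; this only illustrates the mechanism):

```python
import time

class TokenBucket:
    """Allow roughly `rate` requests/second, with bursts up to `capacity`."""
    def __init__(self, rate: float, capacity: float):
        self.rate, self.capacity = rate, capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # shed this request to protect the core
```

During an incident you would point this at the heavy endpoint only, with a deliberately low rate, and log what gets shed so you can quantify the trade-off in the post-incident review.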
Decision-making under pressure
Make decisions explicit, and keep them small.
A useful decision framework
For each option, ask:
- Reversibility: Can we undo it quickly?
- Risk: What’s the worst realistic side effect?
- Speed: How fast does it reduce impact?
- Confidence: What evidence supports it?
When confidence is low, prefer reversible actions that reduce impact.
Change control during an incident (without slowing down)
Use a tiny “two-person rule”:
- Any production change during SEV1 requires TL + IC acknowledgement (“Proceeding with rollback now.”)
- Capture the decision in the incident channel
This prevents accidental escalation from rushed changes.
Declaring “mitigated” vs “resolved”
These are different. Treat them differently.
- Mitigated: customer impact materially reduced (e.g., error rates back to normal), but the system might still be fragile.
- Resolved: underlying issue fixed or safely bypassed; monitoring confirms stability; no urgent comms needed.
After mitigation, keep a short “stabilization window” (e.g., 30–60 minutes) with heightened monitoring.
Handover and fatigue management
SMBs get hurt by burnout during incidents.
Rules that help:
- If an incident runs longer than 60–90 minutes, plan for a handover.
- Write a short “current state” summary before switching people.
- The IC should protect the TL from interruptions—fatigue makes mistakes more likely.
After the incident: the part that makes you better
The real ROI of incident management is preventing repeats.
Run a post-incident review (PIR) within 3–5 business days
Keep it blameless and outcome-focused:
- What happened (timeline)
- What we saw (symptoms and signals)
- What we did (mitigations, decisions)
- Why it happened (root cause and contributing factors)
- What we’ll change (specific actions with owners and deadlines)
Action items that actually get done
Good action items are:
- Specific: “Add alert on payment error rate > 2% for 5 min”
- Owned: one named person
- Time-bound: a due date
- Verifiable: you can test or observe it
Avoid: “Improve monitoring,” “Refactor service,” “Be more careful.”
Metrics (don’t overcomplicate)
Track a small set consistently:
- MTTA (mean time to acknowledge): how fast you start the response
- MTTM (mean time to mitigate): how fast you reduce impact
- MTTR (mean time to resolve): how fast you fully stabilize
- Repeat rate: how often the same class of incident returns
The point isn’t to punish teams. It’s to see if your playbook is working.
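Given the per-incident timestamps you captured in the timeline, these metrics are a few lines of arithmetic. A sketch assuming each incident record carries `started`, `acked`, `mitigated`, and `resolved` timestamps (the key names are illustrative):

```python
from datetime import datetime

def incident_metrics(incidents: list) -> dict:
    """Compute MTTA/MTTM/MTTR in minutes from per-incident timestamps.
    Each incident is a dict with datetime keys: started, acked, mitigated, resolved."""
    def mean_minutes(start_key: str, end_key: str) -> float:
        deltas = [(i[end_key] - i[start_key]).total_seconds() / 60 for i in incidents]
        return sum(deltas) / len(deltas)
    return {
        "MTTA": mean_minutes("started", "acked"),
        "MTTM": mean_minutes("started", "mitigated"),
        "MTTR": mean_minutes("started", "resolved"),
    }
```

Repeat rate needs a notion of incident "class" (e.g. a tag per root cause), which is a judgment call best made during the PIR rather than automated.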
A “printable” SMB MI checklist
Declare
- Promote to SEV1/SEV2 (when in doubt, promote)
- Assign IC and TL
- Create incident channel + bridge
- Set update cadence
Triage
- Scope impact (who/what/when)
- Check recent changes
- Start parallel workstreams (mitigation/diagnosis/comms/vendor)
Manage
- Capture key decisions in-channel
- Prioritize reversible mitigations
- Send updates on schedule
- Declare mitigation vs resolution explicitly
Close
- Confirm stability (monitoring + spot checks)
- Summarize timeline and decisions
- Schedule PIR within 3–5 days
- Create action items with owners and due dates
Final thought
You don’t need enterprise tooling to run a professional incident response. You need clarity, cadence, and practice. If you adopt the roles, the first-15-minutes checklist, and disciplined communications, you’ll reduce downtime and build trust—internally and externally.