If you’ve ever been in the situation where “something is down” and everyone is asking for updates while you’re still trying to understand what’s happening, you already know the core challenge of a major incident: you need coordination as much as you need technical fixes.
This playbook is designed for small and mid-sized businesses (SMBs) that don’t have an enterprise incident response team, but still need a professional, repeatable way to handle high-impact outages.
What counts as a “major incident” (MI)?
A major incident is any incident that materially impacts customers, revenue, safety, regulatory obligations, or critical internal operations and requires focused coordination across multiple people/teams.
You don’t need a perfect definition. You need a fast and shared one. Two good rules of thumb:
- If you need more than one engineer to resolve it, promote to MI.
- If you’re not sure, promote to MI (you can downgrade later).
A simple severity model (good enough for most SMBs)
Define severities in terms of impact and urgency, not feelings:
- SEV1: Widespread outage or critical business process unavailable; major customer impact; significant financial/reputational risk.
- SEV2: Partial outage or severe degradation; significant subset of users affected; workaround may exist.
- SEV3: Limited impact; minor degradation; no urgent comms required.
Pick one owner for the model (e.g., Head of Engineering / CTO) and write it down. Consistency matters more than perfect wording.
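The severity model and promotion rule above can be encoded as data so everyone works from the same definition. This is an illustrative sketch only: the field names and thresholds are assumptions you should adapt to your own model.

```python
# Illustrative severity table; names match the model above, fields are assumptions.
SEVERITIES = {
    "SEV1": {"impact": "widespread outage / critical process unavailable",
             "external_update_minutes": 15},
    "SEV2": {"impact": "partial outage or severe degradation",
             "external_update_minutes": 60},
    "SEV3": {"impact": "limited impact, minor degradation",
             "external_update_minutes": None},  # no scheduled external updates
}

def should_promote_to_mi(engineers_needed: int, unsure: bool = False) -> bool:
    """The promotion rule above: more than one engineer, or any doubt."""
    return engineers_needed > 1 or unsure
```

Keeping the model in one place (a wiki page, a config file, a runbook) is what makes it consistent; the exact representation matters much less.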
The minimum viable incident organization
The biggest SMB mistake is having everyone debug while nobody runs the incident. You only need three roles:
- Incident Commander (IC): Owns coordination, not the fix. Runs the call, sets priorities, keeps time, assigns tasks.
- Tech Lead (TL): Owns technical diagnosis and mitigation. Coordinates engineers working on the fix.
- Communications Lead (Comms): Owns internal/external updates and stakeholder management.
In a small company, one person may briefly cover two roles (often IC plus Comms), but separate IC and TL as soon as possible.
Role clarity (what each role does and does not do)
IC
- Does: starts the incident, creates the channel, sets cadence, assigns owners, tracks decisions, declares mitigation/resolution.
- Does not: deep-dive into logs for 30 minutes while everyone waits.
TL
- Does: guides triage, delegates investigation, proposes mitigations, manages risk during changes.
- Does not: write stakeholder updates or argue about priorities in public channels.
Comms
- Does: sends updates on a schedule, manages customer-facing messaging, aligns leadership, documents timelines.
- Does not: speculate or debate root cause mid-incident.
The first 15 minutes (the part that determines everything)
Print this. Put it in your on-call runbook. The goal is to create order quickly.
Minute 0–2: Declare and stabilize
- Declare MI (SEV1/SEV2) and name it:
MI-YYYYMMDD-ShortDescription
- Open one incident channel (Slack/Teams) and one video bridge
- Assign IC and TL
- Set an update cadence (e.g., every 15 minutes for SEV1)
Minute 2–5: Scope impact fast
Ask (and write answers in the incident channel):
- What is broken? (customer-visible symptoms, error rates, latency)
- Who is affected? (all users / region / plan / internal only)
- When did it start? (first alert, first customer report, first anomaly)
- What changed recently? (deployments, config changes, vendor incidents)
Minute 5–10: Start parallel workstreams
The TL assigns owners to workstreams:
- Mitigation: quickest way to reduce impact (rollback, disable feature, failover)
- Diagnosis: identify root cause candidates with evidence
- Comms: draft initial statement + internal briefing
- Vendor/Infra (if relevant): contact provider support, check status pages
Minute 10–15: Decide the first move
The IC drives a decision based on risk and impact:
- Roll back?
- Disable a feature flag?
- Scale capacity?
- Failover?
- Reduce blast radius (rate limiting, circuit breakers, read-only mode)?
Decision rule: prefer reversible mitigations first.
Communication: cadence beats brilliance
Stakeholders don’t need perfect technical explanations during an outage. They need:
- Acknowledgement
- Current impact
- What you’re doing next
- When the next update will be
Recommended cadences
- SEV1: every 15 minutes externally; every 10–15 minutes internally (leadership)
- SEV2: every 30–60 minutes; internal as needed
- SEV3: no scheduled updates unless escalated
External update template (status page / email)
Investigating: We are aware of an issue impacting {service/function}. Users may experience {symptoms}.
Next update: {time}.
Workaround (if any): {workaround}.
Internal update template (leadership / company-wide)
MI Update ({time})
Severity: SEV{1/2}
Impact: {who/what is impacted}
Current status: {investigating/mitigating/monitoring}
What we’ve tried: {top 2–3 facts}
Next actions: {top 1–2 actions}
Next update: {time}
Asks/decisions needed: {if any}
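Templating the internal update removes one more decision from the Comms Lead's plate. A hedged sketch that fills the template above (the function signature and defaults are assumptions, the output format mirrors the template):

```python
def internal_update(time: str, severity: int, impact: str, status: str,
                    tried: list, next_actions: list,
                    next_update: str, asks: str = "none") -> str:
    """Fill the internal MI update template; field names mirror the template above."""
    return "\n".join([
        f"MI Update ({time})",
        f"Severity: SEV{severity}",
        f"Impact: {impact}",
        f"Current status: {status}",
        "What we've tried: " + "; ".join(tried),
        "Next actions: " + "; ".join(next_actions),
        f"Next update: {next_update}",
        f"Asks/decisions needed: {asks}",
    ])
```

Even pasting this by hand beats composing each update from scratch, because the format stays stable across updates and readers learn where to look.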
What not to do in comms
- Don’t speculate on root cause (“likely a database issue”) without evidence.
- Don’t promise ETAs unless you have concrete, low-risk steps to get there.
- Don’t go silent—if nothing changed, say “No change, still investigating; next update at X.”
A lightweight toolkit (works without enterprise platforms)
You can run excellent incidents with simple tools:
- Chat: one incident channel, plus threads for specific workstreams
- Video bridge: one shared meeting link
- Timeline doc: a shared document for key events/decisions
- Status page (optional but recommended): even a simple hosted page is better than ad-hoc emails
Minimum data to capture during the incident:
- Time incident declared
- Symptoms and scope
- Key decisions and who approved them
- Mitigations applied and outcomes
- Time impact reduced and time resolved
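The minimum data above can be captured with a trivial in-memory timeline; anything that timestamps entries and records who approved a decision is enough. A minimal sketch (the entry fields are assumptions based on the list above):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class TimelineEntry:
    kind: str       # e.g. "declared", "symptom", "decision", "mitigation"
    text: str
    approver: str = ""  # who approved, for decisions
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

timeline: list = []

def log_event(kind: str, text: str, approver: str = "") -> TimelineEntry:
    """Append a timestamped entry to the shared incident timeline."""
    entry = TimelineEntry(kind, text, approver)
    timeline.append(entry)
    return entry
```

In practice a shared doc or a pinned chat thread serves the same purpose; the point is that every key event gets a timestamp and an owner while memory is fresh.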
Mitigation strategies that work well for SMBs
Most incidents end faster when you treat them as impact management first, diagnosis second.
Roll back quickly (make it boring)
If you deploy frequently, rollbacks should be:
- Documented (one page)
- Fast (single command or button)
- Low-risk (tested regularly)
If rollbacks are slow or scary, you will “push through” and incidents will last longer than necessary.
Feature flags and kill switches
If you can turn off risky behavior without redeploying, you win.
At minimum:
- A “disable new feature” flag for any major launch
- A “read-only mode” or “degraded mode” plan for critical services
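A kill switch only works if the flag is read at request time, not baked in at deploy time. A minimal sketch of that pattern, assuming flags live in a local JSON file with an environment-variable override for speed (both are illustrative choices, not a specific product's API):

```python
import json
import os

def flag_enabled(name: str, flags_path: str = "flags.json", default: bool = True) -> bool:
    """Read a feature flag at request time so it can be flipped without a deploy.
    An environment variable like FLAG_NEW_CHECKOUT=off wins over the file."""
    env = os.environ.get(f"FLAG_{name.upper()}")
    if env is not None:
        return env.lower() not in ("0", "false", "off")
    try:
        with open(flags_path) as f:
            return bool(json.load(f).get(name, default))
    except (OSError, json.JSONDecodeError):
        return default  # fail to the default; never crash the request path
```

The design choice that matters is the failure mode: if the flag store is unreachable, fall back to a known-safe default instead of taking the feature down with it.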
Reduce blast radius
Sometimes the best mitigation is to contain:
- Temporarily disable a region/tenant
- Rate limit heavy endpoints
- Shed load (graceful degradation)
- Serve cached data
This is not “giving up.” It’s protecting the core.
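Rate limiting as a blast-radius control is often implemented as a token bucket. A minimal in-process sketch (real deployments usually enforce this at the load balancer or API gateway; this only illustrates the mechanism):

```python
import time

class TokenBucket:
    """Allow roughly `rate` requests/second, with bursts up to `capacity`."""
    def __init__(self, rate: float, capacity: float):
        self.rate, self.capacity = rate, capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # shed this request to protect the core
```

During an incident you would point this at the heavy endpoint only, with a deliberately low rate, and log what gets shed so you can quantify the trade-off in the post-incident review.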
Decision-making under pressure
Make decisions explicit, and keep them small.
A useful decision framework
For each option, ask:
- Reversibility: Can we undo it quickly?
- Risk: What’s the worst realistic side effect?
- Speed: How fast does it reduce impact?
- Confidence: What evidence supports it?
When confidence is low, prefer reversible actions that reduce impact.
Change control during an incident (without slowing down)
Use a tiny “two-person rule”:
- Any production change during SEV1 requires TL + IC acknowledgement (“Proceeding with rollback now.”)
- Capture the decision in the incident channel
This prevents accidental escalation from rushed changes.
Declaring “mitigated” vs “resolved”
These are different. Treat them differently.
- Mitigated: customer impact materially reduced (e.g., error rates back to normal), but the system might still be fragile.
- Resolved: underlying issue fixed or safely bypassed; monitoring confirms stability; no urgent comms needed.
After mitigation, keep a short “stabilization window” (e.g., 30–60 minutes) with heightened monitoring.
Handover and fatigue management
SMBs get hurt by burnout during incidents.
Rules that help:
- If an incident runs longer than 60–90 minutes, plan for a handover.
- Write a short “current state” summary before switching people.
- The IC should protect the TL from interruptions—fatigue makes mistakes more likely.
After the incident: the part that makes you better
The real ROI of incident management is preventing repeats.
Run a post-incident review (PIR) within 3–5 business days
Keep it blameless and outcome-focused:
- What happened (timeline)
- What we saw (symptoms and signals)
- What we did (mitigations, decisions)
- Why it happened (root cause and contributing factors)
- What we’ll change (specific actions with owners and deadlines)
Action items that actually get done
Good action items are:
- Specific: “Add alert on payment error rate > 2% for 5 min”
- Owned: one named person
- Time-bound: a due date
- Verifiable: you can test or observe it
Avoid: “Improve monitoring,” “Refactor service,” “Be more careful.”
Metrics (don’t overcomplicate)
Track a small set consistently:
- MTTA (mean time to acknowledge): how fast you start the response
- MTTM (mean time to mitigate): how fast you reduce impact
- MTTR (mean time to resolve): how fast you fully stabilize
- Repeat rate: how often the same class of incident returns
The point isn’t to punish teams. It’s to see if your playbook is working.
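Given the per-incident timestamps you captured in the timeline, these metrics are a few lines of arithmetic. A sketch assuming each incident record carries `started`, `acked`, `mitigated`, and `resolved` timestamps (the key names are illustrative):

```python
from datetime import datetime

def incident_metrics(incidents: list) -> dict:
    """Compute MTTA/MTTM/MTTR in minutes from per-incident timestamps.
    Each incident is a dict with datetime keys: started, acked, mitigated, resolved."""
    def mean_minutes(start_key: str, end_key: str) -> float:
        deltas = [(i[end_key] - i[start_key]).total_seconds() / 60 for i in incidents]
        return sum(deltas) / len(deltas)
    return {
        "MTTA": mean_minutes("started", "acked"),
        "MTTM": mean_minutes("started", "mitigated"),
        "MTTR": mean_minutes("started", "resolved"),
    }
```

Repeat rate needs a notion of incident "class" (e.g. a tag per root cause), which is a judgment call best made during the PIR rather than automated.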
A “printable” SMB MI checklist
Declare
- Promote to SEV1/SEV2 (when in doubt, promote)
- Assign IC and TL
- Create incident channel + bridge
- Set update cadence
Triage
- Scope impact (who/what/when)
- Check recent changes
- Start parallel workstreams (mitigation/diagnosis/comms/vendor)
Manage
- Capture key decisions in-channel
- Prioritize reversible mitigations
- Send updates on schedule
- Declare mitigation vs resolution explicitly
Close
- Confirm stability (monitoring + spot checks)
- Summarize timeline and decisions
- Schedule PIR within 3–5 days
- Create action items with owners and due dates
Final thought
You don’t need enterprise tooling to run a professional incident response. You need clarity, cadence, and practice. If you adopt the roles, the first-15-minutes checklist, and disciplined communications, you’ll reduce downtime and build trust—internally and externally.