
The Major Incident: When Everything Is on Fire and Everyone Is Watching
Major incident management is the discipline that exposes every weakness in your service organisation at precisely the worst moment, so here is how to run it properly.
There is a particular quality to the silence in a service organisation in the thirty seconds after someone realises a major incident is unfolding. It is the silence of several hundred people simultaneously hoping it is somebody else's problem. My experience has been that the gap between a competent service organisation and a chaotic one is never wider than during a major incident, because everything you skimped on during peacetime presents its invoice all at once. Run-of-the-mill incident management is largely a process exercise. Major incident management is a test of character, communication and command, and it cannot be improvised on the day.
Declaring the Incident: The Decision Nobody Wants to Make
Why hesitation costs more than over-reaction
The single most expensive moment in any major incident is the period during which nobody is willing to call it one. People sit on a degrading situation, hoping it self-heals, because declaring a major incident feels like an admission of failure and an act that summons senior management.
- Define declaration criteria in advance — agree the objective triggers (revenue-impacting, customer-facing, regulatory, safety-related) so the decision is mechanical rather than political.
- Empower juniors to invoke — in my opinion the right to declare a major incident should sit with whoever first recognises it, not with whoever is most senior, because severity does not wait for a manager to return from lunch.
- Treat false alarms as cheap — a stood-down major incident costs an hour of mild embarrassment, whereas a delayed declaration costs the business properly.
I am always conscious that organisations punish over-declaration far more visibly than under-declaration, which trains people to hesitate at exactly the wrong moment.
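One way to take the politics out of the decision is to write the agreed triggers down as data and check them without judgement. The following is a minimal sketch in Python, not a feature of any ITSM tool: the names `IncidentSignal`, `triggered_criteria` and `should_declare_major` are hypothetical, and the four flags simply mirror the trigger categories listed above.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class IncidentSignal:
    """Hypothetical snapshot of what the person raising the alarm can see."""
    revenue_impacting: bool = False
    customer_facing: bool = False
    regulatory: bool = False
    safety_related: bool = False


def triggered_criteria(signal: IncidentSignal) -> list[str]:
    """Return the name of every agreed trigger that the signal meets."""
    return [name for name, met in asdict(signal).items() if met]


def should_declare_major(signal: IncidentSignal) -> bool:
    """Any single trigger is enough; the seniority of the caller plays no part."""
    return bool(triggered_criteria(signal))


if __name__ == "__main__":
    signal = IncidentSignal(customer_facing=True, revenue_impacting=True)
    print(should_declare_major(signal), triggered_criteria(signal))
```

The point of the sketch is that the function takes no argument for who is asking. Whoever first recognises the situation gets the same answer as the most senior person in the building.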
The Major Incident Manager: A Role, Not a Person
The conductor who must not play an instrument
The most common failure I see is the major incident manager who cannot resist diving into the technical detail. The instant your incident commander is reading log files, you no longer have an incident commander.
- Separate command from resolution — the major incident manager coordinates, communicates and removes blockers; the technical teams diagnose and fix.
- Make the role explicit and rotating — it should be a named, trained, rostered function rather than whichever poor soul happened to answer the bridge.
- Give them the authority to interrupt — they must be able to pull people off other work, escalate to vendors and convene leadership without seeking permission first.
My preference is for the major incident manager to be deliberately non-specialist, because the temptation to "just have a quick look" is the enemy of clear coordination.
The War Room and the Bridge: Communication Under Pressure
Managing two audiences who want very different things
During a major incident you are running two parallel conversations. The technical bridge wants signal and quiet. The stakeholders want reassurance and frequency. These needs are in direct conflict and must be physically and procedurally separated.
- Keep the technical bridge sacred — no executives narrating their anxiety, no "any update yet?" every ninety seconds. Diagnosis needs concentration.
- Run a separate stakeholder cadence — issue structured updates at a fixed interval (every 30 minutes is my usual default) even when there is nothing new, because silence breeds escalation calls.
- Say what you know, what you do not, and when you will speak again — that three-part structure prevents the rumour-filling that does so much damage.
In my opinion, communicating on a fixed clock, regardless of progress, is the most underrated skill in major incident management. People tolerate a long outage far better than they tolerate being ignored.
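The fixed clock and the three-part update can both be pre-built so that nobody has to think about them mid-crisis. The sketch below is illustrative only: `next_update_due` and `format_update` are hypothetical helper names, and the 30-minute interval is just the default mentioned above, anchored at the moment of declaration.

```python
from datetime import datetime, timedelta


def next_update_due(declared_at: datetime, now: datetime,
                    interval: timedelta = timedelta(minutes=30)) -> datetime:
    """Return the next slot on the fixed clock, anchored at declaration time."""
    slots_passed = (now - declared_at) // interval  # whole intervals elapsed
    return declared_at + (slots_passed + 1) * interval


def format_update(known: str, unknown: str, next_update: datetime) -> str:
    """Build the three-part stakeholder update: known, unknown, next contact."""
    return (
        f"What we know: {known}\n"
        f"What we do not know: {unknown}\n"
        f"Next update: {next_update:%H:%M}"
    )


if __name__ == "__main__":
    declared = datetime(2024, 1, 1, 9, 0)
    due = next_update_due(declared, now=datetime(2024, 1, 1, 9, 40))
    print(format_update(
        known="Checkout is failing for all customers.",
        unknown="Whether the fault is in our platform or the payment provider.",
        next_update=due,
    ))
```

Because the next slot is computed from the declaration time rather than from the last message sent, a late update does not push every later one back.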
Restoration First, Root Cause Later
The unnatural act of not fixing the cause
ITIL 4 is quite clear that the purpose of incident management is to restore normal service as quickly as possible, and yet engineers instinctively reach for the root cause because that is the intellectually satisfying problem.
- Prioritise the workaround — failover, reroute, roll back, restart. Elegance is for the problem record.
- Resist the urge to understand — the deep investigation is genuinely important, but it belongs to problem management once the bleeding has stopped.
- Record everything contemporaneously — timestamps, actions, decisions and who took them, because nobody remembers accurately at 3am and the post-incident review depends on it.
My experience has been that the teams who restore fastest are the ones who have explicitly given themselves permission to apply an ugly fix and feel no shame about it.
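Contemporaneous recording is easier when the log stamps the time for you. What follows is a minimal sketch of an append-only timeline, assuming a hypothetical `IncidentTimeline` class. In practice this would more likely be a chat channel or a ticket field, but the shape of each entry (when, who, what kind, what happened) is the same.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class TimelineEntry:
    at: datetime   # when it happened, in UTC
    actor: str     # who took the action or made the decision
    kind: str      # "action" or "decision"
    detail: str    # what was done or decided


@dataclass
class IncidentTimeline:
    entries: list[TimelineEntry] = field(default_factory=list)

    def record(self, actor: str, kind: str, detail: str,
               at: datetime | None = None) -> TimelineEntry:
        """Append an entry, stamping it with the current UTC time by default."""
        entry = TimelineEntry(at or datetime.now(timezone.utc), actor, kind, detail)
        self.entries.append(entry)
        return entry

    def render(self) -> str:
        """Return the timeline as plain text, one entry per line, for the review."""
        return "\n".join(
            f"{e.at:%Y-%m-%d %H:%M:%S}Z [{e.kind}] {e.actor}: {e.detail}"
            for e in self.entries
        )


if __name__ == "__main__":
    timeline = IncidentTimeline()
    timeline.record("alice", "decision", "Roll back the 02:40 release")
    timeline.record("bob", "action", "Rollback started")
    print(timeline.render())
```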
The Post-Incident Review: Blameless or Pointless
Where the real value is either created or destroyed
A major incident that is not properly reviewed is simply a rehearsal for the next one. The review is where the cost is finally converted into something useful, and it is astonishing how often organisations skip it because everyone is exhausted and relieved.
- Make it blameless and mean it — the moment a review becomes a search for someone to discipline, you will never again receive honest information about what actually happened.
- Interrogate the system, not the individual — the right question is "why was it possible for one person to make this mistake", not "who made the mistake".
- Produce actions with owners and dates — a review that generates insight but no committed remediation is theatre.
In my opinion the blameless review is the single highest-leverage practice in service management, and it is also the one most quietly undermined by organisational culture that demands a name to blame.
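The "owners and dates" rule is simple enough to check mechanically before a review is closed. This is a small sketch under assumed names: `ReviewAction` and `uncommitted` are hypothetical and not part of any review tooling.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ReviewAction:
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None


def uncommitted(actions: list[ReviewAction]) -> list[ReviewAction]:
    """Return the actions that lack an owner or a due date."""
    return [a for a in actions if not a.owner or a.due is None]


if __name__ == "__main__":
    actions = [
        ReviewAction("Add alerting on replication lag", "dana", date(2024, 2, 1)),
        ReviewAction("Rethink the deployment approval step"),
    ]
    for action in uncommitted(actions):
        print("No owner or date:", action.description)
```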
Where to Start
Practical first steps for the unglamorous work of preparation
If your major incident process exists only as a document nobody has read, here is where I would begin:
- Run a game day — simulate a major incident in a controlled setting and watch where the coordination breaks. It will break somewhere you did not expect.
- Write the declaration criteria down — get the objective triggers agreed and published so nobody has to be brave to invoke them.
- Train and roster major incident managers — make it a real, named, supported role with proper handover.
- Template your communications — pre-write the stakeholder update structure so you are not composing prose during a crisis.
- Hold the review every time — even for the small ones, because the muscle you build on the minor incidents is the muscle you will rely on during the major one.
The brutal truth is that major incident capability is built entirely in peacetime. You cannot acquire it on the day, and the day always comes.
I hope this has been useful, and I wish you well on your ITSM journey.


