Problem Management: The Discipline Everyone Skips and Always Regrets

Problem management is the practice most organisations claim to do and very few actually practise, and the cost of that gap is paid daily in repeated incidents.

Written by: Steven Godson
Read time: 7 minutes
ITSM

Of all the ITIL practices, problem management is the one most likely to appear on an organisation chart and least likely to appear in anyone's actual working week. Everybody agrees it matters. Everybody nods sagely when the same outage recurs for the third time. And yet, when I review a service management capability, problem management is almost always the practice that has been quietly downgraded to "something we'll do when the firefighting stops". The firefighting never stops, of course, precisely because nobody is doing problem management. My experience has been that this is one of the few genuine vicious circles in our profession, and breaking it requires deliberate effort rather than good intentions.

What Problem Management Actually Is

Separating the practice from the wishful thinking

There is a persistent confusion between incident management and problem management, and it costs organisations dearly. An incident is an unplanned interruption to a service. A problem is the underlying cause of one or more incidents. Resolving an incident restores service; resolving a problem stops the incident recurring. These are different objectives requiring different mindsets, and conflating them is the original sin.

  • Incident management is about speed — the goal is to restore normal service as quickly as possible, and that often means a workaround rather than a fix.
  • Problem management is about understanding — the goal is to establish cause and prevent recurrence, which is rarely fast and almost never glamorous.
  • The two are partners, not substitutes — a mature operation runs both in parallel, with incidents feeding the problem backlog and problem resolution shrinking the incident volume over time.

In my opinion, the single most useful sentence you can teach a service desk is this: closing the ticket is not the same as solving the problem. Once people internalise that, half the cultural battle is won.
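
To make the distinction concrete, here is a minimal sketch of how the two record types might be kept apart in a data model. The class and field names (`Incident`, `Problem`, `is_solved` and so on) are my own illustrative assumptions, not the schema of any particular ITSM tool.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class IncidentStatus(Enum):
    OPEN = "open"
    RESOLVED = "resolved"  # service restored, possibly only by a workaround


@dataclass
class Incident:
    incident_id: str
    summary: str
    status: IncidentStatus = IncidentStatus.OPEN
    problem_id: Optional[str] = None  # link to the underlying problem, if known


@dataclass
class Problem:
    problem_id: str
    description: str
    root_cause: Optional[str] = None  # unknown until investigation establishes it
    permanent_fix_applied: bool = False
    incident_ids: List[str] = field(default_factory=list)

    def link(self, incident: Incident) -> None:
        """Attach an incident to this problem so recurrence stays visible."""
        incident.problem_id = self.problem_id
        if incident.incident_id not in self.incident_ids:
            self.incident_ids.append(incident.incident_id)

    @property
    def is_solved(self) -> bool:
        # Solved means cause understood and fix in place, not "tickets closed".
        return self.root_cause is not None and self.permanent_fix_applied
```

In this sketch, resolving every linked incident leaves `is_solved` untouched. It only becomes true once a root cause is recorded and a permanent fix is applied, which is the point of the sentence above about closing tickets.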

Reactive Versus Proactive Problem Management

Why most teams only ever do the easy half

ITIL splits problem management into reactive and proactive activity, and in practice almost everyone does the reactive part badly and the proactive part not at all. Reactive problem management responds to incidents that have already occurred. Proactive problem management hunts for issues before they cause an interruption, using trend analysis, monitoring data and a degree of professional paranoia.

  • Reactive is investigation after the fact — something broke, we ask why, we attempt to ensure it does not break again.
  • Proactive is prevention through pattern recognition — analysing recurring incidents, near misses and monitoring signals to find latent weaknesses before they bite.
  • Proactive work needs protected time — it will never happen if you expect it to occur in the gaps between escalations, because there are no gaps.

My preference is to ring-fence proactive problem management as scheduled, named work with an owner, rather than leaving it as an aspiration. If it is everybody's job in principle, it is nobody's job in practice. I have watched too many "proactive problem management initiatives" quietly evaporate because they were never given a slot in anyone's actual calendar.
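
As an illustration of what that scheduled slot might involve, the sketch below compares incident categories across two periods and flags those that are both frequent and rising. The function name `rising_categories`, the `min_count` threshold of three and the idea of comparing exactly two periods are assumptions for the sake of the example; real trend analysis would draw on whatever your tooling exports.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def rising_categories(
    previous: Iterable[str],
    current: Iterable[str],
    min_count: int = 3,
) -> List[Tuple[str, int, int]]:
    """Return (category, previous_count, current_count) for rising categories.

    A category qualifies when it occurred at least `min_count` times in the
    current period and more often than in the previous period.
    """
    prev_counts = Counter(previous)
    curr_counts = Counter(current)
    rising = [
        (category, prev_counts[category], count)
        for category, count in curr_counts.items()
        if count >= min_count and count > prev_counts[category]
    ]
    # Largest increase first; ties broken alphabetically for stable output.
    rising.sort(key=lambda row: (row[1] - row[2], row[0]))
    return rising
```

Each row returned is a candidate for a proactive problem record, not a verdict. A person still has to decide whether the pattern reflects a latent weakness or just a busy week.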

The Known Error Database Nobody Maintains

A genuinely valuable artefact, treated as an afterthought

A known error is a problem that has been analysed but not yet permanently resolved, typically with a documented workaround. The Known Error Database, or KEDB, is where this knowledge lives. When it is maintained, it is one of the most powerful efficiency tools in the entire ITSM estate. When it is neglected, which is the usual condition, the organisation rediscovers the same workarounds repeatedly, paying full price each time.

  • A good KEDB accelerates incident resolution — the service desk finds the documented workaround in seconds rather than escalating to a specialist who solves it from scratch.
  • It preserves institutional memory — knowledge that would otherwise leave with the engineer who happened to know the trick.
  • It must be linked to incidents and changes — an isolated KEDB is a curiosity; an integrated one is an asset.

I am always conscious that the KEDB is only as good as the discipline of populating it. The temptation, having found a workaround under pressure, is to move straight on to the next fire. In my opinion, capturing the known error at the moment of discovery should be a non-negotiable step, not a tidy-up task for a quieter day that never arrives.
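
To show how little is needed to get started, here is a rough sketch of a KEDB as an in-memory lookup. Everything in it is hypothetical: the `KnownError` fields, the `KnownErrorDatabase` class and the naive keyword matching are stand-ins for whatever search your ITSM platform actually provides.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class KnownError:
    known_error_id: str
    problem_id: str             # link back to the problem under investigation
    symptoms: FrozenSet[str]    # lower-case keywords describing the symptom
    workaround: str


class KnownErrorDatabase:
    def __init__(self) -> None:
        self._entries: List[KnownError] = []

    def record(self, entry: KnownError) -> None:
        """Capture a known error at the moment the workaround is found."""
        self._entries.append(entry)

    def find_workaround(self, incident_summary: str) -> Optional[KnownError]:
        """Return the entry sharing the most keywords with the summary."""
        words = set(incident_summary.lower().split())
        best: Optional[KnownError] = None
        best_score = 0
        for entry in self._entries:
            score = len(entry.symptoms & words)
            if score > best_score:
                best, best_score = entry, score
        return best  # None when nothing matches, so the incident escalates
```

The design choice worth noting is the mandatory `problem_id` on every entry. It keeps the known error tied to an open problem, so a workaround cannot quietly become the permanent answer.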

Root Cause Analysis Without the Theatre

Doing the investigation properly rather than performing it

Root cause analysis has acquired a slightly ceremonial quality in some organisations. There is a template, a meeting, a document, and a sense of closure, but frequently no actual root cause and certainly no preventive action. Genuine RCA is harder and less satisfying than the theatrical version, because it often concludes that the cause is uncomfortable, systemic or political.

  • Use a method, not a hunch — techniques such as the Five Whys, fishbone diagrams or fault tree analysis impose useful discipline, provided you actually follow them.
  • Beware stopping at the first plausible cause — "human error" is almost never a root cause; it is the point at which lazy analysis gives up.
  • Track the corrective actions to completion — an RCA that identifies a fix nobody implements is an expensive form of documentation.

My experience has been that the quality of RCA tracks very closely with whether the organisation is willing to hear bad news. Where there is a blameless culture, you get honest analysis. Where there is a hunt for someone to fault, you get carefully worded documents that explain nothing and protect everyone.

Problem Management in a DevOps and SRE World

The practice has not died, it has been renamed

There is a fashionable view that problem management is a relic of the slow old world and that DevOps and SRE have rendered it obsolete. I disagree, fairly strongly. What SRE calls a blameless post-incident review is, in substance, reactive problem management with better facilitation. What error budgets encourage is, in effect, the prioritisation of problem resolution over feature work. The vocabulary has changed; the discipline has not.

  • Post-incident reviews are problem records by another name — the difference is cultural framing, and the framing is genuinely an improvement.
  • Error budgets create the time problem management always lacked — when reliability is a measured commitment, fixing recurring problems becomes a business priority rather than a favour.
  • Observability data is proactive problem management fuel — modern telemetry gives you the trend analysis that older operations had to assemble by hand.

In my opinion the smartest thing a traditional ITSM team can do is borrow the blameless post-incident review wholesale and stop arguing about whose model it belongs to. The practices converge because the underlying need is universal.
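
For readers who have not met error budgets, the arithmetic is simple enough to sketch. The function names below and the rule of switching to problem work once the budget is spent are illustrative assumptions on my part, not a prescribed SRE policy, and real budgets are often measured in failed requests rather than minutes of downtime.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of unavailability a service may accrue within the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_remaining(
    slo_target: float, downtime_minutes: float, window_days: int = 30
) -> float:
    """Budget left after the downtime already recorded in the window."""
    return error_budget_minutes(slo_target, window_days) - downtime_minutes


def should_prioritise_problem_work(
    slo_target: float, downtime_minutes: float, window_days: int = 30
) -> bool:
    # Assumed policy: once the budget is spent, reliability fixes come first.
    return budget_remaining(slo_target, downtime_minutes, window_days) <= 0
```

A 99.9% target over 30 days allows roughly 43 minutes of downtime. One bad recurring fault can consume that in a single afternoon, and at that point fixing the underlying problem stops being optional.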

Where to Start

Practical steps for an organisation that knows it should do better

If your problem management exists mainly on paper, here is where I would begin. None of this requires a transformation programme, which is precisely the point.

  • Separate problem records from incidents in your tooling — if you cannot distinguish the two, you cannot manage them differently.
  • Pick your top three recurring incidents and raise problems against them — start with the irritations everyone already complains about, because the value is obvious and the support is immediate.
  • Make KEDB entries a mandatory step in workaround discovery — capture the knowledge at the moment it exists, not later.
  • Schedule proactive analysis as named, owned work — even two hours a week is infinitely more than zero.
  • Adopt blameless reviews regardless of your operating model — honesty is the prerequisite for every other improvement.
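
The second step above can be done with a spreadsheet, but for those who prefer a script, here is a minimal sketch. It assumes incidents can be exported as dictionaries with `id` and `category` keys, which is a hypothetical shape chosen for illustration, and it treats a shared category as a rough proxy for "the same fault".

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def top_recurring(
    incidents: List[Dict[str, str]], n: int = 3
) -> List[Tuple[str, List[str]]]:
    """Return up to n categories with the most incidents, with their IDs.

    Categories seen only once are excluded because they are not recurring.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for incident in incidents:
        groups[incident["category"]].append(incident["id"])
    # Most incidents first; ties broken alphabetically by category.
    ranked = sorted(groups.items(), key=lambda item: (-len(item[1]), item[0]))
    return [(category, ids) for category, ids in ranked[:n] if len(ids) > 1]
```

Returning the incident IDs alongside each category means the new problem record can be linked to its evidence from day one.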

The return on problem management is rarely visible in the quarter you invest in it, which is exactly why it gets cut. But the organisations that persist quietly find their incident volumes falling, their specialists freed from endless repetition, and their service desks answering questions instead of re-raising the same fault for the hundredth time.

Hopefully this has been useful to you and I wish you well on your ITSM journey…
