How Machine Learning Can Transform Incident Management from a Burden into a DevOps Enabler
September 25, 2023

Ajay Singh
Zebrium

Among the most vital performance metrics for any DevOps team is mean time to recovery (MTTR) — the time to respond to a software or infrastructure issue and return to expected performance levels. As organizations embrace DevOps methodologies, with an emphasis on close collaboration and rapid iteration, they expect MTTR to improve.

But a long-running study of DevOps practices — Google Cloud's Accelerate State of DevOps Report — suggests that any historical gains in MTTR reduction have now plateaued. For years now, the time it takes to restore services has stayed about the same: less than a day for high performers but up to a week for middle-tier teams and up to a month for laggards. The fact that progress is flat despite big investments in people, tools and automation is a cause for concern. And it's happening because complexity is growing as quickly as these investments, meaning we're running harder without actually making progress.

The good news is that machine learning (ML) can help development teams break through this seemingly intractable MTTR barrier, transforming incident management (IM) into not only a core competency but also an enabler of successful DevOps.

The IM Conundrum

Development teams struggle with IM for two reasons. First, as organizations move to cloud-native architectures, developers become responsible for a complex environment built around microservices, with every application an assemblage of many discrete, loosely coupled components. Not even the most experienced developer can envision all the pieces of that puzzle and how they fit together.

The second reason the needle hasn't moved on IM is the rapid pace of change in today's applications. In the past, when teams uncovered an issue, they built diagnostic and alert rules to respond faster the next time. They could benefit from those rules for months or even years.

Today, however, teams might roll out new versions of microservices almost daily. Any incident-based rules they build remain relevant for weeks at best. There's simply not a big enough payoff for investing in such semi-automated IM.

The Power of ML-based IM

Just as there are two reasons teams struggle with IM, there are two fundamental IM challenges. First is incident troubleshooting, which can involve sifting through millions or even billions of events. Hunting for an unknown root cause by recognizing patterns and identifying outliers in vast quantities of data is mentally exhausting for developers.

But not for ML.

ML is well-suited to pattern recognition and anomaly detection. It can quickly learn the baseline of time-series metrics, or normal cadence of events, by training in any environment over a period of time.
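The baseline-learning idea can be sketched in a few lines. The snippet below is a minimal illustration, not any vendor's implementation: it keeps a rolling window of recent metric values and flags points that deviate sharply from the window's mean, which is the simplest form of the "learn what normal looks like, then spot outliers" approach described above.

```python
from collections import deque

def detect_anomalies(values, window=60, threshold=3.0):
    """Flag indices whose value deviates more than `threshold` standard
    deviations from a rolling baseline of the previous `window` points."""
    baseline = deque(maxlen=window)
    anomalies = []
    for i, v in enumerate(values):
        if len(baseline) == window:
            mean = sum(baseline) / window
            std = (sum((x - mean) ** 2 for x in baseline) / window) ** 0.5
            # A perfectly flat baseline (std == 0) means any change is anomalous.
            if abs(v - mean) > threshold * std and abs(v - mean) > 0:
                anomalies.append(i)
        baseline.append(v)
    return anomalies
```

Production systems use far more robust models (seasonality, multiple metrics, event streams), but the core contrast with manual triage is the same: once the baseline is learned, outliers surface without anyone writing queries or rules.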

The second IM challenge is root-cause analysis. Human brains aren't designed to identify root causes at the complexity and scale of modern applications. But ML is very good at it — even for event streams, or continuous series of events, which tend to be closer to root causes than time-series metrics.

Something else ML does well is correlation. In a complex system, the symptoms of an incident might show up in a database, but the root cause might originate in an authentication service or a third-party API. Such correlations are much quicker to uncover with an ML model that has been trained on anomaly patterns and correlations from the same system.
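A crude proxy for this kind of cross-service correlation is temporal co-occurrence: services whose anomalies repeatedly fire within seconds of each other are likely related. The sketch below (an illustration under that assumption, not a real correlation model) scores service pairs by how often their anomaly timestamps land within a shared time window.

```python
from itertools import combinations

def correlated_services(anomaly_times, window=30):
    """Score service pairs by how often they were anomalous within
    `window` seconds of each other. `anomaly_times` maps a service
    name to a list of anomaly timestamps (epoch seconds)."""
    scores = {}
    for (svc_a, times_a), (svc_b, times_b) in combinations(anomaly_times.items(), 2):
        hits = sum(1 for ta in times_a for tb in times_b if abs(ta - tb) <= window)
        if hits:
            scores[(svc_a, svc_b)] = hits
    # Highest-scoring pairs first: the strongest correlation candidates.
    return dict(sorted(scores.items(), key=lambda kv: -kv[1]))
```

Given anomalies in an auth service at second 100 and in a database at second 110, this would rank the auth/database pair first, pointing the responder upstream of the visible symptom.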

Development teams are often hampered by a dearth of experienced site reliability engineers (SREs). Less experienced platform or network technicians often end up being the first line of IM defense. Because these team members are less familiar with application details, they rely on SREs to summarize incidents in written language — a costly and time-consuming process.

That's an area where LLMs excel. ML can identify errors and distill them into a short list of relevant events. LLMs can then describe those events in plain-language terms that less experienced staff can act on.
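The mechanics are straightforward to sketch: the ML-flagged events become the body of a summarization prompt. The function below is a hypothetical illustration of that prompt assembly; the event fields (`time`, `service`, `message`) and the prompt wording are assumptions, and the actual call to an LLM completion API is omitted since it varies by provider.

```python
def summarize_prompt(events):
    """Assemble a plain-language summarization prompt from ML-flagged
    root-cause events. The returned string would be sent to whatever
    LLM completion API the team uses (call omitted here)."""
    lines = "\n".join(
        f"- {e['time']} {e['service']}: {e['message']}" for e in events
    )
    return (
        "You are an SRE assistant. Summarize the likely root cause of the "
        "incident below in plain language for a first-line responder, and "
        "suggest one next step.\n\nFlagged events:\n" + lines
    )
```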

IM involves another type of connecting the dots, and that's contextualization. For example, have the root-cause events of today's problem been mentioned in past case notes, product documentation, or source code? If so, how do they inform our understanding of today's issue, and the best corrective steps? This is an area where generative AI powered by an LLM can help, by connecting ML-generated root-cause reports to the repository of tribal knowledge within each organization.

Achieving Positive IM Outcomes

Leveraging ML to improve IM can drive better outcomes in key ways:

Fast root-cause analysis

ML can uncover anomalies and correlations without an engineer having to know which queries to type, which filters to apply or which events to look for. Once the ML model understands what normal looks like, it can spot outliers and their connections with "bad" events immediately.

Eradication of silent bugs

For catching new bugs, statistical techniques can determine that a cluster of anomalies isn't likely to occur by chance. That can enable you to uncover silent bugs that might not yet be directly associated with an incident.
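One standard way to formalize "isn't likely to occur by chance" is a Poisson tail test: if anomalies normally arrive at some low baseline rate, the probability of seeing a sudden burst can be computed directly. The sketch below is a minimal illustration of that idea (the rate and interval values are assumptions for the example, not measured data).

```python
import math

def burst_p_value(count, rate, interval):
    """Probability of seeing `count` or more anomalies in `interval`
    seconds, assuming anomalies arrive as a Poisson process at the
    baseline `rate` (anomalies per second). A tiny p-value suggests
    the cluster is a real signal, even before any incident is reported."""
    lam = rate * interval
    # P(X >= count) = 1 - P(X <= count - 1) for a Poisson(lam) variable.
    cdf = sum(math.exp(-lam) * lam**k / math.factorial(k) for k in range(count))
    return 1.0 - cdf
```

For example, with a baseline of one anomaly every ~17 minutes (rate 0.001/s), ten anomalies inside a ten-minute window yields a vanishingly small p-value, flagging a probable silent bug worth investigating.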

Accelerated application recovery

Traditional incident troubleshooting is like hunting for a needle in a haystack without knowing what the needle looks like. ML is dramatically faster: it can present developers with a report listing a small number of potentially relevant events. A person still has to make the decision, but a 5X to 6X reduction in troubleshooting time is achievable.

Ultimately, ML will transform IM from a necessary evil into a DevOps enabler. In The Lean Startup, author Eric Ries talks about creating an organizational "immune system." The idea is that if you make small adjustments every time you uncover a problem, you eventually build up defenses that allow you to quickly and nearly automatically recover from problems.

That's the eventual goal of leveraging ML for IM. When an incident occurs, the system should be smart enough to recognize that something went wrong, determine what happened, and automatically roll back the change that caused the problem. We might never achieve that level of automation for all incidents, but ML should soon be able to automate IM in 80% of cases. In the meantime, investing in ML-enabled IM can help development teams achieve tangible improvements today in IM and their DevOps effectiveness.

Ajay Singh is CEO of Zebrium