How Machine Learning Can Transform Incident Management from a Burden into a DevOps Enabler
September 25, 2023

Ajay Singh
Zebrium

Among the most vital performance metrics for any DevOps team is mean time to recovery (MTTR) — the time to respond to a software or infrastructure issue and return to expected performance levels. As organizations embrace DevOps methodologies, with an emphasis on close collaboration and rapid iteration, they expect MTTR to improve.

But a long-running study of DevOps practices — Google Cloud's Accelerate State of DevOps Report — suggests that any historical gains in MTTR reduction have now plateaued. For years now, the time it takes to restore services has stayed about the same: less than a day for high performers but up to a week for middle-tier teams and up to a month for laggards. The fact that progress is flat despite big investments in people, tools and automation is a cause for concern. And it's happening because complexity is growing as quickly as these investments, meaning we're running harder without actually making progress.

The good news is that machine learning (ML) can help development teams break through this seemingly intractable MTTR barrier, transforming incident management (IM) into not only a core competency but also an enabler of successful DevOps.

The IM Conundrum

Development teams struggle with IM for two reasons. First, as organizations move to cloud-native architectures, developers become responsible for a complex environment built around microservices, with every application an assemblage of many discrete, loosely coupled components. Not even the most experienced developer can envision all the pieces of that puzzle and how they fit together.

The second reason the needle hasn't moved on IM is the rapid pace of change in today's applications. In the past, when teams uncovered an issue, they built diagnostic and alert rules to respond faster the next time. They could benefit from those rules for months or even years.

Today, however, teams might roll out new versions of microservices almost daily. Any incident-based rules they build remain relevant for weeks at best. There's simply not a big enough payoff for investing in such semi-automated IM.

The Power of ML-based IM

Just as there are two reasons teams struggle with IM, there are two fundamental IM challenges. First is incident troubleshooting, which can involve sifting through millions or even billions of events. Hunting for an unknown root cause by recognizing patterns and identifying outliers in vast quantities of data is mentally exhausting for developers.

But not for ML.

ML is well-suited to pattern recognition and anomaly detection. It can quickly learn the baseline of time-series metrics, or the normal cadence of events, by training on data from any environment over a period of time.
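To make the idea concrete, here is a minimal sketch of baseline-based anomaly detection. It is not Zebrium's algorithm, just an illustration of the principle: learn what "normal" looks like from a rolling window of history, then flag points that deviate sharply from it. The metric values and thresholds are invented for the example.

```python
import statistics

def detect_anomalies(series, window=20, threshold=3.0):
    """Flag points that deviate sharply from a rolling baseline.

    The 'baseline' here is simply the mean and standard deviation of the
    previous `window` samples; a point more than `threshold` standard
    deviations away is flagged as anomalous.
    """
    anomalies = []
    for i in range(window, len(series)):
        baseline = series[i - window:i]
        mean = statistics.mean(baseline)
        stdev = statistics.pstdev(baseline)
        if stdev == 0:
            # Perfectly flat baseline: any deviation at all is anomalous.
            if series[i] != mean:
                anomalies.append(i)
        elif abs(series[i] - mean) / stdev > threshold:
            anomalies.append(i)
    return anomalies

# Synthetic latency metric: steady at ~100 ms, with one spike at index 25.
metric = [100.0] * 25 + [500.0] + [100.0] * 10
print(detect_anomalies(metric))  # → [25]
```

A production system would use a richer model (seasonality, trend, event structure), but the core loop is the same: learn the baseline, then score departures from it.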

The second IM challenge is root-cause analysis. Human brains aren't designed to identify root causes at the complexity and scale of modern applications. But ML is very good at it — even for event streams, or continuous series of events, which tend to be closer to root causes than time-series metrics.

Something else ML does well is correlation. In a complex system, the symptoms of an incident might show up in a database, but the root cause might originate in an authentication service or a third-party API. Such connections are much quicker to uncover with an ML model that has been trained on anomaly patterns from the same system.
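One simple form this correlation can take is temporal: in a cascading failure, the component that turned anomalous first is often closest to the root cause. The sketch below (an illustration with invented service names, not a real product algorithm) groups anomaly onsets across services that fall within one time window and orders them by onset.

```python
from datetime import datetime, timedelta

def correlate_anomalies(anomalies, window_s=120):
    """Group anomalies across services within one time window and order
    them by onset time. `anomalies` maps service name -> list of anomaly
    timestamps. The earliest anomalous service is a root-cause candidate.
    """
    onsets = {svc: min(ts) for svc, ts in anomalies.items() if ts}
    earliest = min(onsets.values())
    in_window = [
        (svc, onset) for svc, onset in onsets.items()
        if (onset - earliest) <= timedelta(seconds=window_s)
    ]
    return sorted(in_window, key=lambda pair: pair[1])

t0 = datetime(2023, 9, 25, 12, 0, 0)
incident = {
    "auth-service": [t0],                          # first to misbehave
    "orders-db":    [t0 + timedelta(seconds=45)],  # symptom appears later
    "frontend":     [t0 + timedelta(seconds=90)],
}
ranked = correlate_anomalies(incident)
print([svc for svc, _ in ranked])  # → ['auth-service', 'orders-db', 'frontend']
```

Here the database and frontend show the symptoms, but ordering by onset points at the authentication service — exactly the kind of cross-service connection that is tedious for a human to reconstruct from raw logs.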

Development teams are often hampered by a dearth of experienced site reliability engineers (SREs). Less experienced platform or network technicians often end up being the first line of IM defense. Because these team members are less familiar with application details, they rely on SREs to summarize incidents in written language — a costly and time-consuming process.

That's an area where large language models (LLMs) excel. ML can identify errors and distill them into discrete events. LLMs can then quickly describe those events in plain-language terms that less experienced staff can act on.
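In practice, this means feeding the distilled events to an LLM as a summarization prompt. The sketch below only assembles such a prompt from an event list; the actual API call is omitted because the endpoint and model are deployment-specific, and the log lines shown are invented examples.

```python
def build_summary_prompt(events):
    """Assemble a plain-language summarization prompt from a list of
    root-cause events, ready to send to whichever LLM API is in use.
    """
    event_lines = "\n".join(f"- {e}" for e in events)
    return (
        "You are assisting an on-call engineer. Summarize the following "
        "root-cause events in plain language and suggest a first "
        "remediation step:\n" + event_lines
    )

prompt = build_summary_prompt([
    "2023-09-25T12:00:03Z auth-service ERROR token signing key rotation failed",
    "2023-09-25T12:00:48Z orders-db WARN connection pool exhausted",
])
print(prompt)
```

The value is in the pipeline shape: ML narrows billions of raw events down to a handful, and the LLM turns that handful into a summary a first-line responder can act on without an SRE in the loop.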

IM involves another type of connecting the dots, and that's contextualization. For example, have the root-cause events of today's problem been mentioned in past case notes, product documentation, or source code? If so, how do they inform our understanding of today's issue, and the best corrective steps? This is an area where generative AI powered by an LLM can help, by connecting ML-generated root-cause reports to the repository of tribal knowledge within each organization.
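A minimal stand-in for that contextualization step is retrieval: score past case notes and docs by how much they overlap with the root-cause terms. A real generative-AI pipeline would use embedding-based retrieval; this keyword-overlap sketch, with invented note titles and contents, just shows the shape of the lookup.

```python
def find_related_notes(root_cause_terms, knowledge_base):
    """Rank knowledge-base entries (title -> text) by word overlap with
    the root-cause terms. A crude proxy for semantic retrieval.
    """
    terms = {t.lower() for t in root_cause_terms}
    scored = []
    for title, text in knowledge_base.items():
        overlap = len(terms & set(text.lower().split()))
        if overlap:
            scored.append((overlap, title))
    return [title for overlap, title in sorted(scored, reverse=True)]

notes = {
    "INC-1042 postmortem": "database pool exhausted after auth token rotation failure",
    "Deploy runbook": "how to roll back a bad deploy of the frontend",
}
print(find_related_notes(["token", "rotation", "database"], notes))
# → ['INC-1042 postmortem']
```

Connecting today's root-cause report to that earlier postmortem is exactly the tribal-knowledge lookup an experienced SRE performs from memory — and the part an LLM with retrieval can automate.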

Achieving Positive IM Outcomes

Leveraging ML to improve IM can drive better outcomes in key ways:

Fast root-cause analysis

ML can uncover anomalies and correlations without anyone having to know which queries to type, which filters to apply, or which events to look for. Once the model understands what normal looks like, it can immediately spot outliers and their connections to "bad" events.

Eradication of silent bugs

For catching new bugs, statistical techniques can determine that a cluster of anomalies isn't likely to occur by chance. That can enable you to uncover silent bugs that might not yet be directly associated with an incident.
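One such statistical technique is a simple Poisson test: if anomalies normally arrive independently at some low background rate, the probability of seeing a tight cluster of them by chance is tiny. The sketch below computes that tail probability; the baseline rate and counts are invented for illustration.

```python
import math

def cluster_p_value(observed, rate_per_min, window_min):
    """Probability of seeing at least `observed` anomalies in a window,
    assuming anomalies arrive independently at the historical baseline
    rate (a Poisson model). A tiny p-value suggests a real, correlated
    event -- a candidate silent bug -- rather than background noise.
    """
    lam = rate_per_min * window_min  # expected count in the window
    # P(X >= observed) = 1 - P(X <= observed - 1)
    cdf = sum(math.exp(-lam) * lam**k / math.factorial(k)
              for k in range(observed))
    return 1.0 - cdf

# Baseline: ~0.1 anomalies/minute. We just saw 8 in a 5-minute window.
p = cluster_p_value(observed=8, rate_per_min=0.1, window_min=5)
print(p)  # a vanishingly small probability
```

With an expected count of 0.5 anomalies in the window, eight of them is wildly improbable by chance, so the cluster is worth surfacing even before any user-visible incident occurs.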

Accelerated application recovery

Traditional incident troubleshooting is like hunting for a needle in a haystack without knowing what the needle looks like. ML is dramatically faster: it can present developers with a report listing a small number of potentially relevant events. A person still has to make the final call, but a 5x to 6x reduction in troubleshooting time is achievable.

Ultimately, ML will transform IM from a necessary evil into a DevOps enabler. In The Lean Startup, author Eric Ries talks about creating an organizational "immune system." The idea is that if you make small adjustments every time you uncover a problem, you eventually build up defenses that allow you to quickly and nearly automatically recover from problems.

That's the eventual goal of leveraging ML for IM. When an incident occurs, the system should be smart enough to recognize that something went wrong, determine what happened, and automatically roll back the change that caused the problem. We might never achieve that level of automation for all incidents, but ML should soon be able to automate IM in 80% of cases. In the meantime, investing in ML-enabled IM can help development teams achieve tangible improvements today in IM and their DevOps effectiveness.

Ajay Singh is CEO of Zebrium