How Machine Learning Can Transform Incident Management from a Burden into a DevOps Enabler
September 25, 2023

Ajay Singh
Zebrium

Among the most vital performance metrics for any DevOps team is mean time to recovery (MTTR) — the time to respond to a software or infrastructure issue and return to expected performance levels. As organizations embrace DevOps methodologies, with an emphasis on close collaboration and rapid iteration, they expect MTTR to improve.

But a long-running study of DevOps practices — Google Cloud's Accelerate State of DevOps Report — suggests that any historical gains in MTTR reduction have now plateaued. For years now, the time it takes to restore services has stayed about the same: less than a day for high performers but up to a week for middle-tier teams and up to a month for laggards. The fact that progress is flat despite big investments in people, tools and automation is a cause for concern. And it's happening because complexity is growing as quickly as these investments, meaning we're running harder without actually making progress.

The good news is that machine learning (ML) can help development teams break through this seemingly intractable MTTR barrier, transforming incident management (IM) into not only a core competency but also an enabler of successful DevOps.

The IM Conundrum

Development teams struggle with IM for two reasons. First, as organizations move to cloud-native architectures, developers become responsible for a complex environment built around microservices, with every application an assemblage of many discrete, loosely coupled components. Not even the most experienced developer can envision all the pieces of that puzzle and how they fit together.

The second reason the needle hasn't moved on IM is the rapid pace of change in today's applications. In the past, when teams uncovered an issue, they built diagnostic and alert rules to respond faster the next time. They could benefit from those rules for months or even years.

Today, however, teams might roll out new versions of microservices almost daily. Any incident-based rules they build remain relevant for weeks at best. There's simply not a big enough payoff for investing in such semi-automated IM.

The Power of ML-based IM

Just as there are two reasons teams struggle with IM, there are two fundamental IM challenges. First is incident troubleshooting, which can involve sifting through millions or even billions of events. Hunting for an unknown root cause by recognizing patterns and identifying outliers in vast quantities of data is mentally exhausting for developers.

But not for ML.

ML is well-suited to pattern recognition and anomaly detection. It can quickly learn the baseline of time-series metrics, or the normal cadence of events, by training on data from any environment over a period of time.
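To make the idea concrete, here is a minimal sketch of baseline-based anomaly detection. It is not Zebrium's algorithm, just an illustration of the principle: learn what "normal" looks like from a rolling window of history, then flag points that deviate sharply from it. The metric values and thresholds are invented for the example.

```python
import statistics

def detect_anomalies(series, window=20, threshold=3.0):
    """Flag points that deviate sharply from a rolling baseline.

    The 'baseline' here is simply the mean and standard deviation of the
    previous `window` samples; a point more than `threshold` standard
    deviations away is flagged as anomalous.
    """
    anomalies = []
    for i in range(window, len(series)):
        baseline = series[i - window:i]
        mean = statistics.mean(baseline)
        stdev = statistics.pstdev(baseline)
        if stdev == 0:
            # Perfectly flat baseline: any deviation at all is anomalous.
            if series[i] != mean:
                anomalies.append(i)
        elif abs(series[i] - mean) / stdev > threshold:
            anomalies.append(i)
    return anomalies

# Synthetic latency metric: steady at ~100 ms, with one spike at index 25.
metric = [100.0] * 25 + [500.0] + [100.0] * 10
print(detect_anomalies(metric))  # → [25]
```

A production system would use a richer model (seasonality, trend, event structure), but the core loop is the same: learn the baseline, then score departures from it.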

The second IM challenge is root-cause analysis. Human brains aren't designed to identify root causes at the complexity and scale of modern applications. But ML is very good at it — even for event streams, or continuous series of events, which tend to be closer to root causes than time-series metrics.

Something else ML does well is correlation. In a complex system, the symptoms of an incident might show up in a database, but the root cause might originate in an authentication service or a third-party API. Such connections are much quicker to uncover with an ML model that has been trained on anomaly patterns from the same system.
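One simple form this correlation can take is temporal: in a cascading failure, the component that turned anomalous first is often closest to the root cause. The sketch below (an illustration with invented service names, not a real product algorithm) groups anomaly onsets across services that fall within one time window and orders them by onset.

```python
from datetime import datetime, timedelta

def correlate_anomalies(anomalies, window_s=120):
    """Group anomalies across services within one time window and order
    them by onset time. `anomalies` maps service name -> list of anomaly
    timestamps. The earliest anomalous service is a root-cause candidate.
    """
    onsets = {svc: min(ts) for svc, ts in anomalies.items() if ts}
    earliest = min(onsets.values())
    in_window = [
        (svc, onset) for svc, onset in onsets.items()
        if (onset - earliest) <= timedelta(seconds=window_s)
    ]
    return sorted(in_window, key=lambda pair: pair[1])

t0 = datetime(2023, 9, 25, 12, 0, 0)
incident = {
    "auth-service": [t0],                          # first to misbehave
    "orders-db":    [t0 + timedelta(seconds=45)],  # symptom appears later
    "frontend":     [t0 + timedelta(seconds=90)],
}
ranked = correlate_anomalies(incident)
print([svc for svc, _ in ranked])  # → ['auth-service', 'orders-db', 'frontend']
```

Here the database and frontend show the symptoms, but ordering by onset points at the authentication service — exactly the kind of cross-service connection that is tedious for a human to reconstruct from raw logs.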

Development teams are often hampered by a dearth of experienced site reliability engineers (SREs). Less experienced platform or network technicians often end up being the first line of IM defense. Because these team members are less familiar with application details, they rely on SREs to summarize incidents in written language — a costly and time-consuming process.

That's an area where large language models (LLMs) excel. ML can identify errors and distill them into discrete events. LLMs can then quickly describe those events in plain-language terms that less experienced staff can act on.
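In practice, this means feeding the distilled events to an LLM as a summarization prompt. The sketch below only assembles such a prompt from an event list; the actual API call is omitted because the endpoint and model are deployment-specific, and the log lines shown are invented examples.

```python
def build_summary_prompt(events):
    """Assemble a plain-language summarization prompt from a list of
    root-cause events, ready to send to whichever LLM API is in use.
    """
    event_lines = "\n".join(f"- {e}" for e in events)
    return (
        "You are assisting an on-call engineer. Summarize the following "
        "root-cause events in plain language and suggest a first "
        "remediation step:\n" + event_lines
    )

prompt = build_summary_prompt([
    "2023-09-25T12:00:03Z auth-service ERROR token signing key rotation failed",
    "2023-09-25T12:00:48Z orders-db WARN connection pool exhausted",
])
print(prompt)
```

The value is in the pipeline shape: ML narrows billions of raw events down to a handful, and the LLM turns that handful into a summary a first-line responder can act on without an SRE in the loop.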

IM involves another type of connecting the dots, and that's contextualization. For example, have the root-cause events of today's problem been mentioned in past case notes, product documentation, or source code? If so, how do they inform our understanding of today's issue, and the best corrective steps? This is an area where generative AI powered by an LLM can help, by connecting ML-generated root-cause reports to the repository of tribal knowledge within each organization.
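A minimal stand-in for that contextualization step is retrieval: score past case notes and docs by how much they overlap with the root-cause terms. A real generative-AI pipeline would use embedding-based retrieval; this keyword-overlap sketch, with invented note titles and contents, just shows the shape of the lookup.

```python
def find_related_notes(root_cause_terms, knowledge_base):
    """Rank knowledge-base entries (title -> text) by word overlap with
    the root-cause terms. A crude proxy for semantic retrieval.
    """
    terms = {t.lower() for t in root_cause_terms}
    scored = []
    for title, text in knowledge_base.items():
        overlap = len(terms & set(text.lower().split()))
        if overlap:
            scored.append((overlap, title))
    return [title for overlap, title in sorted(scored, reverse=True)]

notes = {
    "INC-1042 postmortem": "database pool exhausted after auth token rotation failure",
    "Deploy runbook": "how to roll back a bad deploy of the frontend",
}
print(find_related_notes(["token", "rotation", "database"], notes))
# → ['INC-1042 postmortem']
```

Connecting today's root-cause report to that earlier postmortem is exactly the tribal-knowledge lookup an experienced SRE performs from memory — and the part an LLM with retrieval can automate.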

Achieving Positive IM Outcomes

Leveraging ML to improve IM can drive better outcomes in key ways:

Fast root-cause analysis

ML can uncover anomalies and correlations without anyone having to know which queries to type, which filters to apply, or which events to look for. Once the model understands what normal looks like, it can immediately spot outliers and their connections to "bad" events.

Eradication of silent bugs

For catching new bugs, statistical techniques can determine that a cluster of anomalies isn't likely to occur by chance. That can enable you to uncover silent bugs that might not yet be directly associated with an incident.
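One such statistical technique is a simple Poisson test: if anomalies normally arrive independently at some low background rate, the probability of seeing a tight cluster of them by chance is tiny. The sketch below computes that tail probability; the baseline rate and counts are invented for illustration.

```python
import math

def cluster_p_value(observed, rate_per_min, window_min):
    """Probability of seeing at least `observed` anomalies in a window,
    assuming anomalies arrive independently at the historical baseline
    rate (a Poisson model). A tiny p-value suggests a real, correlated
    event -- a candidate silent bug -- rather than background noise.
    """
    lam = rate_per_min * window_min  # expected count in the window
    # P(X >= observed) = 1 - P(X <= observed - 1)
    cdf = sum(math.exp(-lam) * lam**k / math.factorial(k)
              for k in range(observed))
    return 1.0 - cdf

# Baseline: ~0.1 anomalies/minute. We just saw 8 in a 5-minute window.
p = cluster_p_value(observed=8, rate_per_min=0.1, window_min=5)
print(p)  # a vanishingly small probability
```

With an expected count of 0.5 anomalies in the window, eight of them is wildly improbable by chance, so the cluster is worth surfacing even before any user-visible incident occurs.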

Accelerated application recovery

Traditional incident troubleshooting is like hunting for a needle in a haystack without knowing what the needle looks like. ML is dramatically faster: it can present developers with a report listing a small number of potentially relevant events. A person still has to make the final call, but a 5x to 6x reduction in troubleshooting time is achievable.

Ultimately, ML will transform IM from a necessary evil into a DevOps enabler. In The Lean Startup, author Eric Ries talks about creating an organizational "immune system." The idea is that if you make small adjustments every time you uncover a problem, you eventually build up defenses that allow you to quickly and nearly automatically recover from problems.

That's the eventual goal of leveraging ML for IM. When an incident occurs, the system should be smart enough to recognize that something went wrong, determine what happened, and automatically roll back the change that caused the problem. We might never achieve that level of automation for all incidents, but ML should soon be able to automate IM in 80% of cases. In the meantime, investing in ML-enabled IM can help development teams achieve tangible improvements today in IM and their DevOps effectiveness.

Ajay Singh is CEO of Zebrium