How Artificial Intelligence is Revolutionizing IT Operation Analytics
July 20, 2016

Akhil Sahai
Perspica

After many science fiction plots and decades of research, Artificial Intelligence (AI) is being applied across industries for a wide variety of purposes. AI, Big Data and human domain knowledge are converging to create possibilities formerly only dreamed of. The time is ripe for IT operations to incorporate AI into its processes.

IT infrastructures today are increasingly dynamic and agile, but at the same time extraordinarily complex. Humans can no longer sift through the variety, volume and velocity of Big Data streaming out of IT infrastructures in real time, which makes AI, and machine learning in particular, a powerful and necessary tool for automating analysis and decision-making. By bridging the gap between Big Data and humans, and by capturing human domain knowledge, machine learning can provide the operational intelligence needed to relieve much of the burden of near real-time, informed decision-making. Industry analysts agree. In fact, Gartner named machine learning among the top 10 strategic technologies for 2016, noting “The explosion of data sources and complexity of information makes manual classification and analysis infeasible and uneconomical.”

IT administrators and operators on TechOps teams, and Site Reliability Engineers (SREs) on DevOps teams, are tasked with manually gathering this disparate information and applying their domain expertise in an attempt to make informed decisions. While these professionals are skilled at what they do, trying to analyze so much data from multiple tools leaves the door wide open to human error. Analytics based on machine learning are therefore quickly becoming a necessity to ensure the availability, reliability, performance and security of applications in today's digital, virtualized and hybrid-cloud environments.

The traditional approach centered on using multiple monitoring tools, one per IT silo, that provided IT operations teams with information about their virtual and physical infrastructure, application infrastructure and application transaction performance. While these tools provide pieces of the puzzle, they offer only a narrow view of the IT infrastructure and, therefore, cover only one side of the tool chain. The other side consists of service desk tools that manage tickets and change management. More often than not, humans bridge the gap between yesterday's siloed monitoring tools and service desk applications with their own domain expertise.

What Analytics Can Do Now

Today, the entire application infrastructure stack is overflowing with Big Data. TechOps and DevOps environments need to automate, learn and make intelligent, informed decisions based on real-time analysis of all that data. Following are key analytics for IT operations:

1. Anomaly Detection: Machine learning algorithms should be able to examine contextual, historical and sudden changes in the behavior of objects to detect anomalies. Understanding when there is a real anomaly and, more importantly, when there is not, is critical to avoid generating false alarms. This is the bedrock of what is typically referred to as diagnostic analytics.
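As a minimal sketch of the idea, the contextual part of anomaly detection can be approximated with a rolling z-score: a point is flagged only when it deviates sharply from the recent behavior of that same metric. The function name, window size and threshold below are illustrative choices, not taken from any particular product.

```python
from collections import deque
from statistics import mean, stdev

def detect_anomalies(values, window=10, threshold=3.0):
    """Flag indices whose value deviates more than `threshold` standard
    deviations from the rolling mean of the preceding `window` samples."""
    history = deque(maxlen=window)
    anomalies = []
    for i, v in enumerate(values):
        if len(history) == window:
            mu, sigma = mean(history), stdev(history)
            # Guard against a perfectly flat history (sigma == 0).
            if sigma > 0 and abs(v - mu) / sigma > threshold:
                anomalies.append(i)
        history.append(v)
    return anomalies
```

Judging each point against its own recent context, rather than a fixed limit, is what keeps a normally noisy metric from generating a flood of false alarms.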

2. Topology Analysis: This type of analytics understands the hierarchical, peer-to-peer and temporal relationships between hybrid cloud elements. Topology is something every IT administrator or SRE should know well. This type of analysis should be able to self-learn the inter-relationships of objects and the impact of their performance on one another. Learning those relationships and maintaining that understanding in order to spot trouble in time is extremely important for both TechOps and DevOps environments.
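The relationship-tracking described above can be sketched as a small dependency graph: each edge records that one object runs on, or relies on, another, and a breadth-first walk answers "what else is affected if this object degrades?" The class and object names are hypothetical, chosen for illustration.

```python
from collections import defaultdict, deque

class Topology:
    """Tracks hierarchical relationships between infrastructure
    objects (hosts, VMs, services) as a directed dependency graph."""

    def __init__(self):
        self.depends_on = defaultdict(set)  # object -> what it relies on
        self.supports = defaultdict(set)    # object -> what relies on it

    def add_dependency(self, obj, dependency):
        self.depends_on[obj].add(dependency)
        self.supports[dependency].add(obj)

    def impacted_by(self, obj):
        """Every object whose performance may suffer when `obj`
        degrades, found by walking the reverse-dependency edges."""
        seen, queue = set(), deque([obj])
        while queue:
            current = queue.popleft()
            for dependent in self.supports[current]:
                if dependent not in seen:
                    seen.add(dependent)
                    queue.append(dependent)
        return seen
```

For example, with `vm1` depending on `host1` and `app1` depending on `vm1`, a problem on `host1` is immediately understood to threaten both the VM and the application.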

3. Behavior Profiling: This is about understanding the behavior profile of every metric, how those metrics roll up into the behavior of an object, and then how object behaviors relate to one another across the hybrid cloud environment. It is a multi-dimensional problem, and understanding and adapting to “normal” behavior is extremely important.
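One simple way to make "normal" context-dependent, sketched below under illustrative names, is to learn a separate baseline per hour of the day, so that a CPU spike during the nightly batch window is treated differently from the same spike at noon.

```python
from collections import defaultdict
from statistics import mean, stdev

class BehaviorProfile:
    """Learns an hour-of-day baseline for one metric, so that
    'normal' can vary with the daily rhythm of the workload."""

    def __init__(self):
        self.samples = defaultdict(list)  # hour -> observed values

    def observe(self, hour, value):
        self.samples[hour].append(value)

    def is_normal(self, hour, value, tolerance=3.0):
        history = self.samples[hour]
        if len(history) < 2:
            return True  # too little data to judge; assume normal
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            return value == mu
        return abs(value - mu) <= tolerance * sigma
```

A full behavior-profiling system would maintain such a profile per metric and then correlate profiles across objects; this fragment shows only the per-metric, per-context building block.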

4. Root Cause: By finding the specific cause and impact of an incident, root-cause analysis can fast-track resolution and substantially reduce mean time to repair.
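A minimal sketch of one common root-cause heuristic, under the assumption that anomaly detection and a dependency map (as in the topology discussion above) are already available: among all currently anomalous objects, the likely root causes are those whose own dependencies are still healthy, since their trouble cannot be explained by anything beneath them.

```python
def likely_root_causes(anomalous, depends_on):
    """Filter a set of anomalous objects down to those whose
    dependencies are all healthy; an anomaly with no anomalous
    dependency underneath it is a likely root cause.

    `depends_on` maps each object to the objects it relies on."""
    return {
        obj for obj in anomalous
        if not any(dep in anomalous for dep in depends_on.get(obj, ()))
    }
```

So if an application, its VM and its host are all misbehaving at once, the host, having no troubled dependency of its own, is surfaced as the place to start, instead of three separate alerts.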

5. Predictive: These analytics help operators identify early indicators and provide insights into looming problems that may eventually lead to performance degradation and outages. Predictive analytics are also good at providing early insights into anomalies to better plan for what's ahead.
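As one concrete example of a predictive early indicator, a steadily climbing metric (say, disk usage) can be fitted with a straight line to estimate how long remains before it crosses a critical threshold. The sketch below uses plain ordinary least squares; the function name and threshold semantics are illustrative.

```python
def time_to_threshold(samples, threshold):
    """Fit a least-squares line to equally spaced samples of a metric
    and estimate how many intervals remain until it reaches
    `threshold`. Returns None if the trend is flat or falling."""
    n = len(samples)
    x_mean = (n - 1) / 2
    y_mean = sum(samples) / n
    ss_xy = sum((x - x_mean) * (y - y_mean)
                for x, y in enumerate(samples))
    ss_xx = sum((x - x_mean) ** 2 for x in range(n))
    slope = ss_xy / ss_xx
    if slope <= 0:
        return None  # not trending toward the threshold
    intercept = y_mean - slope * x_mean
    # Solve threshold = slope * x + intercept, relative to the last sample.
    return (threshold - intercept) / slope - (n - 1)
```

Real predictive analytics would account for seasonality and non-linear trends, but even this linear estimate turns "the disk is filling up" into "the disk fills up in roughly five polling intervals," which is what operators need in order to plan ahead.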

6. Prescriptive: When you are looking for intelligent and actionable recommendations to remediate an incident, prescriptive analytics are the way to go. These recommendations should capture tribal knowledge gathered over the years in the organization and best practices in the industry, and may even be crowd-sourced to capture state-of-the-art knowledge. These analytics provide the opportunity to finally close the loop in automated IT Operations Management.
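In its simplest form, the captured tribal knowledge behind prescriptive analytics resembles a curated runbook keyed by diagnosed condition; the entries below are invented examples of what such curation might produce.

```python
# A tiny knowledge base mapping a diagnosed condition to ordered
# remediation steps. In practice these entries would be curated from
# runbooks, past incident resolutions and industry best practices.
RUNBOOK = {
    "disk_full": [
        "rotate and compress application logs",
        "expand the volume or add capacity",
    ],
    "memory_leak": [
        "restart the affected service",
        "file a defect with heap-growth evidence attached",
    ],
}

def recommend(condition):
    """Return remediation steps for a diagnosed condition, falling
    back to a generic escalation when nothing matches."""
    return RUNBOOK.get(condition, ["escalate to the on-call engineer"])
```

Closing the loop then means feeding the output of root-cause analysis into this lookup and, where safe, executing the first step automatically.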

Embracing Machine Learning

Working in IT operations has been tough for a while now: teams constantly react to incidents and try to resolve them after they have already spun out of control. AI provides technologies that help automate many of these tasks so that incidents can be handled before they escalate. Automating IT operational tasks, preventing outages in the first place, and getting to the root cause quickly and in an automated way is the next frontier in remediating these issues.

As Gartner so eloquently put it, manual classification and analysis is infeasible and uneconomical. Not even an army of IT staff could review monitoring data quickly and thoroughly enough to identify incidents. Fortunately, AI has the capacity to enable real-time decision making by using multiple analytics capabilities simultaneously to see what's going on across the application stack.

Akhil Sahai, Ph.D., is VP Product Management at Perspica.
