Automated Log Analytics for DevOps - Troubleshooting at the Speed of Agile Development
June 06, 2014

Gal Berg
XpoLog

The move to Agile development with continuous deployment has created new challenges that require restructuring the relationship between the development and operations environments. Of course, the term "DevOps" means different things to different people, because "Agile" itself covers a lot of ground. But the main principle is to form a bridge between the development and production environments in order to streamline processes and cope with the continuous deployment of code.

Although the exact role and responsibilities of DevOps may not be defined across the board, the role has to be able to deal rapidly with issues arising from the code and from the connection between the code and the production environment. These responsibilities may include release management, configuration management, and dealing with problems (bugs) that find their way into the production environment. In this blog, I will focus on the latter and examine the tools and methods that can help expedite this process in order to cope with these continuous changes.

The Need for Investigation on Steroids

In an Agile environment, multiple development teams typically work simultaneously, introducing new code or code changes on a daily basis, and sometimes several times a day. In such a rapid environment, it is extremely common for problems to "slip through the cracks" into the production environment. By the time these issues surface, they have probably already impacted end-users, requiring a speedy resolution.

This means that DevOps teams have to conduct an extremely rapid investigation to identify the root cause of the problem. Identifying where the symptoms are coming from and then isolating the root cause is a challenging task. Symptoms can appear across the various layers of a large hybrid IT environment: servers/VMs, storage devices, databases, the front-end, and the server-side code. Investigations that traditionally took hours or days to complete must now be completed within minutes.

DevOps must examine the infrastructure and application logs as part of this investigation process. However, the massive amount of log records that are produced in these environments makes it virtually impossible to do this manually. It's much like trying to find a needle in a haystack.

In most cases these investigations are conducted through the use of log management and analysis systems that collect and aggregate these logs (from infrastructure elements and applications), centralizing them in a single place and then providing search capabilities to explore the data.
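The core of such a system boils down to two steps: aggregation (tagging every log line with its source and pooling the lines in one place) and search across the pool. As a rough illustration only, with hypothetical function and field names rather than any particular product's API, a minimal sketch in Python might look like this:

```python
import re

def aggregate(sources):
    """Merge log lines from multiple sources into one searchable list.
    `sources` maps a source name (e.g. a server or app name, both
    hypothetical here) to that source's log lines."""
    records = []
    for name, lines in sources.items():
        for line in lines:
            records.append({"source": name, "line": line})
    return records

def search(records, pattern):
    """Case-insensitive regex search across all aggregated records."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [r for r in records if rx.search(r["line"])]

# Illustrative data: two servers, three log lines, one query.
records = aggregate({
    "web01": ["GET /orders 200", "ERROR: timeout calling db"],
    "db01":  ["connection pool exhausted"],
})
hits = search(records, r"error|exhausted")  # finds one hit per server
```

Real log management systems add collection agents, indexing, and retention on top of this, but the user-facing model is essentially the same: one pool, one query.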

These solutions make it possible to conduct the investigation, but they still rely completely on the investigation skills and the knowledge of the user. The user must know exactly what to search for, and have a deep understanding of the environment in order to use them effectively.

It's important to understand that application log files are far less predictable than the log files of infrastructure elements. The errors are essentially free-text messages and error numbers that developers have introduced into the code in an inconsistent manner. Consequently, even for a skilled user, search queries tend to yield thousands of results while still missing important ones. That leaves the user in the same "needle in a haystack" situation.
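To see why keyword search struggles here, consider a small hypothetical example: three developers report the same payment failure in three different styles, and a query for the "obvious" keyword finds only one of them.

```python
import re

# Hypothetical application log lines: the same failure, logged
# three inconsistent ways by three different developers.
lines = [
    "2014-06-06 10:01:02 ERROR PaymentService err=5012",
    "2014-06-06 10:01:03 payment failed for user 8841",
    "2014-06-06 10:01:04 Unhandled exception in charge(): timeout",
]

def grep(lines, pattern):
    """Return the lines matching a regex pattern."""
    rx = re.compile(pattern)
    return [l for l in lines if rx.search(l)]

narrow = grep(lines, r"ERROR")                      # misses 2 of 3
broad = grep(lines, r"(?i)error|fail|exception")    # catches all 3
```

The broad query works only because the user already guessed every variant the developers used; in a real codebase with hundreds of developers, nobody holds that list in their head.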

As they say, "the best place to hide a dead body is page 2 of Google's search results," and log search is not very different. This process simply fails in light of the fast-paced reality of DevOps activity.

Assisting DevOps with Augmented Search

A new breed of log management and analysis technologies has evolved to solve this challenge. These technologies expedite the identification and investigation processes through the use of Augmented Search. Designed specifically to deal with the chaotic and unpredictable nature of application logs, Augmented Search takes into consideration that users don't necessarily know what to search for, especially in the chaotic application layer environment.

The analysis algorithm automatically identifies errors, risk factors, and problem indicators, while analyzing their severity by combining semantic processing, statistical models, and machine learning to analyze and "understand" the events in the logs. These insights are displayed as intelligence layers on top of the search results, helping the user to quickly discover the most relevant and important information.
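The exact algorithms in such products are proprietary, but the general idea can be sketched with two simple signals: severity keywords, and the statistical rarity of a message's "template" (a crude stand-in for semantic processing). Everything below is an illustrative assumption, not any vendor's actual implementation:

```python
import math
import re
from collections import Counter

# Hypothetical keyword weights standing in for semantic severity analysis.
SEVERITY_HINTS = {"fatal": 3.0, "error": 2.0, "exception": 2.0, "warn": 1.0}

def template(line):
    """Crude normalization: strip numbers so similar messages
    collapse into one template for frequency counting."""
    return re.sub(r"\d+", "<N>", line.lower())

def score_events(lines):
    """Score each line by severity keywords plus the rarity of its
    template; rare message shapes are more likely to be interesting."""
    counts = Counter(template(l) for l in lines)
    total = len(lines)
    scored = []
    for line in lines:
        kw = sum(w for k, w in SEVERITY_HINTS.items() if k in line.lower())
        rarity = math.log(total / counts[template(line)])
        scored.append((kw + rarity, line))
    return sorted(scored, reverse=True)
```

Even this toy version pushes one fatal event above thousands of routine lines; production engines layer real statistical models and machine learning on the same principle.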

Although DevOps engineers may be familiar with the infrastructure and system architecture, the data is constantly changing under fast-paced continuous deployment cycles and constant code changes. DevOps teams can use their intuition and knowledge to start investigating each problem, but the dynamic nature of the log data creates blind spots that consume time.

Combining the decisions DevOps engineers make during their investigation with the Augmented Search engine's information layers on the important problems that occurred during the period of interest can guide them through these blind spots quickly. Pairing the user's intellect and familiarity with the system's architecture with Augmented Search's machine learning on the dynamic data makes it faster and easier to focus on the most relevant data. Here's how that works in practice:

Suppose one of the servers goes down and every attempt to restart it fails. However, since its process is still running, the server appears to be up. Meanwhile, end-users are complaining that an application is not responding. In a complex environment with a large number of servers, this symptom could be related to many different problems.

Pinpointing the server behind this problem can be difficult, as it seems to be up. And even once you know which server is responsible, finding the root cause would normally require a lengthy investigation.

Instead of requiring the user to go over thousands of search results, Augmented Search displays a layer that highlights the critical events that occurred during the specified time period. These highlights include information about the sources of the events, assisting in the triage process. At this point, DevOps can understand the impact of the problem (e.g. which servers are affected by it) and then continue the investigation to find the root cause.
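As a rough, hypothetical sketch of such a highlight layer (the event schema below is invented for illustration, not a product API), it amounts to filtering already-scored events to the time window of interest and grouping them by source so the triage view shows which servers are affected:

```python
from collections import defaultdict
from datetime import datetime

def highlight(events, start, end, min_severity=2):
    """Return only the critical events inside the time window,
    grouped by source, for a triage-oriented view.
    Each event is a dict with hypothetical keys:
    time (datetime), source, severity (int), message."""
    by_source = defaultdict(list)
    for ev in events:
        if start <= ev["time"] <= end and ev["severity"] >= min_severity:
            by_source[ev["source"]].append(ev["message"])
    return dict(by_source)

# Illustrative events: a critical one in the window, a routine one,
# and an old critical one outside the window.
events = [
    {"time": datetime(2014, 6, 6, 10, 1), "source": "app02",
     "severity": 3, "message": "process alive but not accepting connections"},
    {"time": datetime(2014, 6, 6, 10, 2), "source": "app02",
     "severity": 1, "message": "heartbeat ok"},
    {"time": datetime(2014, 6, 6, 9, 0), "source": "db01",
     "severity": 3, "message": "old failure"},
]
view = highlight(events, datetime(2014, 6, 6, 9, 30),
                 datetime(2014, 6, 6, 10, 30))
```

In this toy run, only `app02` surfaces, with a single critical message: exactly the "which server is actually behind this?" answer the triage step needs.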

Using Augmented Search, DevOps can identify a problem and its root cause in a matter of seconds, instead of examining thousands of log events or running multiple checks on the various servers. Adding this type of visibility to log analysis, and the ability to surface critical events out of tens of thousands (and often millions) of events, is essential in a fast-paced environment in which changes are constantly introduced.

Gal Berg is VP R&D for XpoLog.
