Why We're Still Talking About DevOps in 2024
August 06, 2024

Micah Adams
Focused Labs

The CrowdStrike outage has created no shortage of commentary, speculation, and armchair analysis on exactly how such a massive failure could occur. The level of discussion and scrutiny is warranted; most observers agree this is probably the largest IT outage in history. The impact on CrowdStrike's stock price was immediate, with shares tumbling dramatically in the wake of the incident.

As of this writing, we don't have an official post-incident analysis, but most observers agree that the global outage was caused by a code deployment that skipped quality checks before it went out.

The knee-jerk response of "How could someone possibly let this happen?" is both clueless and misinformed. Oh, if only I could alleviate your fears and somehow console you that this is an uncommon issue in the software development lifecycle. If only we were all so pious in our practices, processes, and methodologies that deployments like this were a rarity across technology systems. If only the controls we put in place were foolproof, unbreakable, and predictable.

Rather than disparage the incident as a careless mistake made by uninformed or underachieving engineers, we should take a page from some of the highest achievers in resiliency, reliability, and scale in the technology sector. Let us strive to consider the outage with a blameless mindset. If you remove the actor from the situation and reframe the outage based on the facts and data of the incident, you suddenly begin to ask the right questions. Namely, "Why?"

Incident response practices don't stop at the first "Why?", though. We must strive to fully understand the root cause of the incident, and so a blameless postmortem culture keeps scrutinizing the facts until it is satisfied that the underlying "why" is truly understood.

Perhaps the build systems were experiencing errors. Maybe the deploy process was circumvented to hit a business deadline (a technologist's favorite scenario). Maybe (one of my favorite sources of outages) an entire team reviewed the code in question, line by line, all agreed that it "Looks Good To Me," and blessed the deployment with another of my favorite clichés: "Ship It!"

The harsh reality is that all who work in technology are one bad push away from experiencing their own personally inflicted CrowdStrike hellscape. A more sobering take is that we don't always recognize the "bad code" until it is well underway, causing chaos and pain for all of our stakeholders.

Perhaps those of us with smaller market share and less global presence should take solace in the fact that, due to our size, we'll never experience anything this bad. Perhaps.

In light of this historic event, may I suggest that we take a moment to reflect and get our own respective houses in order. Let's reassess our DevOps practices and see what we can shore up before the inevitable happens.

Reassessing Your DevOps Practices

The CrowdStrike outage serves as a stark reminder that no system, however great or small, is immune to failure. We can lean on some of the core tenets of the DevOps philosophy to guide our self-analysis.

Embrace a Culture of Continuous Improvement

DevOps is not a set-it-and-forget-it approach, and you're never really "finished" with a DevOps practice. As your business goals change and your technology matures, new challenges and unforeseen variability will constantly test your team. Create a culture that regularly reviews and refines your processes, tools, and practices. Encourage an environment where feedback is regularly given, welcomed, and acted upon.

Continuous Integration, Continuous Deployment

The build and integration phase of the software lifecycle is the most important time to scrutinize code. Automated testing, linting, and ad-hoc test environments are all excellent technical resources for catching bugs before they are released. But as with any good DevOps practice, this foundational skill asks almost as much of the engineers as of the machines. Training teams to integrate their code changes daily, reducing the amount of time an engineer writes code in isolation, and treating each change set as code that could be deployed at any moment creates a constant, high-fidelity feedback loop. By subjecting potential bugs and outage-creators to constant scrutiny, you'll ship more resilient code, more often.
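
As a minimal sketch of what "every change set is deployable" can look like in practice, here is a hypothetical pre-merge quality gate. The choice of pytest and ruff is an assumption for illustration, not a prescription; swap in whatever test and lint tools your stack already uses.

```python
# pre_merge_gate.py -- hypothetical pre-merge quality gate (illustrative only).
# Assumes the project uses pytest and ruff; substitute your own tooling.
import subprocess
import sys

CHECKS = [
    ("unit tests", ["pytest", "--maxfail=1", "--quiet"]),
    ("lint", ["ruff", "check", "."]),
]

def run_checks() -> bool:
    """Run every check; a single failure blocks the change set."""
    for name, command in CHECKS:
        print(f"Running {name}: {' '.join(command)}")
        result = subprocess.run(command)
        if result.returncode != 0:
            print(f"FAILED: {name} -- this change set is not deployable.")
            return False
    return True

if __name__ == "__main__":
    sys.exit(0 if run_checks() else 1)
```

Wired into a CI pipeline, a gate like this makes the daily-integration habit cheap to enforce: a red result stops the change before it ever becomes someone's outage.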

Become an Incident-Centric Operation

Notice, though, that I said "more often," not "always." The harsh reality is that even if you achieve the zen-like state of continuous integration and deployment, you will have incidents. Read that again.

If we know that an incident is always around the corner, we can invest in preparing our teams for the inevitable. Standardize an incident response practice that prepares your team to respond as efficiently as possible. Codify the practice of documenting incident timelines and post-incident reports. Dedicate time for a wide group of your engineering teams to review incidents and create plans of action to remediate them. Build out your product roadmap so that you prioritize fixing, triaging, and extending the learnings you've gained from the formalized practice of incident response. Set roles, responsibilities, and sustainable incident response schedules so your team can confidently navigate outages when they occur.
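
To make "codify the practice of documenting incident timelines" concrete, here is one possible shape for an incident record, sketched as plain Python dataclasses. The field names, severity labels, and identifiers are my own assumptions, not a standard; the point is that a timeline captured as data while the incident is live makes the postmortem far easier to write.

```python
# incident_record.py -- one hypothetical way to codify incident timelines
# and post-incident reports as structured data instead of ad-hoc documents.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class TimelineEntry:
    timestamp: datetime
    author: str
    note: str  # e.g. "rolled back release 2024.08.06-rc2"

@dataclass
class Incident:
    identifier: str
    severity: str                      # e.g. "SEV1" (labels are illustrative)
    commander: str                     # who owns the response right now
    timeline: list[TimelineEntry] = field(default_factory=list)
    contributing_factors: list[str] = field(default_factory=list)
    action_items: list[str] = field(default_factory=list)

    def log(self, author: str, note: str) -> None:
        """Append a timestamped entry so the postmortem writes itself."""
        self.timeline.append(
            TimelineEntry(datetime.now(timezone.utc), author, note)
        )

# Usage sketch with made-up values:
incident = Incident(identifier="INC-042", severity="SEV1", commander="on-call SRE")
incident.log("on-call SRE", "paging storm started; 14:02 UTC deploy suspected")
incident.log("on-call SRE", "rollback initiated")
```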

The investment you make in your incident response practice will create happier customers, resilient systems, cross-functional teams, and a shared ownership of the success of your business.

Invest in Monitoring and Observability

It's not sufficient to know that your Kubernetes cluster has oodles of headroom and can scale to global traffic if your customers are fed up with the constant workarounds they have to implement to use your application. Observability today is more than just knowing your machines are happy.

With increasing standardization in open source observability tools like OpenTelemetry, understanding your customer experience is as achievable as knowing when your virtual machine is going to run out of disk space. Adopting an observability stance that embraces traces, logs, and metrics means you can see the incident before it comes knocking on your door.
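
As a minimal sketch, assuming the opentelemetry-api and opentelemetry-sdk Python packages with a console exporter (a real deployment would point an OTLP exporter at your backend instead), instrumenting a request handler with a span and a failure counter might look roughly like this. The service, span, and metric names are invented for illustration.

```python
# checkout_tracing.py -- rough sketch of trace + metric instrumentation
# with the OpenTelemetry Python SDK; exporter choice and all names are
# illustrative assumptions, not a prescribed setup.
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Wire spans to a console exporter so the sketch is self-contained.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")
meter = metrics.get_meter("checkout-service")
checkout_failures = meter.create_counter(
    "checkout.failures", description="Checkouts that errored out"
)

def handle_checkout(cart_id: str) -> None:
    # Each request becomes a span, so a spike in errors or latency shows up
    # in your observability platform before customers start filing tickets.
    with tracer.start_as_current_span("handle_checkout") as span:
        span.set_attribute("cart.id", cart_id)
        try:
            ...  # business logic goes here
        except Exception as exc:
            checkout_failures.add(1, {"cart.id": cart_id})
            span.record_exception(exc)
            raise
```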

Instrumenting your code to achieve observability, like all good practices, doesn't reside solely on the machine. The practice of instrumenting your code should feel very similar to your adherence to Test-Driven Development (you are doing TDD, right?). Instrument the code before you implement it. Create fast feedback loops during the development process so your teams can see their metrics in your favorite observability platform as they build your next exciting feature.
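
In the same instrument-first, TDD spirit, the instrumentation itself can be asserted on before the feature really exists. A rough sketch, assuming the OpenTelemetry SDK's in-memory span exporter (which is intended for tests) and a hypothetical process_order function whose business logic has yet to be written:

```python
# test_instrumentation_first.py -- write the observability assertion before
# the feature; process_order, its span name, and attributes are assumptions.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter

exporter = InMemorySpanExporter()
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(exporter))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("orders-service")

def process_order(order_id: str) -> None:
    # The feature starts life as little more than its instrumentation.
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)
        # ...real business logic lands here later...

def test_process_order_emits_a_span() -> None:
    exporter.clear()
    process_order("order-123")
    spans = exporter.get_finished_spans()
    assert [s.name for s in spans] == ["process_order"]
    assert spans[0].attributes["order.id"] == "order-123"
```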

Break Down Silos

I'd wager most blog posts about DevOps proclaim this same adage, in various contexts. Perhaps the frequency with which this directive is echoed is a signal of how extremely difficult it is to achieve. If there is one point of reflection to prioritize above all others, I would say this is it.

It would be extremely difficult, and irresponsible, to prospectively claim that there's "one weird trick" to breaking down silos for your specific challenges. To use my favorite phrase in consulting, the reality is, "it depends."

I will say, though, that my recent experience with the practices of DevOps, in all the forms that implement them, be it Site Reliability Engineering, Platform Engineering, Cloud Engineering, or any other form of technology operations, is that we have a serious problem with silo-building within this subdiscipline.

The promise of DevOps was that we would bridge the chasm between writing production code and actually getting it out into the wild. My guess is that many of our teams are working away in complete ignorance of the carefully crafted silo they've inadvertently created for themselves.

This form of reflection demands the most honest scrutiny. Consider a few prompts:

How aware is your DevOps team of other teams' roadmaps, OKRs, and development practices?

When was the last time you sat down and paired with an engineer on another team to help them debug an issue they've found in their code in a pre-production environment?

Do the senior, staff, and principal engineers on your DevOps teams attend stand-ups, planning, and retros across the organization?

When was the last time you embedded one of your Cloud Engineers on a front end team?

As much as I hate to admit it, this point of reflection is arguably the one most negatively impacted by technology. For all of the other points I've raised, the challenges can be eased by thoughtfully applying technology to make the practice more efficient.

This one, though, is driven by humans, and for humans. But, as with all of these reflection points, the benefits of the investment are massive.

Another critical outage is absolutely on the horizon. Don't waste your time on schadenfreude. Reassess your DevOps practices and prepare for the coming storm. Even if your own outages pale in comparison to the impact of the CrowdStrike incident, yours is the outage that matters most, because it is yours. Plan accordingly.

Micah Adams is a Principal DevOps Engineer at Focused Labs
