Top Kubernetes Challenges Facing DevOps Teams
October 21, 2025

Udi Hofesh
Komodor

With nearly 80% of organizations now running Kubernetes in production, adoption is widespread across industries. Yet the 2025 Komodor Enterprise Kubernetes Report shows that while Kubernetes itself is mature, enterprise operations often are not.

For DevOps teams, the findings highlight the realities of running Kubernetes at scale: instability from constant change, widespread overspending, tool sprawl, and persistent skills gaps. Let's dig into the trends that matter most for practitioners.

Change Is Still the Leading Cause of Instability

The data confirms what many engineers already know: most outages start with a change. According to the report, 79% of production incidents originate from a recent system change. Whether it is a new deployment, a manual tweak to the environment, or a third-party failure, change remains the leading cause of instability.

Even worse, detection and recovery remain slow. The median time to detect a high-impact outage is 37 minutes, with resolution taking another 51 minutes on average. That adds up to nearly 270 hours per year spent on incident detection and resolution.
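As a back-of-the-envelope check, the figures above can be combined in a few lines of Python. This is only a sketch: it assumes the annual total is simply the per-incident time multiplied by the number of incidents, which the article does not state, so the implied incident count is an illustration rather than a number from the report.

```python
# Figures quoted in the article.
DETECT_MINUTES = 37    # median time to detect a high-impact outage
RESOLVE_MINUTES = 51   # additional time to resolve it
ANNUAL_HOURS = 270     # "nearly 270 hours per year"


def minutes_per_incident(detect: int, resolve: int) -> int:
    """Detection plus resolution time for a single incident."""
    return detect + resolve


def implied_incidents_per_year(annual_hours: float, per_incident_minutes: float) -> float:
    """Incident count implied IF annual hours = incidents x per-incident time.

    That relationship is an assumption made for this sketch.
    """
    return annual_hours * 60 / per_incident_minutes


per_incident = minutes_per_incident(DETECT_MINUTES, RESOLVE_MINUTES)
incidents = implied_incidents_per_year(ANNUAL_HOURS, per_incident)

print(per_incident)      # 88
print(round(incidents))  # 184
```

Under that assumption, 88 minutes per incident and roughly 270 hours per year correspond to about one high-impact incident every two days.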

In environments focused on continuous delivery, this is a stark reminder that speed without guardrails is expensive. Policy as code, automated drift detection, and GitOps practices are becoming essential to prevent misconfigurations before they hit production.
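To make drift detection concrete, here is a minimal Python sketch that compares a desired configuration (as it might be stored in Git) with the live state of a workload. The dictionaries, field names, and image tag are hypothetical; a real setup would read both sides from a repository and the cluster API rather than from literals.

```python
from typing import Any


def find_drift(desired: dict[str, Any], live: dict[str, Any], path: str = "") -> list[str]:
    """Return human-readable differences between desired and live config."""
    drift: list[str] = []
    for key in sorted(set(desired) | set(live)):
        where = f"{path}.{key}" if path else key
        if key not in live:
            drift.append(f"{where}: missing in live")
        elif key not in desired:
            drift.append(f"{where}: unexpected in live")
        elif isinstance(desired[key], dict) and isinstance(live[key], dict):
            # Recurse into nested sections such as resources.
            drift.extend(find_drift(desired[key], live[key], where))
        elif desired[key] != live[key]:
            drift.append(f"{where}: desired={desired[key]!r} live={live[key]!r}")
    return drift


# Hypothetical example data: someone scaled the workload by hand
# and added a memory setting that is not in version control.
desired = {"replicas": 3, "image": "shop/api:1.4.2", "resources": {"cpu": "500m"}}
live = {"replicas": 5, "image": "shop/api:1.4.2", "resources": {"cpu": "500m", "memory": "1Gi"}}

for line in find_drift(desired, live):
    print(line)
```

Run on a schedule or as a pipeline step, a check like this turns a silent manual change into a reviewable finding before it contributes to an incident.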

Overspending Is Widespread

The report found that more than 82% of Kubernetes workloads are overprovisioned. Over 65% consume less than half of their requested CPU and memory. In plain terms, teams are asking for far more capacity than they use, and the cloud bill reflects it.
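The "less than half" threshold is easy to express in code. The Python sketch below flags workloads whose CPU and memory usage both fall below 50% of what they request. The workload names and numbers are invented for illustration, and the usage values stand in for whatever average a metrics system would supply.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    cpu_requested_m: int   # requested CPU, in millicores
    cpu_used_m: int        # observed CPU usage, in millicores
    mem_requested_mi: int  # requested memory, in MiB
    mem_used_mi: int       # observed memory usage, in MiB


def utilization(w: Workload) -> tuple[float, float]:
    """Return (cpu, memory) usage as a fraction of the request."""
    return w.cpu_used_m / w.cpu_requested_m, w.mem_used_mi / w.mem_requested_mi


def uses_less_than_half(w: Workload) -> bool:
    """True when both CPU and memory usage are under 50% of the request."""
    cpu, mem = utilization(w)
    return cpu < 0.5 and mem < 0.5


# Hypothetical workloads.
workloads = [
    Workload("api", 1000, 200, 2048, 600),
    Workload("worker", 500, 450, 1024, 900),
    Workload("cache", 2000, 300, 4096, 3500),
]

flagged = [w.name for w in workloads if uses_less_than_half(w)]
print(flagged)  # ['api']
```

Note that "cache" is not flagged even though its CPU request is far too high, because its memory request is well used. Whether to treat the two resources jointly or separately is a policy choice.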

This is not just about waste. Overprovisioning distorts scheduling, hides hotspots, and complicates scaling decisions. At the same time, only 7% of workloads have accurate requests and limits.

The findings underscore the importance of right-sizing. Properly configured admission controllers, node autoscaling, and predictive policies need to become part of the deployment pipeline, not an afterthought.
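One simple way to approach right-sizing is to derive a request from observed usage: take a high percentile of recent samples and add headroom. The Python sketch below does that with a nearest-rank percentile. It is a stand-in heuristic, not the method used in the report or by any particular autoscaler, and the sample values, percentile, and 20% headroom are assumptions.

```python
import math


def percentile(samples: list[int], pct: int) -> int:
    """Nearest-rank percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def recommend_request(samples: list[int], pct: int = 95, headroom_pct: int = 20) -> int:
    """Suggest a request: the chosen percentile of usage plus headroom, rounded up."""
    return math.ceil(percentile(samples, pct) * (100 + headroom_pct) / 100)


# Hypothetical CPU usage samples in millicores, including one spike.
cpu_samples_m = [120, 130, 125, 140, 150, 135, 160, 145, 155, 400]

print(recommend_request(cpu_samples_m))          # 480
print(recommend_request(cpu_samples_m, pct=90))  # 192
```

The gap between the two results shows how much a single spike can drive the recommendation, which is why the percentile and the observation window deserve as much review as the headroom.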

Tool Sprawl vs. Unified Observability

Most organizations use more than one monitoring or APM tool, but the report found that tool sprawl leads to fragmented data and alert fatigue rather than true observability. Teams using unified telemetry platforms report fewer outages, faster recovery, and even lower costs.

This is one area where DevOps practices can make or break operations. Building a single pipeline for logs, metrics, traces, and events not only reduces noise but also provides the data foundation for AIOps. AI-assisted operations, already used by 35% of organizations and planned by another 40% by 2026, depend on integrated observability to deliver accurate anomaly detection and root cause analysis.
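The idea of a single pipeline can be sketched in Python as normalizing every signal into one record shape and merging the per-tool streams into a single timeline. The Signal fields, service names, and sample data below are hypothetical, and each input stream is assumed to be sorted by time already.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    timestamp: float  # seconds, on a shared clock
    kind: str         # "log", "metric", "trace", or "event"
    service: str
    detail: str


def merge_streams(*streams: list[Signal]) -> list[Signal]:
    """Merge time-sorted streams into one time-ordered timeline."""
    return list(heapq.merge(*streams, key=lambda s: s.timestamp))


def around(timeline: list[Signal], service: str, at: float, window_s: float) -> list[Signal]:
    """Signals for one service within window_s seconds of a point in time."""
    return [s for s in timeline if s.service == service and abs(s.timestamp - at) <= window_s]


# Hypothetical data from three separate tools.
logs = [
    Signal(100.0, "log", "checkout", "connection refused"),
    Signal(400.0, "log", "search", "slow query"),
]
metrics = [Signal(95.0, "metric", "checkout", "error_rate=0.31")]
events = [Signal(90.0, "event", "checkout", "deployment rolled out")]

timeline = merge_streams(logs, metrics, events)
context = around(timeline, "checkout", at=100.0, window_s=30)
print([s.kind for s in context])  # ['event', 'metric', 'log']
```

With the signals on one timeline, the deployment event, the error-rate jump, and the failing log line appear next to each other, which is the kind of correlation that anomaly detection and root cause analysis rely on.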

GitOps and Helm Dominate

The report confirms that GitOps has become the default deployment model, with 84% of organizations using tools like ArgoCD and Flux. Helm remains nearly universal, with 95% of teams templating and packaging applications with charts.

This validates the shift toward declarative, version-controlled infrastructure. The combination of Helm and GitOps is becoming the "duopoly" for managing delivery pipelines, offering both speed and repeatability. But it also increases the pressure to maintain secure templates and enforce consistent policies across clusters, and keeping chart versions current and consistent across environments remains a constant struggle.
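Chart version consistency is one of the easier problems to check automatically. The Python sketch below takes an inventory of which chart version is deployed in each environment and reports the charts that differ. The chart names, environments, and versions are made up, and building the inventory from real clusters is left out.

```python
def version_skew(deployed: dict[str, dict[str, str]]) -> dict[str, dict[str, str]]:
    """Return the charts whose deployed version differs between environments.

    deployed maps chart name -> {environment: chart version}.
    """
    return {
        chart: envs
        for chart, envs in deployed.items()
        if len(set(envs.values())) > 1
    }


# Hypothetical inventory.
deployed = {
    "ingress": {"dev": "4.11.2", "staging": "4.11.2", "prod": "4.10.0"},
    "payments": {"dev": "1.8.0", "staging": "1.8.0", "prod": "1.8.0"},
}

skew = version_skew(deployed)
print(sorted(skew))             # ['ingress']
print(skew["ingress"]["prod"])  # 4.10.0
```

Some skew is expected while a release moves through environments, so a practical check would likely allow a grace period before reporting it.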

Scale, Hybridity, and the Edge

The average enterprise now runs more than 20 clusters, with nearly half spanning four or more environments, including on-premises, public clouds, and edge. For some, that means 100 or even 1,000 clusters.

Hybrid cloud has become the default operating model. Lightweight Kubernetes distributions are pushing workloads to the edge, while AI and ML workloads, especially inference, are driving demand for GPU scheduling.

This means multi-cluster complexity is here to stay. Policy drift, inconsistent configs, and cross-environment failover are daily risks. GitOps, multi-cluster observability, and standardized golden paths for developers are becoming survival strategies.

The Persistent Skills Gap

Perhaps the most sobering finding: skills remain the number one bottleneck. Even with widespread adoption of GitOps and platform engineering, the lack of experienced Kubernetes practitioners slows deployments, complicates cost management, and extends outages.

This explains why 68% of organizations have created dedicated platform teams.

What This Means for DevOps

The report's findings expose the clear need for greater operational discipline, which is less about introducing new tools and more about using existing practices consistently and at scale. Here are some best practices to consider:

■ Harden the change pipeline: Enforce policy as code and automated drift detection.

■ Right-size workloads: Use autoscaling policies and enforce CPU and memory requests.

■ Unify observability: Build a single telemetry pipeline and prepare for AI-assisted ops.

■ Codify incident workflows: Version control runbooks, automate remediations, and rehearse failover.

■ Build golden paths: Provide developers with secure, pre-approved templates and workflows.

It's clear that Kubernetes has grown up. And now it's time for operations to catch up.

Udi Hofesh is Technical Product Marketing & Developer Relations Manager at Komodor