Applying Chaos Engineering to Traditional Systems

September 01, 2021

Tony Perez
Skytap

Imagine: it's 2011 and Netflix has introduced Chaos Monkey, a tool that injects arbitrary failures into their cloud architecture to pinpoint design flaws. Today, resiliency engineering has advanced so much that "Chaos Engineer" is an actual job title. Enterprises such as Amazon, Facebook and Google now use chaos to understand their architectures and distributed systems.

While chaos engineering is usually performed on cloud-native software, it can also be used to strengthen the dependability of traditional data center applications that may never move to the cloud. What kind of tests might you run on these applications? Some might be:

■ Low network bandwidth and/or high latency

■ Disk volumes full

■ Application code failure

■ Database/server down

■ Expired certificate(s)

■ Hardware failures

This can be accomplished by using the cloud, which allows IT to create a production-like environment that includes the exact application components of the original. The complete set of technical infrastructure that represents an application is called an "environment." Chaos testing can be performed on the cloud replica without affecting production.

It isn't necessary to convert components to "cloud-native" with this approach; simply lift-and-shift, keep the same lines of application code, and use the same servers as the original. To get the most value from chaos testing, reuse in the cloud the RFC 1918 address spaces you're using on-prem. Every major cloud service has some type of network address translation (NAT) system, which lets cloud-based environments that use cloned address spaces communicate with on-prem resources without IP address collisions.
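Before cloning address spaces, it can help to confirm which on-prem subnets actually fall inside the RFC 1918 private ranges. The following is a minimal sketch using Python's standard `ipaddress` module. The function names `is_rfc1918` and `reusable_subnets` and the sample subnets are illustrative assumptions, not part of any cloud provider's tooling.

```python
import ipaddress

# The three private IPv4 blocks defined by RFC 1918.
RFC1918_BLOCKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_rfc1918(cidr: str) -> bool:
    """Return True if the whole subnet lies inside one RFC 1918 block."""
    subnet = ipaddress.ip_network(cidr)
    if subnet.version != 4:
        # RFC 1918 only covers IPv4 address space.
        return False
    return any(subnet.subnet_of(block) for block in RFC1918_BLOCKS)


def reusable_subnets(on_prem_subnets):
    """Filter a list of CIDR strings down to the RFC 1918 ones."""
    return [cidr for cidr in on_prem_subnets if is_rfc1918(cidr)]


if __name__ == "__main__":
    # Hypothetical on-prem subnets, for illustration only.
    candidates = ["10.20.0.0/16", "172.32.0.0/16", "192.168.10.0/24"]
    print(reusable_subnets(candidates))
```

Subnets that fail the check are publicly routable and would need separate handling before being cloned into a test environment.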

Setting Up Your Chaos Testing Workflow

One reason to use the cloud for chaos testing on-prem applications is the ability to reset the system quickly between test rounds. Your goal should be to reset or re-create the system in the cloud fast enough that you can run several chaos test scenarios back to back without losing time between them. Here is a workflow to prepare for chaos testing.

1. Import your on-prem environment to the cloud.

2. Once it is running, save the application as a template so you can create clones on demand.

When importing the on-prem environment to the cloud, the goal is to duplicate the original on-prem system exactly. All VMs, networks, storage volumes and their data must be included.

Next, your test workflow is:

1. Deploy a duplicate application from your template/scripts.

2. Run your chaos tests and collect the results.

3. Once tests are complete, delete the entire test environment.

4. When you're ready for the next test, return to step #1.
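The four steps above can be sketched as a simple loop. This is a hedged illustration, not any vendor's API: `deploy`, `run_test` and `destroy` are hypothetical callables that you would implement against your own cloud provider's tooling or infrastructure-as-code scripts.

```python
from typing import Any, Callable, Dict, Iterable


def run_chaos_rounds(
    scenarios: Iterable[str],
    deploy: Callable[[], Any],
    run_test: Callable[[Any, str], Any],
    destroy: Callable[[Any], None],
) -> Dict[str, Any]:
    """Run each scenario against a fresh clone and collect the results."""
    results: Dict[str, Any] = {}
    for scenario in scenarios:
        env = deploy()  # Step 1: deploy a duplicate from the template.
        try:
            # Step 2: run the chaos test and record what happened.
            results[scenario] = run_test(env, scenario)
        except Exception as exc:
            # A crash in the test harness is recorded, not hidden.
            results[scenario] = f"test harness error: {exc}"
        finally:
            # Step 3: always delete the environment, even after a failure.
            destroy(env)
        # Step 4: the loop returns to step 1 for the next scenario.
    return results


if __name__ == "__main__":
    # Stand-in implementations, for illustration only.
    outcome = run_chaos_rounds(
        ["disk volume full", "expired certificate"],
        deploy=lambda: "clone",
        run_test=lambda env, name: f"{name}: ran on {env}",
        destroy=lambda env: None,
    )
    print(outcome)
```

Putting the teardown in a `finally` block reflects the point of the workflow: every round starts from a clean clone, regardless of how badly the previous round ended.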

While cloud-based infrastructure won't totally mirror on-prem environments, there are some workarounds. Say the design and size of your storage array (SAN) can't be duplicated in the cloud, meaning you can't test by "failing the SAN." In this instance, you could disconnect or alter a disk attached to a VM to mimic the failure, all in the cloud.

Resetting Your Test Environments After Use

By replicating your traditional on-prem application in the cloud, you can run aggressive tests to determine solutions to common issues, thereby extending the life of the application. However, when the testing eventually ruins the cloud-based application clone, how will you reset for future test rounds? Manually fixing things can take ages, but with cloud-based testing you unlock an unending supply of clones.

Different clouds approach this in different ways. No matter the strategy, the aim is to quickly rebuild a ready-to-use set of infrastructure and application components representing the original application. Companies already doing "infrastructure-as-code" may have the tooling and scripts to replicate the system from nothing.

Note that cloning IP address space is hard to do on-prem; even so, don't be tempted to "re-IP" servers (reassign their IP addresses and hostnames) to prevent collisions with the originals. Doing so essentially changes the representation of the original system, so your chaos tests may produce incorrect results due to mismatched hostnames and IP addresses.
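One way to guard against an accidental re-IP is to compare the clone's hostname-to-address mapping with the original's before running any tests. The sketch below assumes you can export both inventories as simple dictionaries; the function name `find_identity_drift` and the sample hosts are hypothetical.

```python
from typing import Dict, List


def find_identity_drift(original: Dict[str, str], clone: Dict[str, str]) -> List[str]:
    """Compare hostname -> IP mappings and describe every mismatch."""
    drift = []
    for host in sorted(original):
        if host not in clone:
            drift.append(f"{host}: missing from clone")
        elif clone[host] != original[host]:
            drift.append(f"{host}: expected {original[host]}, found {clone[host]}")
    # Hosts that exist only in the clone also change the system's shape.
    for host in sorted(set(clone) - set(original)):
        drift.append(f"{host}: present in clone only")
    return drift


if __name__ == "__main__":
    # Hypothetical inventories, for illustration only.
    on_prem = {"app01": "10.0.1.10", "db01": "10.0.1.5"}
    cloud_clone = {"app01": "10.0.2.10", "db01": "10.0.1.5"}
    for problem in find_identity_drift(on_prem, cloud_clone):
        print(problem)
```

An empty result suggests the clone kept the original hostnames and addresses, so test results are less likely to be skewed by identity mismatches.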

What once seemed impossible is actually a simple, elegant approach to improving on-prem applications that will never see the cloud. The cloud provides a 24/7 sandbox for you to create and destroy things, then quickly recover without risking your original systems. This approach works for original application systems of record, disaster recovery systems, and software development pipelines, making it a one-stop testing shop for traditional applications.

Tony Perez is a Cloud Solutions Architect at Skytap
