Chaos Monkey Hiring Guide

In most industries, chaos is a negative thing. In chaos engineering, though, chaos is a practical and insightful tool. By deliberately injecting failures, chaos engineering helps developers build computer systems that are more resilient and have fewer weak points than systems validated through traditional testing alone. Chaos Monkey is one of the most widely known tools for creating such “chaos.”

Netflix’s engineering team developed Chaos Monkey after moving their systems to the cloud in 2010. In this new cloud-based environment, hosts could be terminated and replaced at any time, so services had to be prepared for that constraint. The team therefore began testing by randomly terminating their own hosts. This allowed Netflix to find weaknesses while also validating that their remediation automation worked correctly. The approach has since driven demand for Chaos Monkey development services.

Hiring Guide

Netflix designed Chaos Monkey as their own approach to chaos engineering: it tests system stability by forcing pseudo-random failures of services and instances within their cloud architecture. Through this intentionally created chaos, developers and engineers can see how systems respond when critical components of their infrastructure are taken down.

At its core, chaos engineering, and Chaos Monkey in particular, tells developers how well a system shifts its resources when faced with an outage. This is especially helpful for cloud deployments such as those running on Amazon Web Services. Chaos Monkey randomly terminates virtual machine instances and containers that run inside a production environment to expose failures more frequently and help teams build resilient services.
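To make the idea of random termination concrete, here is a minimal simulation sketch. It does not call Chaos Monkey or any cloud API: the group names, instance ids, and the `terminate_random_instance` function are hypothetical, and the rule of skipping single-instance groups is an assumption of this sketch rather than documented Chaos Monkey behavior.

```python
import random


def terminate_random_instance(groups, rng):
    """Remove one randomly chosen instance from each eligible group.

    `groups` maps a (hypothetical) group name to a list of instance ids.
    Returns a dict of group name -> id of the instance that was removed.
    """
    terminated = {}
    for name, instances in groups.items():
        if len(instances) <= 1:
            # Assumption of this sketch: never take down a group's last instance.
            continue
        victim = rng.choice(instances)
        instances.remove(victim)
        terminated[name] = victim
    return terminated


rng = random.Random(42)  # seeded so the simulation is repeatable
groups = {
    "api": ["api-1", "api-2", "api-3"],
    "worker": ["worker-1", "worker-2"],
    "batch": ["batch-1"],
}
terminated = terminate_random_instance(groups, rng)

for name, victim in terminated.items():
    print(f"{name}: terminated {victim}, {len(groups[name])} instance(s) left")
```

In a real environment, the interesting question is what happens next: whether traffic shifts to the surviving instances and whether replacement capacity comes up without manual intervention.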

A configurable schedule allows simulated failures to occur at specified times so that developers can closely monitor them. This helps teams prepare for major, unexpected errors rather than simply waiting for a catastrophe and reacting after the fact. Chaos engineering generally follows four steps in testing:

Engineers and developers define the “steady state” as a measurable output of a system to set as a baseline for normal behavior.
Teams then hypothesize that this steady state will continue in both the control and experimental groups during failure simulation.
The engineers introduce variables to reflect issues and real-world events that would cause catastrophic failure, such as crashes, hard drive malfunctions, severed network connections, and so on.
After witnessing the system reactions, the team then tries to disprove the hypothesis by looking for differences between the control and experimental groups.
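The four steps above can be sketched as a small simulation. Everything here is illustrative: the success rates, the tolerance, and the function names are assumptions chosen for the example, and the "system" is just a random number generator rather than real infrastructure.

```python
import random


def simulate_requests(n, failure_injected, rng):
    """Return a list of booleans, one per request (True = success)."""
    # Assumed rates for this simulation only.
    success_rate = 0.97 if failure_injected else 0.99
    return [rng.random() < success_rate for _ in range(n)]


def run_experiment(tolerance=0.05, n=10_000, seed=7):
    rng = random.Random(seed)

    # Step 1: the steady state is a measurable output, here the success ratio.
    # Step 2: hypothesis: the ratio stays within `tolerance` in both groups.
    control = simulate_requests(n, failure_injected=False, rng=rng)

    # Step 3: introduce a failure variable in the experimental group only.
    experimental = simulate_requests(n, failure_injected=True, rng=rng)

    control_rate = sum(control) / len(control)
    experimental_rate = sum(experimental) / len(experimental)

    # Step 4: try to disprove the hypothesis by comparing the two groups.
    difference = abs(control_rate - experimental_rate)
    return {
        "control_rate": control_rate,
        "experimental_rate": experimental_rate,
        "difference": difference,
        "hypothesis_holds": difference <= tolerance,
    }


result = run_experiment()
print(result)
```

Tightening `tolerance` makes the hypothesis easier to disprove, which is the point: the experiment is only informative if it can fail.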

Generally, the more difficult it is to disrupt the steady state of the system, the more confidence businesses and development teams can have in its uptime and user experience. The field of chaos engineering, and Chaos Monkey specifically, is still relatively new, but engineers with these testing skills are in demand at larger companies that need to know their systems stay operational regardless of the situation or the external factors associated with cloud computing.

Interview Questions

How does an engineer build a hypothesis to test with Chaos Monkey around steady-state behavior?

The engineer should focus on the measurable output of the system rather than on its internal attributes. The overall system’s error rates, latency percentiles, throughput, and similar measurements are all possible metrics for determining steady-state behavior.

Measurements of that output over a relatively short period of time serve as a proxy for the system’s steady state. By focusing on systemic behavior patterns during experiments, “chaos” verifies that the system does work, rather than trying to validate how it works.
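As a rough illustration of those metrics, the sketch below computes an error rate, throughput, and latency percentiles from one short observation window. The sample data and the `steady_state_metrics` function are hypothetical; a real system would pull these values from its monitoring stack.

```python
from statistics import quantiles


def steady_state_metrics(samples, window_seconds):
    """Summarize one observation window.

    `samples` is a list of (latency_ms, ok) tuples, one per request.
    """
    latencies = [latency for latency, _ in samples]
    errors = sum(1 for _, ok in samples if not ok)
    # quantiles(..., n=100) returns 99 cut points:
    # index 49 is the 50th percentile, index 98 is the 99th.
    cuts = quantiles(latencies, n=100)
    return {
        "error_rate": errors / len(samples),
        "throughput_rps": len(samples) / window_seconds,
        "p50_ms": cuts[49],
        "p99_ms": cuts[98],
    }


# Hypothetical window: 100 requests in 10 seconds, every 20th request fails.
samples = [(float(i), i % 20 != 0) for i in range(1, 101)]
metrics = steady_state_metrics(samples, window_seconds=10)
print(metrics)
```

A baseline captured this way gives the hypothesis something concrete to be compared against once failures are injected.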

Is it advisable to manually or automatically run experiments in Chaos Monkey and chaos engineering?

Although manually executing experiments helps developers create and witness system reactions, it’s labor-intensive and ultimately not scalable or sustainable for a team. A better practice in chaos engineering is to automate experiments and run them on a continuous basis. Chaos engineering typically builds automation into the system to drive both experiment creation and result analysis.

What are some of the cons of using Chaos Monkey?

Chaos Monkey is incredibly beneficial but does come with some drawbacks. It requires MySQL 5.X and doesn’t support deployments managed by anything other than Spinnaker. Its scope of testing is also limited: it injects only one type of failure, random instance termination, so it does not cover the “long tail” of other failures that software can experience during its lifecycle.

Chaos Monkey also doesn’t have a real user interface and must be run through the command line, scripts, and configuration files. Arguably, its biggest downside is that it offers no recovery capabilities. For that reason, chaos engineering encourages teams to start with the smallest possible experiments to contain repercussions, and to work their way up from there to prevent total system failure.
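One way to picture starting small and working up is the escalation loop sketched below. It is not part of Chaos Monkey: the `FakeSystem` class, the stage fractions, the error threshold, and the `recover` hook are all hypothetical, and because Chaos Monkey offers no recovery of its own, the recovery step is assumed to be something the team supplies.

```python
class FakeSystem:
    """Stand-in for real infrastructure; the error model is invented."""

    def __init__(self):
        self.impaired_fraction = 0.0

    def inject(self, fraction):
        self.impaired_fraction = fraction

    def error_rate(self):
        # Assumed relationship: errors grow linearly with impaired capacity.
        return self.impaired_fraction * 0.2

    def recover(self):
        self.impaired_fraction = 0.0


def escalate(system, stages, max_error_rate):
    """Run stages smallest-first and abort once errors pass the threshold."""
    completed = []
    for fraction in sorted(stages):
        system.inject(fraction)
        if system.error_rate() > max_error_rate:
            system.recover()  # team-owned recovery, not a Chaos Monkey feature
            return {"completed": completed, "aborted_at": fraction}
        completed.append(fraction)
    system.recover()
    return {"completed": completed, "aborted_at": None}


system = FakeSystem()
outcome = escalate(system, stages=[0.25, 0.01, 0.5, 0.05], max_error_rate=0.04)
print(outcome)
```

The loop stops at the first stage that breaches the threshold, so the largest stage never runs against a system that has already shown it cannot tolerate a smaller one.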

Job Description

We are seeking an experienced engineer responsible for chaos engineering through the use of Chaos Monkey. This position includes the design and execution of chaos and load testing to stress test highly performant systems, software, and applications.

The right candidate will use their knowledge of application frameworks and containerization technologies to design, manage, and maintain the programs that stress critical systems and determine their sustainability and reliability. You should be a highly motivated individual, ready to run tests quickly while working in an agile fashion to deliver reliable solutions that meet business needs.

Job Responsibilities

Maintain and improve company-wide chaos and reliability testing of technology platforms
Run full-scale critical path testing against all platforms
Create performance plans and models for highly scalable, low-latency, highly available application and infrastructure systems
Actively contribute to capacity planning and disaster recovery preparation exercises
Monitor application performance, resolve performance bottlenecks, and manage usage to create capacity models
Partner with development teams to identify and create fallback plans for critical scenarios

Job Qualifications

Bachelor’s degree in Computer Science
5+ years of experience in software engineering, MySQL, Golang, and relevant programming languages
4+ years of relevant work experience in chaos engineering
In-depth knowledge of Chaos Monkey best practices within various domains, including applications, networks, and databases
Experience with monitoring strategies, including real-user, synthetic, and network connection monitoring