You’re migrating your database. The process has taken months to take shape: hundreds of hours sunk into building a compatibility pipeline, dozens of developers manually reviewing the data, weekend meetings to hit the deadline. And then, during migration, a power surge brings down the system…
No one saw it coming, but that one unplanned-for event can sabotage the whole project, or worse, cause irreparable damage to your database.
Humans are wired to be wary of the future; it’s a biological tool that keeps us alive in the face of new experiences. But even then, we rarely, if ever, think of the worst-case scenario.
But, in the words of Augustus De Morgan, “The first experiment already illustrates a truth of the theory, well confirmed by practice, whatever can happen will happen if we make trials enough”. If it sounds vaguely familiar, it’s because De Morgan is often credited as one of the first to talk about Murphy’s law: “Anything that can go wrong will go wrong.”
Highly Unlikely Isn’t Impossible
A friend of mine jokingly says that when you go live, the question isn’t “is it going to explode?” but rather “what color is the explosion going to be?” Joking aside, there is more than a bit of truth to his words.
Development environments often try to mirror production as closely as possible, or vice-versa. Unfortunately, computer systems are extremely sensitive, so even the smallest difference between the two can have far-reaching consequences, and that’s just the tip of the iceberg.
Systems break down, internet connections have latency, servers crash, hard drives fail, data gets corrupted. These things happen, and sometimes we have very little control over when or how. Remember back in 2011 when hundreds of users lost their data because of an AWS crash?
Is it likely to happen again? No, but is it possible? Yes, and that’s the reason why engineers always design failsafe systems. If NASA hadn’t designed a failsafe for their failsafe, Apollo 11’s lunar landing might well have ended in an abort because of a computer overload.
Apollo’s error code 1202 meant that the onboard computer was overloaded with tasks. Fortunately, NASA’s programmers had foreseen this possibility and built a recovery mechanism that quickly restarted the software, shedding low-priority tasks and freeing memory for the critical calculations.
Minimizing Recovery Time
The moon landing story is a prime example of what modern engineers call optimizing MTTR, or mean time to recovery. If catastrophes cannot be avoided, then our solution is to minimize the time it takes to bring the systems back up.
Let’s put it like this: imagine two competing businesses. Corporation A experiences several system outages throughout the day, while corporation B experiences one single outage. Without any further information, everyone would rather be corporation B.
But, let’s say corporation A’s MTTR (mean time to recovery) is somewhere around 20 seconds, while corporation B’s is somewhere around 4 to 6 hours. If corporation A had 20 outages throughout the day, it would have had just under 7 minutes of total downtime. Suddenly, the frequency of system outages seems a lot less important than the time it takes to recover from each one.
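The comparison above is simple multiplication; a quick back-of-the-envelope sketch (using the hypothetical numbers from the example) makes the point concrete:

```python
# Back-of-the-envelope downtime comparison (hypothetical numbers
# from the corporation A vs. corporation B example).

def total_downtime_seconds(outages: int, mttr_seconds: float) -> float:
    """Total downtime = number of outages x mean time to recovery."""
    return outages * mttr_seconds

# Corporation A: 20 outages, but each recovered in ~20 seconds.
corp_a = total_downtime_seconds(outages=20, mttr_seconds=20)
# Corporation B: a single outage, but ~5 hours to recover.
corp_b = total_downtime_seconds(outages=1, mttr_seconds=5 * 3600)

print(f"A: {corp_a / 60:.1f} min, B: {corp_b / 3600:.1f} h")
# Many fast recoveries beat one slow recovery by a wide margin.
```

Despite having twenty times as many incidents, corporation A’s cumulative downtime is a rounding error next to corporation B’s single long outage.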
How do you minimize recovery time? Well, one of the first things to do is to purposefully crash your own system. This is called a controlled failure. While it may sound counterintuitive, once you think about it, it starts to make a lot of sense.
In a controlled failure, you announce a date and time when the system is going to fail; the nature of the failure itself isn’t revealed, so the team has to diagnose the problem and get the system up and running as fast as possible.
While this is happening, we monitor system data before, during, and after the failure. This is meant to help with the recovery effort, but it also provides data for subsequent analysis and improvement.
This kind of exercise opens the door to new insights, as the team discovers the effects of unexpected flaws within the system. It’s a shift of perspective via shock therapy, as the supposedly failsafe system suddenly reveals just how fragile it actually is.
With the insights from these exercises, you can engineer new procedures with a clearer understanding of the system’s flaws. While the exercise might be stressful, the outcome is more than worth it. And frankly, it can be one of the most intellectually challenging exercises for a development team.
Enter Chaos Testing
Controlled failure exercises are just the tip of the iceberg and a basic introduction to chaos testing, also known as chaos engineering. At its core, chaos testing is building the capability to continuously, but randomly, cause failures in your production system.
Chaos engineering was a core strategy of streaming giant Netflix. Its engineering team built a wide array of “chaos monkeys”, potential failures that could spring up at any minute, from injected latency to a simulated outage of an entire Amazon Web Services region.
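A toy sketch gives the flavor of the idea. This is not Netflix’s actual tool (the real Chaos Monkey terminates live cloud instances); it just illustrates the core move of picking a random healthy target and failing it on purpose:

```python
import random

# Toy "chaos monkey": pick one healthy instance at random and mark it
# as failed, so the team can verify the system survives the loss.
# (Illustrative only -- a real chaos tool acts on live infrastructure,
# not on dictionary entries.)

def unleash_monkey(instances: dict, rng: random.Random) -> str:
    """Fail one random healthy instance and return its name."""
    healthy = [name for name, state in instances.items() if state == "healthy"]
    victim = rng.choice(healthy)
    instances[victim] = "failed"
    return victim

fleet = {"web-1": "healthy", "web-2": "healthy", "web-3": "healthy"}
victim = unleash_monkey(fleet, random.Random())
print(f"Chaos monkey terminated {victim}; fleet is now {fleet}")
```

The randomness is the point: because no one knows which instance will die or when, every part of the system has to tolerate losing any part.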
These chaotic failures, in turn, force your team to shift from defensive development to a more proactive approach. To be more precise, it’s a method for building resiliency into both the system and the team itself.
A resilient system has the flexibility to adapt to catastrophic circumstances, for example, a streaming platform that reroutes its traffic when a sudden latency spike delays data transmission.
A resilient team is flexible and open-minded, capable of quickly adapting and developing new strategies as they face unforeseen problems. Resilient teams tend to see emergencies as opportunities to grow and adapt, rather than fearing them or feeling stressed.
Resilience is something that can be built, both in terms of system and team dynamics, and chaos testing pushes towards that mindset. It’s a bit like those fire drills we had back when we were kids. By simulating a crisis, we become accustomed to it and learn to keep our cool when things spiral out of control.
Keep in mind that this method is extremely demanding for developers, and it’s not recommended for newly formed teams or for small-scale projects. It’s best suited to projects with a lot of moving pieces, in which a single bug can have wide-ranging consequences.
Chaos testing is actively recommended as one of the best ways to build resiliency and drive down MTTR, and it is used by software and engineering giants like IBM.
Induce Chaos to Protect Your Business
Building with chaos may sound like an oxymoron, but one cannot deny the evidence. Netflix runs one of the most resilient systems on the planet, which is a testament to just how effective chaos testing can be when done right.