Netflix, Chaos Monkey, and Preparing for the Worst

September 27, 2024

On April 21, 2011, an entire Amazon Web Services (AWS) availability zone went down, taking a large chunk of the internet down with it. Companies like Reddit, Foursquare, and Quora lost their internet presence with no idea how long it would take to get it back. For some companies, an outage lasting more than an hour can cost hundreds of millions of dollars.

Netflix, one of Amazon’s biggest AWS customers, came through the outage unscathed.

A few years before the outage, the engineers at Netflix decided that it would be good to prepare for the worst. What would happen if some of their servers went down? What should they do if traffic spiked and their resources couldn’t handle it? They settled on a few key technical solutions and implemented them to prepare Netflix for the worst.

One of them is Chaos Monkey.

Chaos Monkey is a program that looks at all AWS resources used by Netflix and starts randomly shutting them down to see what would happen in a controlled environment away from real customers. The initial results were, as intended, chaos. The site went down, features stopped working, and all manner of unforeseeable problems occurred. Netflix now had a long list of issues to fix that were previously hidden. By forcing the system to break, engineers were able to patch newly found holes, create redundancies, and prevent those same outages from happening again. Three years later, when AWS lost an availability zone, Netflix was ready. It took most companies a whole day to be back to their full capacity. For Netflix, one day is 200 million hours of streaming, and their customers didn’t miss a minute of it.
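To make the idea concrete, here is a minimal sketch of the "randomly shut things down and see what breaks" loop. It is not Netflix's actual code and it makes no AWS calls. The fleet is a plain in-memory dictionary, and the names build_demo_fleet, run_chaos_round, and services_down, along with the "api" and "billing" services, are hypothetical and exist only for illustration.

```python
import random


def build_demo_fleet():
    """Return a hypothetical fleet: three 'api' instances, one 'billing' instance."""
    return {
        "i-api-1": {"service": "api", "running": True},
        "i-api-2": {"service": "api", "running": True},
        "i-api-3": {"service": "api", "running": True},
        "i-billing-1": {"service": "billing", "running": True},
    }


def run_chaos_round(fleet, kill_fraction, rng):
    """Stop a random subset of the running instances and return their ids.

    At least one instance is stopped whenever anything is still running.
    """
    running_ids = sorted(i for i, inst in fleet.items() if inst["running"])
    if not running_ids:
        return []
    kill_count = min(len(running_ids), max(1, round(len(running_ids) * kill_fraction)))
    victims = rng.sample(running_ids, kill_count)
    for instance_id in victims:
        fleet[instance_id]["running"] = False  # simulated shutdown only
    return victims


def services_down(fleet):
    """Return the services that have no running instance left."""
    all_services = {inst["service"] for inst in fleet.values()}
    healthy = {inst["service"] for inst in fleet.values() if inst["running"]}
    return sorted(all_services - healthy)


if __name__ == "__main__":
    rng = random.Random()  # pass a seed here to make a run repeatable
    fleet = build_demo_fleet()
    victims = run_chaos_round(fleet, kill_fraction=0.5, rng=rng)
    print("stopped:", victims)
    print("services with no capacity left:", services_down(fleet))
```

In this toy fleet, "billing" runs on a single instance, so any round that happens to stop it takes the whole service down, while "api" only goes down if all three of its instances are picked. A weakness like that stays hidden until something forces the failure.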

Are you prepared for outages?

While we can’t run an automated program to detect all of the potential, unseen problems in our lives, we can create our own personal Chaos Monkey by brainstorming the things that could go wrong in a given situation or project.

In business, this is called a premortem: imagine what can go wrong, then work backwards to identify the source of failure and prevent it. An ounce of temporary pessimism followed by some problem solving and fortification can help you weather future storms and protect the things that are most important to you.

This applies on both a personal and a professional level.

What would happen if you lost your job? If you weren’t admitted to your chosen university? What if a project took twice as long to complete as initially estimated, or someone at work quit without warning? If you identify something that is out of your control, you can prepare yourself to withstand an outage. If you identify something inside your control, you can work to prevent an outage.

The biblical book of Genesis lays out the story of the first Joseph. Joseph can see glimpses of the future by interpreting dreams. His brothers don’t like that very much, so they sell him to some slavers (talk about sibling rivalry). The slavers take him some 250 miles away to Egypt, where he is sold into servitude. Joseph’s talent for interpreting dreams is eventually discovered. A while later, Pharaoh has a pair of dreams that he can’t shake. Seven healthy cows come up out of the Nile River, and seven sickly cows follow and devour them. The same happens with ears of grain. Pharaoh hears about Joseph, the slave who can interpret dreams, and tells him about the dreams.

The result is one of the first recorded premortem discussions in history. Joseph says that there will be seven years of plenty, with lots of crops, cows, rain, and prosperity. After those good years, Egypt will be hit with seven years of famine. If the Egyptians play their cards right, they can prepare during the time of harvest and prevent the famine from ravaging their society.

Failure and success are lagging indicators. Your actions build up over time and lead to a particular result. When you prepare for famine, whether it be of food and water or AWS resources, you can prevent catastrophe.

What things are out of your control? What can you do to prevent catastrophic events from ruining your plans, your systems and your life?

What eventual outcomes are in your control? What measures can you take now to prevent those things from happening?

Are you prepared for outages?
