The Myth of the Root Cause: How Complex Web Systems Fail

Editor’s note: here at Scalyr, robust systems are a topic near and dear to our heart, so we were happy to have the chance to work closely with Mathias on this piece. It’s based on the two-part series “How Complex Web Systems Fail” originally published on his Production Ready mailing list.


Distributed web-based systems are inherently complex. They’re composed of many moving parts — web servers, databases, load balancers, CDNs, and many more — working together to form an intricate whole. This complexity inevitably leads to failure. Understanding how this failure happens (and how we can prevent it) is at the core of our job as operations engineers.

In his influential paper How Complex Systems Fail, Richard Cook shares 18 sharp observations on the nature of failure in complex medical systems. The nice thing about these observations is that most of them hold true for complex systems in general. Our intuitive notions of cause-and-effect, where each outage is attributable to a direct root cause, are a poor fit to the reality of modern systems.

In this post, I’ll translate Cook’s insights into the context of our beloved web systems and explore how they fail, why they fail, how you can prepare for outages, and how you can prevent similar failures from happening in the future.

Failure Is Always Just Around the Corner

Sooner or later, any complex system will fail, and web systems are no exception. Failure can occur anytime and almost anywhere. So you should never get too comfortable.

The complexity of web systems ensures there are multiple flaws — latent bugs — present at any given moment. We don’t — and can’t — fix all of these, both for economic reasons and because it’s hard to picture how individual failures might contribute to a larger incident. We’re prone to think of these individual defects as minor factors, but seemingly minor factors can come together in a catastrophe.

In October 2012, AWS suffered a major outage in its US-East region caused in part by a latent memory leak in the EBS server data collection agent. The leak was seemingly minor, but two more minor issues (the routine replacement of a single data collection server, and the failure of an internal DNS update to redirect traffic away from that replaced server) combined to bring the whole region down for several hours.

Complex systems run as broken systems by default. Most of the time, they continue to work thanks to various resiliency measures: database replicas, server auto scaling, etc. And of course, thanks to good monitoring and alerting, coupled with knowledgeable operators who fix problems as they arise. But at some point systems will fail. It’s inevitable.
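To make the idea of "running as broken" concrete, here is a minimal sketch of one such resiliency measure: a read path that retries a flaky primary database and falls back to a replica. All function names here (`query_primary`, `query_replica`, `resilient_read`) are hypothetical stand-ins, not any real client library; the simulated fault is just for illustration.

```python
import random
import time

def query_primary(sql):
    """Hypothetical primary-database call; sometimes fails transiently."""
    if random.random() < 0.2:  # simulate an intermittent, latent fault
        raise ConnectionError("primary unavailable")
    return f"result of {sql!r} (primary)"

def query_replica(sql):
    """Hypothetical read replica; serves reads when the primary is down."""
    return f"result of {sql!r} (replica)"

def resilient_read(sql, retries=2):
    """Mask a flaky primary: retry briefly, then fall back to a replica.

    The caller never sees the underlying failure -- which is exactly why
    complex systems can run 'broken' for a long time without anyone noticing.
    """
    for attempt in range(retries):
        try:
            return query_primary(sql)
        except ConnectionError:
            time.sleep(0.01 * (2 ** attempt))  # short exponential backoff
    return query_replica(sql)
```

Note the double edge: the fallback keeps users happy, but it also hides the defect from operators unless the retries and fallbacks themselves are monitored.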

After an incident has occurred, a postmortem might find that the system has a history of “almost incidents” and conclude that operators should have recognized the degradation in system performance before it was too late. However, that’s an oversimplified view. System operations are dynamic: operators come and go, failing components are swapped out, and resource usage varies over time.

In its postmortem of the 2012 outage, Amazon pointed out that EBS servers generally made very dynamic use of their memory, so a memory leak in the data collection agent didn’t stand out from normal fluctuations. Both the memory leak and the inability to detect such leaks were fixed following the incident, and it might be tempting to point to those as the root cause. But attribution is not that simple, as you’ll see in a minute.

The Myth of the Root Cause

In complex web systems, there is no root cause. Single-point failures alone are not enough to trigger an incident. Instead, incidents require multiple contributors, each necessary but only jointly sufficient. It is the combination of these causes — often small and innocuous failures like a memory leak, a server replacement, and a bad DNS update — that is the prerequisite for an incident. We therefore can’t isolate a single root cause. One of the reasons we tend to look for a single, simple cause of an outcome is that the full failure is too complex to hold in our heads. So we oversimplify without really understanding the failure’s nature, and seize on one factor as the root cause.

Hindsight bias continues to be the main obstacle to incident investigation. This cognitive bias, also known as the knew-it-all-along effect, describes the tendency of people to overestimate their ability to have predicted an event, despite the lack of objective evidence. Indeed, hindsight bias makes it impossible to accurately assess human performance after an incident. Still, many companies continue to blame people for mistakes when they should really blame — and fix — their broken processes.

The same companies also tend to restrict activities that can cause similar incidents in the future. This misguided attempt to remedy “human error” reminds me, more than anything else, of airport security theater, which has travelers around the world putting their toiletries into little baggies and walking through body scanners barefoot despite little evidence that this actually prevents security incidents.

Poorly-thought-out mitigation measures can do more than waste effort; they can actually make the system more complex, brittle, or opaque. For instance, Amazon could have responded to the 2012 outage by instituting a rule that all internal DNS caches must be flushed after replacing a data collection server. This would prevent an exact repeat of the incident, but could also introduce new ways for the system to fail.

A similar but distinct cognitive error is outcome bias, the tendency to judge a decision by its eventual outcome rather than by its quality at the time it was made. We have to remember that every outcome, successful or not, is the result of a gamble. There are things we believe we know (e.g., because we built the system in such-and-such a way), there are things we don’t know, and there are even things we don’t know we don’t know. The overall complexity of our web systems always poses unknowns. We can’t eliminate uncertainty — the guessing of what might be wrong and what might fix it.

The Role of Human Operators

Regardless of the level of instrumentation and automation, it’s still people — human operators — who keep web systems up and running. They make sure these systems stay within the boundaries of acceptable performance, and failure-free operations are the product of their activities. Most of these activities follow well-known processes (best documented in playbooks), such as reverting a bad deployment. Sometimes, however, a novel approach is required to repair a broken system. This is particularly true of irreversible failures, where you can’t simply undo the action that caused them.

When Amazon experienced an outage of its DynamoDB service in September 2015, the initial triggering event — a network disruption — was quickly repaired. However, the system was then stuck in a state where an integral service was overloaded, and Amazon’s operators were forced to invent a lengthy procedure — on the fly — to steer the system back to a stable state.

The same operators incrementally improve systems — adapt them to new circumstances — so that they can continue to survive in production. These adaptations include, for example:

  • Decoupling of system components to reduce exposure of vulnerable parts to failure.
  • Capacity planning to concentrate resources in areas of expected high demand.
  • Graceful error handling and periodic backups to recover from expected and unexpected faults.
  • Establishing means for early detection of changed system performance.
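The last adaptation — early detection of changed system performance — can be sketched in a few lines. This is a toy illustration, not a production monitor: it flags latency samples that sit far outside a rolling baseline, where real systems would track percentiles over much longer windows.

```python
from collections import deque
from statistics import mean, stdev

class LatencyWatch:
    """Toy early-warning check: flag samples far outside a rolling baseline.

    A sketch only -- the class name, window size, and z-score threshold are
    illustrative choices, not taken from any real monitoring product.
    """
    def __init__(self, window=50, threshold=3.0):
        self.samples = deque(maxlen=window)  # rolling baseline of recent latencies
        self.threshold = threshold           # how many standard deviations is "changed"

    def observe(self, latency_ms):
        """Record a latency sample; return True if it looks anomalous."""
        alert = False
        if len(self.samples) >= 10:  # wait for a minimal baseline first
            mu, sigma = mean(self.samples), stdev(self.samples)
            if sigma > 0 and (latency_ms - mu) / sigma > self.threshold:
                alert = True  # performance changed; investigate early
        self.samples.append(latency_ms)
        return alert
```

Feeding it a steady stream of ~100 ms samples and then a 500 ms spike trips the alert — the kind of signal that, caught early, keeps an “almost incident” from becoming a real one.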

Operators have two roles: building, maintaining, and improving the system; and responding to failures. This duality is not always acknowledged, and it’s important to keep the two in balance. If the “building” role looms too large, there won’t be time to respond properly to failures, and the temptation will be to ignore nascent problems. If failure response takes up too much time, the system will be under-maintained and gradually degrade.

Change Introduces New Forms of Failure

Even deliberate changes to web systems will often have unintended negative consequences. There’s a high rate of change and often a variety of processes leading to those changes. This makes it hard — if not impossible — to fully understand how all the pieces interact under different conditions. This is a major reason why outages are both unavoidable and unpredictable.

Operators have to deal with ever-changing failures due to advances in technology, evolving work organization, and — paradoxically — efforts to eliminate failures. The low rate of major incidents in reliable systems may encourage efforts to eliminate low-consequence but high-frequency failures. But these changes might actually lead to a higher number of new, low-frequency but high-consequence failures. As these failures occur at a low rate, it’s difficult to see which changes have contributed to them.

Google’s April 2016 GCE outage provides an example: an intentional, routine removal of an unused IP block surfaced a latent bug in the automated configuration management system. This bug triggered a fail-safe (ironic, eh?) that removed all IP blocks from the network config and, for 18 minutes, took the whole GCE platform offline. We use automation specifically to avoid errors that could bring the network down — and here that very automation produced the outage as an unintended consequence.

Actions at the Sharp End

More often than not, companies don’t have a clear policy regarding acceptable risks of incidents. In the absence of hard numbers, decisions are often made following someone’s gut feeling.

This ambiguity is resolved by actions at the sharp end of the system. After a disaster has struck in production, we’ll know, for example:

  • Management’s response to failure
  • What went well and what went wrong
  • Whether we need to hire another Site Reliability Engineer
  • Whether we should invest in employee training or better equipment

In other words, we’re forced to think and decide.

Once again, we need to be cautious of hindsight bias and its friends, and never ignore other driving forces, especially production pressure, after an incident has occurred.

Experience with Failure Is Essential

When failure is the exception rather than the rule, we risk becoming complacent. Complacency is the enemy of resilience. The longer you wait for disaster to strike in production — simply hoping that everything will be okay — the less likely you are to handle emergencies well, both at a technical and organizational level. Building resilient systems requires experience with failure. Learning from outages after the fact is important, but it shouldn’t be the only method for acquiring operational knowledge.

Waiting for things to break is not an option. We should rather trigger failures proactively — through intentional actions at the sharp end — in order to prepare for the worst and gain confidence in our systems. This is the core idea behind Chaos Engineering and related practices like GameDay exercises.
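In miniature, a chaos experiment looks something like the sketch below: deliberately kill one instance of a fleet, verify that the steady-state hypothesis (“traffic is still served”) holds, and always restore the system afterward. The `Instance` class and `route` function are hypothetical stand-ins for real infrastructure, not any actual chaos tooling.

```python
import random

class Instance:
    """Hypothetical service instance in a small fleet."""
    def __init__(self, name):
        self.name, self.healthy = name, True

    def handle(self, request):
        if not self.healthy:
            raise RuntimeError(f"{self.name} is down")
        return f"{self.name} served {request}"

def route(instances, request):
    """Toy load balancer: send the request to the first healthy instance."""
    for inst in instances:
        if inst.healthy:
            return inst.handle(request)
    raise RuntimeError("total outage: no healthy instances")

def chaos_experiment(instances, seed=None):
    """Minimal chaos experiment: kill a random instance on purpose,
    then check that the fleet still serves traffic (the steady-state
    hypothesis). The failure is always cleaned up afterward."""
    rng = random.Random(seed)
    victim = rng.choice(instances)
    victim.healthy = False  # inject the failure deliberately
    try:
        return route(instances, "GET /")
    finally:
        victim.healthy = True  # restore the system after the experiment
```

Real tools operate on production infrastructure with blast-radius controls, but the shape is the same: hypothesis, injected failure, observation, cleanup.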

Conclusion

Complex systems are intrinsically hazardous systems. While most web systems fortunately don’t put our lives at risk, failures can have serious consequences. Thus, we put countermeasures in place — backup systems, monitoring, DDoS protection, playbooks, GameDay exercises, etc. These measures are intended to provide a series of overlapping protections. Most failure trajectories are successfully blocked by these defenses, or by the system operators themselves.

Even with good defenses in place, complex web systems do break. But all is not lost. By acknowledging the inevitability of failure, avoiding simplistic postmortems, learning to embrace failure, and other measures described in this post, you can continually reduce the frequency and severity of failures.

Obligatory plug: Scalyr’s log management service ties powerful log analysis to a flexible dashboard and alerting system. Fast and easy to use, it encourages the kind of deep investigation and proactive monitoring that can help keep your systems on the right side of the chaos.


Mathias Lafeldt is particularly interested in Site Reliability Engineering, Chaos Engineering, and Systems Thinking. He regularly publishes thoughtful articles on these topics. You can follow him on Twitter.
