In DevOps Incident Response, Plans Are Worthless, But Planning Is Everything

So said President Dwight D. “Ike” Eisenhower (more or less). His battles were fought in the trenches, not the technology stacks—but for DevOps teams, the principle holds. No plan survives contact with the enemy.

A plan is a set of instructions you can follow when you understand what needs to be done. Handy when the enemy is one you know, and can plan for. But the Enemy You Know isn’t what (literally) keeps DevOps engineers up at night.

Good DevOps teams competently respond to incidents, outages, and just plain weird stuff happening in their technology stack. Great teams go further: they see around corners, developing the instincts and skills to prepare for the unexpected. At Scalyr we start by reducing the risk of the dreaded 3 a.m. call as much as possible. But we’d be foolish to stop there.

The Myth of the Root Cause: How Complex Web Systems Fail

Editor’s note: here at Scalyr, robust systems are a topic near and dear to our hearts, so we were happy to have the chance to work closely with Mathias on this piece. It’s based on the two-part series “How Complex Web Systems Fail,” originally published on his Production Ready mailing list.

Distributed web-based systems are inherently complex. They’re composed of many moving parts — web servers, databases, load balancers, CDNs, and many more — working together to form an intricate whole. This complexity inevitably leads to failure. Understanding how this failure happens (and how we can prevent it) is at the core of our job as operations engineers.

In his influential paper “How Complex Systems Fail,” Richard Cook shares 18 sharp observations on the nature of failure in complex medical systems. The nice thing about these observations is that most of them hold true for complex systems in general. Our intuitive notions of cause and effect, where each outage is attributable to a direct root cause, are a poor fit for the reality of modern systems.

In this post, I’ll translate Cook’s insights into the context of our beloved web systems and explore how they fail, why they fail, how you can prepare for outages, and how you can prevent similar failures from happening in the future…

So You’ve Been Paged: A Guide to Incident Response (For Those Who Hate Being Paged)

One of the inevitable joys of working in DevOps is “the page” — that dreaded notification from your alerting system that something has gone terribly wrong…and you’re the lucky person who gets to fix it.

Here at Scalyr, we’ve got a few decades of collective DevOps experience and we’ve all been on the receiving end of a page. Even though we do our best to avoid being woken up, it happens.

In this post, we’re going to put some of that experience to use and show you how to handle an incident the right way. You’ll learn not only how to fix the immediate problem, but how to grow from the experience and set your team up for smooth sailing in the future.
