In DevOps Incident Response, Plans Are Worthless, But Planning Is Everything

So said President Dwight D. “Ike” Eisenhower (more or less). His battles were fought in the trenches, not the technology stacks—but for DevOps teams, the principle holds. No plan survives contact with the enemy.

A plan is a set of instructions you can follow when you understand what needs to be done. Handy when the enemy is one you know, and can plan for. But the Enemy You Know isn’t what (literally) keeps DevOps engineers up at night.

Good DevOps teams competently respond to incidents, outages, and just plain weird stuff happening in their technology stack. Great teams go further—they see around corners, developing the instincts and skills to prepare for the unexpected. At Scalyr we start by reducing the risk of the dreaded 3 am call as much as possible. But we’d be foolish to stop there.

But Will It Plan?

Ike reminds us that “the very definition of emergency is that it is unexpected, therefore it is not going to happen the way you are planning.”

It’s critical to recognize that a plan built for the Enemy You Know will have limitations when dealing with the Enemy You Don’t. But that doesn’t mean you should throw out your playbooks and throw up your hands.

Instead of trying to build a response plan to cover every possible use case, build a plan for how to respond. As engineers, we do this in a time-tested way—by calling out our deliverables and breaking the work down, piece by piece, into components we can recognize and address.

We’ve honed our planning process through decades of DevOps experience, and in this post, you’ll view the DevOps environment through the lens of that experience. Read on to learn not only how to design an overall incident response strategy, but how to structure a resilient discipline for your team to handle any emergency.

Take Inventory (aka The Enemy You Know)

Begin the planning process by getting organized—take inventory to understand what you have, and what you need:

  • Choose your organizing medium
  • Take stock of plans, playbooks, information caches, and other artifacts
  • Identify gaps in team skills and experience

There are many options for organizing all of your relevant information. Make sure that the system you pick is collaborative, secure, highly available, and easy to add to. A wiki or a shared Google Docs folder would both work well here.

Once you’ve chosen your organizing medium, it’s time to take stock. Of all the information available, what warrants inclusion in your library of response options? Inventory your information caches and for each asset, decide whether to keep it, modify it, or chuck it. Maybe you’ve inherited a repository—it’s time to dust off the items in the cupboard and see what’s there. Every artifact in an inherited toolkit is a message from the past that has potential value.

It’s Time to Get Personal

… professionally speaking. To ensure complete emergency coverage, you’ll need to realistically and critically assess your team before you start assigning responsibility. What skills and experience does each person bring? Are there knowledge gaps that could become major impediments in an emergency? Is there an appliance or service no one is comfortable tweaking? Do any services rely on external resources, and can those resources be reached 24/7? Has anyone key to the environment moved on recently?

You’re basically performing “gap analysis” here. By identifying your team’s strengths and addressing their weaknesses, you can ready them for an emergency with roles defined and processes in place.

Let No One Be a Human Single-Point-of-Failure

Build your roster of experts by assigning ownership of each key resource to a member of your team. Then, establish a lineup of seconds-in-command to backstop each service in case your primary expert wins the lottery, or gets hit by an ice cream truck.

{
  "resources": [
    {
      "_id": 1,
      "resource_name": "www site",
      "owner": "Jill",
      "backup": "Jack",
      "resources": "http://devopsdocs.software",
      "audience": "marketing@ourslickcompany.com"
    }
  ]
}

Once you’ve identified your experts and seconds-in-command, create an appropriate alerting strategy that notifies the right people for each potential problem. A high signal-to-noise ratio is the key to keeping the focus on the important details. Avoid spamming team members with endless automated messages, or you’ll risk creating alert fatigue.
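As a minimal sketch of that routing logic, an alert router can look up the owner of the affected resource and escalate to the backup. The registry below borrows the field names from the illustrative resource record above; it is an assumption, not a fixed schema:

```python
# Hypothetical resource registry, mirroring the example record above.
resources = {
    "www site": {"owner": "Jill", "backup": "Jack"},
}

def who_to_page(resource_name, unavailable=()):
    """Return the person to notify for a resource, escalating to the
    backup when the primary owner is unavailable."""
    entry = resources.get(resource_name)
    if entry is None:
        return "on-call rotation"  # no owner on file: fall back to the rotation
    if entry["owner"] not in unavailable:
        return entry["owner"]
    return entry["backup"]

print(who_to_page("www site"))                        # Jill
print(who_to_page("www site", unavailable={"Jill"}))  # Jack
```

Keeping the fallback to a general on-call rotation explicit means an unmapped resource still pages someone, rather than silently dropping the alert.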

It Starts with DevOps

Inventory of plans and playbooks—check. Visibility into the metrics that matter—check. Focus on the most important resources—check. But you’re not done yet. Organizational preparedness for incident planning starts with DevOps, but it doesn’t end there. We are in service to something larger than us—our internal and external stakeholders, and ultimately, our end users. With that in mind, structure your communications strategy to serve the collective good.

Who Needs to Know What, and When?

Some infrastructure components will affect a specific group of people. Connecting the dots between component and stakeholder now will ensure you keep them in the loop later, and save you from under- or over-sharing critical information when you’re in emergency mode.

Build out a distribution list of the downstream stakeholders for each key resource alongside your roster of experts and backups. Then prepare the appropriate distribution channels so you can quickly disseminate information later. Communications to these folks can come in a few flavors—social media, a status page, and customer distribution lists are all valid ways to broadcast messages widely. Communications should be accurate, relevant, and timely. (And in all cases, if an event or issue needs full disclosure, be transparent and stick to the facts.)
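One way to keep that mapping usable under pressure is to store it next to your expert roster. The sketch below is hypothetical: the resource names, addresses, and channel fields are illustrative, with the marketing address taken from the example record earlier in the post:

```python
# Hypothetical stakeholder map: each key resource lists the channels
# that should receive status updates when it is affected.
stakeholders = {
    "www site": {
        "email": ["marketing@ourslickcompany.com"],
        "status_page": True,
    },
}

def channels_for(resource_name):
    """Collect the distribution channels to use for an incident
    affecting the given resource."""
    entry = stakeholders.get(resource_name, {})
    targets = list(entry.get("email", []))
    if entry.get("status_page"):
        targets.append("status page")
    return targets

print(channels_for("www site"))
```

An unknown resource returns an empty list here, which is itself a useful signal during inventory: it flags a resource with no stakeholders on file.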

Beyond ensuring coverage in the event of ice cream truck mayhem, your secondary experts for each functional area can play a critical role in these stakeholder communications. The balance between fixing problems and keeping stakeholders updated is a familiar tension. Your second-in-command can manage the flow of information from problem solver to stakeholder, translating technical details into pertinent status updates.

Save your communications to review and iterate on during future debrief sessions. Tweaking them over time will continually improve their quality and can add to a library of responses for future use.

This Is Only a Test

Engineers should practice the art of gathering relevant information and applying it quickly, decisively, and properly in a variety of situations. “Keep yourselves steeped in the character of the problem you may one day be called upon to solve” (this Eisenhower guy has some solid advice) and you will be ready for any emergency that comes your way.

Planning is the practice of thinking through scenarios that can occur and creating an appropriate response. Your team can test drive those scenarios, to show how specific responses will lead to good outcomes. The process of making a plan or playbook trains your brain to respond well, and the most effective strategies are the result of testing, refining, and socializing across your teams. As an added benefit, you can use the results to improve your existing documentation.

To test the efficacy of your plans and playbooks, quiz team members on hypothetical scenarios, and, if possible, conduct actual drills in a test environment.

It Takes an Incident

Running hypothetical scenarios past your team is a low-risk way to prepare for an emergency. Using a playbook as a guide, have one team member quiz another, starting from a possible situation like, “You receive an alert from the message queue service. The message states that the primary queue is over the warning threshold of 1,000 items. What actions do you take?”
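The scenario above can be grounded in a toy check. The 1,000-item warning threshold comes from the scenario itself; the 5,000-item critical threshold is an assumed value for illustration:

```python
# Classify a message-queue depth against the drill's thresholds.
WARNING_THRESHOLD = 1_000    # from the scenario above
CRITICAL_THRESHOLD = 5_000   # assumed value for illustration

def queue_status(depth):
    """Map a queue depth to an alert severity."""
    if depth >= CRITICAL_THRESHOLD:
        return "critical"
    if depth >= WARNING_THRESHOLD:
        return "warning"
    return "ok"

print(queue_status(1_200))  # warning
```

A quiz can then walk the responder through each band: what do you check at "warning", and what changes once the queue crosses into "critical"?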

Playing out hypothetical situations gives you insight into how team members will react, what blind alleys are common, and ways to optimize your playbook, while keeping plans fresh, relevant, and visible to the community.

Another more time-consuming but ultimately more accurate test is to initiate specific events within a staging environment. With a duplicate environment, running under load and being monitored using the same tools as production, you can run realistic evaluation and response drills.

This type of test is more work, but you’ll get a great feel for what can happen, and the possibility you’ll encounter an unforeseen detail is higher. This added element of realism will make the entire exercise more valuable than a simple thought experiment. Just make sure your testing environment is fully sandboxed, and that anyone receiving notifications is aware of what’s going on. No false alarms, please!
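To make the drill concrete, here is a sketch of flooding a stand-in queue until the warning threshold trips. Everything here is simulated in memory; a real drill would inject messages into your staging broker and let your actual monitoring stack raise the alert:

```python
# Simulated sandboxed drill: flood a stand-in staging queue and confirm
# the warning alert would have fired. Queue and alerting are in-memory
# stand-ins, not a real broker or monitor.
WARNING_THRESHOLD = 1_000

staging_queue = []
alerts = []

def check_and_alert(queue):
    """Stand-in for a monitoring check: record an alert when the
    queue depth exceeds the warning threshold."""
    if len(queue) > WARNING_THRESHOLD:
        alerts.append(f"primary queue over threshold: {len(queue)} items")

# Inject synthetic messages in batches, checking after each batch
# the way a polling monitor would.
for batch in range(12):
    staging_queue.extend(f"msg-{batch}-{i}" for i in range(100))
    check_and_alert(staging_queue)

print(alerts[0] if alerts else "no alert fired")
```

The drill succeeds if the alert fires at the expected depth and reaches the expected people; if either fails, that gap goes straight into the playbook.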

When It’s Not a Test

These guidelines will be valuable in preparing your team and your playbooks for whatever comes your way. Then, when the inevitable happens (and you know it will), you’ll be ready. Can following our guidelines substitute for decades of experience? You tell us — share your own experience in the comments section.