This is the first in a series of articles on server monitoring techniques. If you’re responsible for a production service, this series is for you.
In this post, I’ll describe a technique for writing alerting rules. The idea is deceptively simple: alert when normal, desirable user actions suddenly drop. This may sound mundane, but it’s a surprisingly effective way to increase your alerting coverage (the percentage of problems for which you’re promptly notified), while minimizing false alarms.
Most alerting rules work on the opposite principle — they trigger on a spike in undesired events, such as server errors. But as we’ll see, the usual approach has serious limitations. The good news: by adding a few alerts based on the new technique, you can greatly improve your alerting coverage with minimal effort.
Why is this important? Ask Microsoft and Oracle.
With almost any monitoring tool, it’s pretty easy to set up some basic alerts. But it’s surprisingly hard to do a good and thorough job, and even the pros can get it wrong. One embarrassing instance: on October 13, 2012, for about an hour, Oracle’s home page consisted entirely of the words “Hello, World”. Oracle hasn’t said anything about this incident, but the fact that it was not corrected for over an hour suggests that it took a while for the ops team to find out that anything was wrong — a pretty serious alerting failure.
A more famous example occurred on February 28th, 2012. Microsoft’s Azure service suffered a prolonged failure during which no new VMs could be launched. We know from Microsoft’s postmortem that it took 75 minutes for the first alert to trigger. For 75 minutes, no Azure VM could launch anywhere in the world, and the Azure operations team had no idea anything was wrong. (The whole incident was quite fascinating. I dissected it in an earlier post; you can find Microsoft’s postmortem linked there.)
This is just the tip of the iceberg; alerting failures happen all the time. Also common are false positives — noisy alerts that disrupt productivity or teach operators the dangerous habit of ignoring “routine” alerts.
If you see something, say something
If you’ve ridden a New York subway in recent years, you’ve seen this slogan. Riders are encouraged to alert the authorities if they see something that looks wrong — say, a suspicious package left unattended. Most monitoring alerts are built on a similar premise: if the monitoring system sees something bad — say, a high rate of server errors, elevated latency, or a disk filling up — it generates a notification.
This is sensible enough, but it’s hard to get it right. Some challenges:
- It’s hard to anticipate every possible problem. For each type of problem you want to detect, you have to find a way to measure instances of that problem, and then specify an alert on that measurement.
- Users can wake you up at 3:00 AM for no reason. It’s hard to define metrics that distinguish between a problem with your service, and a user (or their equipment) doing something odd. For instance, a single buggy sync client, frantically retrying an invalid operation, can generate a stream of server “errors” and trigger an alert.
- Operations that never complete, never complain. If a problem causes operations to never complete at all, those operations may not show up in the log, and your alerts won’t see anything wrong. (Eventually, a timeout or queue limit might kick in and start recording errors that your alerting system can see… but this is chancy, and might not happen until the problem has been going on for a while.) This was a factor in the Azure outage.
- Detecting incorrect content. It’s easy to notice when your server trips an exception and returns a 500 status. It’s a lot harder to detect when the server returns a 200 status, but due to a bug, the page is malformed or is missing data. This was presumably why Oracle didn’t spot their “Hello, World” homepage.
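One common mitigation for that last challenge is a content probe: periodically fetch a key page and verify not just the status code but the presence of a known marker string, so a wrong-but-200 page still fails the check. Here's a minimal sketch in Python; the URL, marker, and function names are illustrative, not anything from a specific tool:

```python
import urllib.request

def content_ok(status, body, marker):
    """A 200 status alone is not enough: also require a known marker
    string, so a well-formed but wrong page (a "Hello, World"
    homepage, say) still counts as a failure."""
    return status == 200 and marker in body

def page_looks_healthy(url, marker, timeout=10):
    """Fetch a page and apply content_ok; network errors and
    timeouts also count as unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return content_ok(resp.status, body, marker)
    except OSError:
        return False
```

A probe like this would have flagged the Oracle incident, but only for the pages and markers you thought to check, which is why the user-driven approach described next is a useful complement.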
Let your users be your guide
The challenges with traditional alerting can be summarized as: servers are complicated, and it’s hard to distinguish desired from undesired behavior. Fortunately, you have a large pool of human experts who can make this determination. They’re called “users”.
Users are great at noticing problems on your site. They’ll notice if it’s slow to load. They’ll notice error pages. They’ll also notice subtler things — nonsensical responses, incomplete or incorrect data, actions that never complete.
Someone might come right out and tell you that you have a problem, but you don’t want to rely on that — it might take a long time for the message to work its way through to your operations team. Fortunately, there’s another approach: watch for a dropoff in normal operations. If users can’t get to your site, or can’t read the page, or their data isn’t showing up, they’ll react by not doing things they normally do — and that’s something you can easily detect.
This might seem simple, but it’s a remarkably robust way of detecting problems. An insufficient-activity alert can detect a broad spectrum of problems, including incorrect content, as well as operations that don’t complete. Furthermore, it won’t be thrown off by a handful of users doing something strange, so there will be few false alarms.
Consider the real-world incidents mentioned above. In Oracle’s case, actions that would normally occur as a result of users clicking through from the home page would have come to a screeching halt. In Microsoft’s case, the rate of successful VM launches dropped straight to zero the moment the incident began.
Red alert: insufficient dollars per second
When I worked at Google, internal lore held that the most important alert in the entire operations system checked for a drop in “advertising dollars earned per second”. This is a great metric to watch, because it rolls up all sorts of behavior. Anything from a data center connectivity problem, to a code bug, to mistuning in the AdWords placement algorithms would show up here. And as a direct measurement of a critical business metric, it’s relatively immune to false alarms. Can you think of a scenario where Google’s incoming cash takes a sudden drop, and the operations team wouldn’t want to know about it?
Alongside your traditional “too many bad things” alerts, you should have some “not enough good things” alerts. The specifics will depend on your application, but you might look for a dropoff in page loads, or invocations of important actions. It’s a good idea to cover a variety of actions. For instance, if you were in charge of operations for Twitter, you might start by monitoring the rate of new tweets, replies, clickthroughs on search results, accounts created, and successful logins. Think about each important subsystem you’re running, and make sure that you’re monitoring at least one user action which depends on that subsystem.
It’s often best to look for a sudden drop, rather than comparing to a fixed threshold. You might alert if the rate of events over the last 5 minutes is 30% lower than the average over the preceding half hour. This avoids false positives or negatives due to normal variations in usage.
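As a concrete sketch of that rule, here's one way to evaluate it given a sorted list of event timestamps (in seconds). The window sizes and 30% threshold mirror the example above; the function names and the choice to measure the baseline over the half hour immediately preceding the recent window are illustrative assumptions:

```python
from bisect import bisect_left

def rate(timestamps, start, end):
    """Events per second within [start, end), given sorted timestamps."""
    count = bisect_left(timestamps, end) - bisect_left(timestamps, start)
    return count / (end - start)

def sudden_drop(timestamps, now, recent=300, baseline=1800, threshold=0.7):
    """True if the event rate over the last `recent` seconds is more
    than 30% below the average rate over the `baseline` seconds
    immediately preceding that window."""
    recent_rate = rate(timestamps, now - recent, now)
    baseline_rate = rate(timestamps, now - recent - baseline, now - recent)
    return recent_rate < threshold * baseline_rate
```

Because the comparison is relative, the alert adapts as traffic grows or shrinks over time; in practice you'd also want a floor on the baseline rate so that a low-traffic action doesn't alert on noise.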
Note that dropoff alerts are a complement to the usual “too many bad things” alerts, not a replacement. Each approach can find problems that will be missed by the other.
A quick plug
If you liked this article, you’ll probably like Scalyr Logs, our hosted monitoring service. Logs is a comprehensive DevOps tool, combining server monitoring, log analysis, alerts, and dashboards into a single easy-to-use service. Built by experienced devops engineers, it’s designed with the practical, straightforward, get-the-job-done approach shown here.
It’s easy to create usage-based alerts in Logs. Suppose you want to alert if posts to the URL “/upload” drop by 30% in five minutes. The following expression will do the trick:
countPerSecond:5m(POST '/upload') < 0.7 * countPerSecond:30m(POST '/upload')
To check for drops in some other event, just change the query inside the two pairs of parentheses.
Last year, I gave a talk on server monitoring which touched on this technique and a variety of others. You can watch it at youtube.com/watch?v=6NVapYun0Xc. Or stay tuned to this blog for more articles in this series.
“Holy crap. You guys are awesome… I’m already finding issues I wasn’t aware of. The ability to click on a piece of the log and find similar items is fantastic.”
Eighteen months ago, we began developing Scalyr Logs. Scalyr Logs combines server monitoring, log collection and analysis, alerts, dashboards, and other functions into a practical, comprehensive DevOps tool. Last fall, we began real-world deployments in a closed beta program. The quote above is an unsolicited comment from one of our beta customers. Today, we’re excited to announce that we have exited beta and the service is available to all.
We are a team of ex-Google engineers with years of production experience. We know what it’s like to be on call, get a trouble alert, and not have enough information to narrow down the problem. We know what it’s like to juggle half a dozen balky monitoring systems in an effort to get a complete picture. We know what it’s like to scramble to respond to a crisis, casting about to find out what’s wrong while waiting agonizing seconds for each new graph to load. We know what it’s like to *know* that the information you need is in the logs somewhere, but not be able to get to it without taking time you don’t have to write code you don’t want to maintain.
Scalyr Logs is the straightforward, comprehensive monitoring tool we’d always wanted. Using a simple agent, you upload logs, system metrics, custom metrics, and other data in realtime to our custom-built database. You can then search, analyze, graph, build dashboards, and define alerts, all in a single easy-to-use package. It’s a service, so there’s no management overhead. And it’s fast, practical, and powerful. So if your current monitoring solution is frustrating you, if you can’t get the information you need when you need it, or you’re spending too much time juggling tools, Scalyr Logs is for you.
The reaction from our beta customers has been gratifyingly positive. More unsolicited quotes:
“This is the kind of tool… well, I sat down yesterday morning to glance at something, found myself exploring, and the next thing I knew it was time for lunch.”
“Love the performance.”
“You get an ‘aha’ moment when you’ve found it possible to pinpoint something that’s gone wrong using filters, facets and pivots.”
“This is great!”
“You guys are doing great work!”
“+1 on wallview. Great!”
“[log exploration] is great!”
Learn more here. We offer a 30-day free trial for up to 10 servers — sign up today and be up and running in minutes!
Amazon has posted the talks from re:Invent on YouTube. The video from the EBS session is here. My brief presentation on “Benchmarking in the Cloud” starts at the 30:16 mark (direct link). You can download my slides here.
It was a terrific conference. The pace of development, and just plain enthusiasm and energy, around cloud services in general and AWS in particular is just amazing. I do recommend checking out some of the talks if you have time.
The video of my talk on server monitoring (“Famous Outages, and How To Not Have Them”) is now available:
Thanks to Box for providing the venue and a good crowd, and thanks to the crowd for a great response. The talk is aimed at anyone who is running a production system, large or small. The focus is on how to get good monitoring coverage for a reasonable investment of effort, spiced up with plenty of stories about real-world production outages.
I’ll be speaking briefly on the subject of Cloud Benchmarks at Amazon’s re:Invent conference, in Las Vegas this week. This will be a short presentation during the “Using Amazon Elastic Block Store” session, 2:05 Wednesday afternoon in Venetian B. If you happen to be at the conference, come check it out — if not for my presentation, then for Scot VanDenPlas, devops lead for the noted Obama for America technology effort, who will be speaking in the same session.
We’ll be around the show on Wednesday and Thursday. If you’re going to be there and would like to chat (about server monitoring, cloud benchmarks, or anything else), drop me a line at firstname.lastname@example.org.