Good News: Your Monitoring Is All Wrong

This is the first in a series of articles on server monitoring techniques. If you’re responsible for a production service, this series is for you.

In this post, I’ll describe a technique for writing alerting rules. The idea is deceptively simple: alert when normal, desirable user actions suddenly drop. This may sound mundane, but it’s a surprisingly effective way to increase your alerting coverage (the percentage of problems for which you’re promptly notified), while minimizing false alarms.
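As a rough illustration of the idea (a minimal sketch, not the rule syntax of any particular monitoring tool), the check below compares the most recent count of a desirable user action against the median of the preceding windows. The function name `sudden_drop`, its parameters, and the thresholds are all assumptions made up for this example.

```python
from statistics import median


def sudden_drop(counts, baseline_windows=6, drop_ratio=0.5, min_baseline=10):
    """Return True if the newest window fell well below the recent baseline.

    counts           -- per-window counts of a normal user action
                        (e.g. logins per 5 minutes), oldest first
    baseline_windows -- how many earlier windows form the baseline (assumed)
    drop_ratio       -- alert when latest < drop_ratio * baseline (assumed)
    min_baseline     -- skip alerting when traffic is too low to judge (assumed)
    """
    # Not enough history yet to form a baseline.
    if len(counts) < baseline_windows + 1:
        return False

    # Split the tail of the series into the baseline windows and the newest one.
    *history, latest = counts[-(baseline_windows + 1):]

    # The median is less sensitive than the mean to a single odd window.
    baseline = median(history)

    # At very low volumes a drop to zero is normal noise, so stay quiet.
    if baseline < min_baseline:
        return False

    return latest < drop_ratio * baseline


if __name__ == "__main__":
    logins = [100, 105, 98, 110, 102, 99, 20]
    if sudden_drop(logins):
        print("ALERT: logins dropped sharply")
```

The low-traffic guard is one way to keep false alarms down: without it, a quiet period overnight could look like an outage.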

Announcing Scalyr Logs

“Holy crap. You guys are awesome… I’m already finding issues I wasn’t aware of. The ability to click on a piece of the log and find similar items is fantastic.”

18 months ago, we began developing Scalyr, which combines server monitoring, log collection and analysis, alerts, dashboards, and other functions into a practical, comprehensive DevOps tool. Last fall, we began real-world deployments in a closed beta program. The quote above was a comment – unsolicited – from one of our beta customers. Today, we’re excited to announce that we have exited beta and the service is available for all.

“Benchmarking in the Cloud” talk online

Amazon has posted the talks from re:Invent on YouTube. The video from the EBS session is here. My brief presentation on “Benchmarking in the Cloud” starts at the 30:16 mark (direct link). You can download my slides here.

It was a terrific conference. The pace of development, and the sheer enthusiasm and energy, around cloud services in general and AWS in particular is amazing. I recommend checking out some of the talks if you have time.

Server Monitoring Talk Now Online

The video of my talk on server monitoring (“Famous Outages, and How To Not Have Them”) is now available:

Thanks to Box for providing the venue and a good crowd, and thanks to the crowd for a great response. The talk is aimed at anyone who is running a production system, large or small. The focus is on how to get good monitoring coverage for a reasonable investment of effort, spiced up with plenty of stories about real-world production outages.

Cloud Benchmarks presentation at re:Invent

I’ll be speaking on the subject of Cloud Benchmarks at Amazon’s re:Invent conference in Las Vegas this week. This will be a brief presentation during the “Using Amazon Elastic Block Store” session, 2:05 Wednesday afternoon in Venetian B. If you happen to be at the conference, come check it out: if not for my presentation, then for Scot VanDenPlas, devops lead for the noted Obama for America technology effort, who will be speaking in the same session.

We’ll be around the show on Wednesday and Thursday. If you’re going to be there and would like to chat (about server monitoring, cloud benchmarks, or anything else), drop me a line at

Tech Talk: Famous Outages, and How To Not Have Them

This Wednesday at 6:00 PM, I’ll be giving a talk on server monitoring at Box headquarters in Los Altos, California. If you’re in the area, it should be fun. If not, we’ll be posting the video on YouTube later. Register (it’s free!) at:

Your company is growing rapidly and becoming more successful every day. You have a team that actively does server monitoring. Or maybe you are still too small to dedicate resources to it. You think you are prepared for the worst… and then seemingly out of the blue, your site goes down and it feels like the world has ended. What do you do? What went wrong? How could you have prevented it?

Steve Newman knows this pain. In this talk, he will discuss going beyond the basics of server monitoring: to detect subtle problems before your users do, to use forensic techniques for chasing down non-reproducible bugs, to actively do capacity planning, and more.

The talk will be built around a series of postmortems of real-world incidents, some of which made the newspapers.

Come hear one of the founding fathers of Google Docs talk at Box!

A Systematic Look at EC2 I/O

At Scalyr, we’re building a large-scale storage system for timeseries and log data. To make good design decisions, we need hard data about EC2 I/O performance.

Plenty of data has been published on this topic, but we couldn’t really find the answers we needed. Most published data is specific to a particular application or EC2 configuration, or was collected from a small number of instances and hence is statistically suspect. (More on this below.)

Since the data we wanted wasn’t readily available, we decided to collect it ourselves. For the benefit of the community, we’re presenting our results here. These tests involved over 1000 EC2 instances, $1000 in AWS charges, and billions of I/O operations.

Introducing Scalyr Logs

Today we’re excited to announce a pair of new services from Scalyr:

Scalyr is a new approach to server monitoring and analysis. Traditionally, this has been treated as a series of special-case problems: timeseries/graphing, log search, external monitoring, dashboards, alerting, exception tracking, performance analysis, etc. In my career, I’ve had to juggle too many tools in an attempt to get a complete picture of a system’s behavior, and been frustrated by the disconnected, patchwork result. I’ve spent far too many hours trying to figure out which graph explains why my pager went off, or which logs might help me understand why an error graph just spiked, or taking random peeks into log files because I don’t have a tool that can analyze them in the way I need.