Being “on call” is often the most dreaded part of server operations. In the immortal words of Devops Borat, “Devops is intersection of lover of cloud and hater of wake up at 3 in morning.” Building and operating sophisticated systems can be a lot of fun, but it comes with a dark side: being jarred out of a sound sleep by the news that your site is down, often in some new and mysterious way. Keeping your servers stable around the clock tends to clash with a sane work schedule.
At Scalyr, we work hard to combat this. Our product is a server monitoring and log analysis service. It’s internally complex, running mostly custom-built software on about 20 servers. But in the last 12 months, with little after-hours attention, we’ve had less than one hour of downtime. There were only 11 pager incidents outside the hours of 9:00 AM to 5:00 PM, and most were quickly identifiable as false alarms, dismissible in less time than it would take for dinner to get cold.
In this article, I explain how we keep things running on a mostly 9-to-5 schedule.