Cloud Cost Calculator

Editor’s Note: Although many of you may still find this somewhat dated post interesting, the calculator itself has been retired for now. We’ve removed the links to the Cloud Cost Calculator below.

There are many, many options for cloud server hosting nowadays. EC2 pricing alone is so complex that quite a few pages have been built to help sort it out. Even so, while comparing costs for various scenarios (on-demand vs. reserved instances, “light utilization” vs. “heavy utilization” reservations, EC2 vs. other cloud providers), we here at Scalyr recently found ourselves building spreadsheets and looking up net-present-value formulas. That seemed a bit silly, so we decided to do something about it. And so we now present, without further ado: the Cloud Cost Calculator [Link Removed – Content out of date!].
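To give a feel for the kind of net-present-value comparison mentioned above, here is a minimal sketch in Python. It is not the retired calculator's logic. The function name, the prices, the one-year term, and the 10% discount rate are all hypothetical placeholders chosen for illustration, not real EC2 rates.

```python
def present_value_cost(upfront, hourly_rate, hours_per_month, months,
                       annual_discount_rate):
    """Present value of an upfront fee plus discounted monthly usage charges.

    Assumes usage is billed at the end of each month and that the annual
    discount rate compounds monthly. Both are simplifying assumptions.
    """
    monthly_rate = (1 + annual_discount_rate) ** (1 / 12) - 1
    total = upfront
    for month in range(1, months + 1):
        monthly_charge = hourly_rate * hours_per_month
        total += monthly_charge / (1 + monthly_rate) ** month
    return total


if __name__ == "__main__":
    # Hypothetical prices, for illustration only.
    on_demand = present_value_cost(
        upfront=0.0, hourly_rate=0.10, hours_per_month=720,
        months=12, annual_discount_rate=0.10)
    reserved = present_value_cost(
        upfront=300.0, hourly_rate=0.04, hours_per_month=720,
        months=12, annual_discount_rate=0.10)
    print(f"on-demand: ${on_demand:.2f}  reserved: ${reserved:.2f}")
```

Lowering `hours_per_month` shows why utilization matters: the upfront fee of a reservation is paid regardless of usage, so it only wins when the instance runs enough hours.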

Exploring the GitHub Events Firehose

Here at Scalyr, we’ve been having a lot of fun building out a high-speed query engine for log data, and a snappy UI using AngularJS. However, we haven’t had a good way to show it off: a data exploration tool is useless without data to explore. This has been a challenge when it comes to giving people a way to play with Scalyr Logs before signing up. We recently learned that GitHub provides a feed of all actions on public repositories. That sounded like a fun basis for a demo, so we began importing the feed. (To explore the data yourself, see the last paragraph.)

Good News: Your Monitoring Is All Wrong

This is the first in a series of articles on server monitoring techniques. If you’re responsible for a production service, this series is for you.

In this post, I’ll describe a technique for writing alerting rules. The idea is deceptively simple: alert when normal, desirable user actions suddenly drop. This may sound mundane, but it’s a surprisingly effective way to increase your alerting coverage (the percentage of problems for which you’re promptly notified), while minimizing false alarms.
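As a rough illustration of the idea, here is a minimal Python sketch of a "sudden drop" rule. It is not the rule from the full article. The function name, window count, drop ratio, and minimum-baseline threshold are all hypothetical choices, and the input is assumed to be a list of per-window counts of some desirable user action (for example, successful page loads per five minutes), oldest first.

```python
from statistics import mean


def should_alert(counts, baseline_windows=12, drop_ratio=0.5,
                 min_baseline=10.0):
    """Return True if the latest window fell well below the recent baseline.

    counts: per-window counts of a normal user action, oldest first.
    The baseline is the mean of the `baseline_windows` windows that
    immediately precede the latest one.
    """
    if len(counts) < baseline_windows + 1:
        return False  # not enough history to judge
    current = counts[-1]
    baseline = mean(counts[-(baseline_windows + 1):-1])
    if baseline < min_baseline:
        return False  # traffic too low for a drop to be meaningful
    return current < drop_ratio * baseline


if __name__ == "__main__":
    steady = [100] * 12
    print(should_alert(steady + [95]))  # small dip
    print(should_alert(steady + [20]))  # sharp drop
```

The minimum-baseline guard is one way to limit false alarms during quiet periods, when a count falling from 3 to 1 is noise rather than an outage.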

Announcing Scalyr Logs

“Holy crap. You guys are awesome… I’m already finding issues I wasn’t aware of. The ability to click on a piece of the log and find similar items is fantastic.”

Eighteen months ago, we began developing Scalyr, which combines server monitoring, log collection and analysis, alerts, dashboards, and other functions into a practical, comprehensive DevOps tool. Last fall, we began real-world deployments in a closed beta program. The quote above was a comment – unsolicited – from one of our beta customers. Today, we’re excited to announce that we have exited beta and the service is available to all.

“Benchmarking in the Cloud” talk online

Amazon has posted the talks from re:Invent on YouTube. The video from the EBS session is here. My brief presentation on “Benchmarking in the Cloud” starts at the 30:16 mark (direct link). You can download my slides here.

It was a terrific conference. The pace of development, and the sheer enthusiasm and energy, around cloud services in general and AWS in particular are amazing. I do recommend checking out some of the talks if you have time.

Server Monitoring Talk Now Online

The video of my talk on server monitoring (“Famous Outages, and How To Not Have Them”) is now available.

Thanks to Box for providing the venue and a good crowd, and thanks to the crowd for a great response. The talk is aimed at anyone who is running a production system, large or small. The focus is on how to get good monitoring coverage for a reasonable investment in effort, spiced up with plenty of stories about real-world production outages.

Cloud Benchmarks presentation at re:Invent

I’ll be speaking on the subject of Cloud Benchmarks at Amazon’s re:Invent conference in Las Vegas this week. This will be a brief presentation during the “Using Amazon Elastic Block Store” session, 2:05 Wednesday afternoon in Venetian B. If you happen to be at the conference, come check it out: if not for my presentation, then for Scot VanDenPlas, devops lead for the noted Obama for America technology effort, who will be speaking in the same session.

We’ll be around the show on Wednesday and Thursday. If you’re going to be there and would like to chat (about server monitoring, cloud benchmarks, or anything else), drop me a line at

Tech Talk: Famous Outages, and How To Not Have Them

This Wednesday at 6:00 PM, I’ll be giving a talk on server monitoring at Box headquarters in Los Altos, California. If you’re in the area, it should be fun. If not, we’ll be posting the video on YouTube later. Register (it’s free!) at:

Your company is growing rapidly and becoming more successful every day. You have a team that actively does server monitoring. Or maybe you are still too small to dedicate resources to it. You think you are prepared for the worst… and then seemingly out of the blue, your site goes down and it feels like the world has ended. What do you do? What went wrong? How could you have prevented it?

Steve Newman knows this pain. In this talk, he will discuss going beyond the basics of server monitoring: to detect subtle problems before your users do, to use forensic techniques for chasing down non-reproducible bugs, to actively do capacity planning, and more.

The talk will be built around a series of postmortems of real-world incidents, some of which made the newspapers.

Come hear one of the founding fathers of Google Docs talk at Box!

A Systematic Look at EC2 I/O

At Scalyr, we’re building a large-scale storage system for timeseries and log data. To make good design decisions, we need hard data about EC2 I/O performance.

Plenty of data has been published on this topic, but we couldn’t really find the answers we needed. Most published data is specific to a particular application or EC2 configuration, or was collected from a small number of instances and hence is statistically suspect. (More on this below.)

Since the data we wanted wasn’t readily available, we decided to collect it ourselves. For the benefit of the community, we’re presenting our results here. These tests involved over 1000 EC2 instances, $1000 in AWS charges, and billions of I/O operations.