Searching 1TB/sec: Systems Engineering Before Algorithms

TL;DR: Four years ago, I left Google with the idea for a new kind of server monitoring tool. The idea was to combine traditionally separate functions such as log exploration, log aggregation and analysis, metrics gathering, alerting, and dashboard generation into a single service. One tenet was that the service should be fast, giving ops teams a lightweight, interactive, “fun” experience. This would require analyzing multi-gigabyte data sets at subsecond speeds, and doing it on a budget. Existing log management tools were often slow and clunky, so we were facing a challenge, but the good kind — an opportunity to deliver a new user experience through solid engineering.

This article describes how we met that challenge using an “old school”, brute-force approach, by eliminating layers and avoiding complex data structures. There are lessons here that you can apply to your own engineering challenges.

Read More

Cloud Cost Calculator

Editor’s Note!: While many of you may find this somewhat dated post to be interesting, the calculator itself has been retired for now. We’ve removed links to the Cloud Calculator below.

There are many, many options for cloud server hosting nowadays. EC2 pricing alone is so complex that quite a few pages have been built to help sort it out. Even so, while comparing costs for various scenarios — on demand vs. reserved instances, “light utilization” vs. “heavy utilization” reservations, EC2 vs. other cloud providers — we here at Scalyr recently found ourselves building spreadsheets and looking up net-present-value formulas. That seemed a bit silly, so we decided to do something about it. And so we now present, without further ado: the Cloud Cost Calculator [Link Removed – Content out of date!].Read More

Optimizing AngularJS: 1200ms to 35ms

Edit: Due to the level of interest, we’ve released the source code to the work described here:

Here at Scalyr, we recently embarked on a full rewrite of our web client. Our application is a broad-spectrum monitoring and log analysis tool. Our home-grown log database executes most queries in tens of milliseconds, but each interaction required a page load, taking several seconds for the user.Read More

Exploring the Github Events Firehose

Here at Scalyr, we’ve been having a lot of fun building out a high-speed query engine for log data, and a snappy UI using AngularJS. However, we haven’t had a good way to show it off: a data exploration tool is useless without data to explore. This has been a challenge when it comes to giving people a way to play with Scalyr Logs before signing up. We recently learned that Github provides a feed of all actions on public repositories. That sounded like a fun basis for a demo, so we began importing the feed. (To explore the data yourself, see the last paragraph.)Read More

Good News: Your Monitoring Is All Wrong

This is the first in a series of articles on server monitoring techniques. If you’re responsible for a production service, this series is for you.

In this post, I’ll describe a technique for writing alerting rules. The idea is deceptively simple: alert when normal, desirable user actions suddenly drop. This may sound mundane, but it’s a surprisingly effective way to increase your alerting coverage (the percentage of problems for which you’re promptly notified), while minimizing false alarms.Read More

A Systematic Look at EC2 I/O

At Scalyr, we’re building a large-scale storage system for timeseries and log data. To make good design decisions, we need hard data about EC2 I/O performance.

Plenty of data has been published on this topic, but we couldn’t really find the answers we needed. Most published data is specific to a particular application or EC2 configuration, or was collected from a small number of instances and hence is statistically suspect. (More on this below.)

Since the data we wanted wasn’t readily available, we decided to collect it ourselves. For the benefit of the community, we’re presenting our results here. These tests involved over 1000 EC2 instances, $1000 in AWS charges, and billions of I/O operations.Read More

Transparency in Cloud Services

37signals recently launched public “Uptime Reports” for their applications (announcement). The reaction on Hacker News was rather tepid, but I think it’s a positive development, and I applaud 37signals for stepping forward. Reliability of cloud applications is a real concern, and there’s not nearly enough hard data out there. Not all products are equally reliable; even within 37signals, the new reports show a 3:1 variation in downtime across apps.Read More