Impossible Engineering Problems Often Aren’t

When your problem is impossible, redefine the problem.

In an earlier article, I described how Scalyr searches logs at tens of gigabytes per second using brute force. This works great for its intended purpose: enabling exploratory analysis of all your logs in realtime. However, we realized early on that some features of Scalyr―such as custom dashboards built on data parsed from server logs―would require searching terabytes per second. Gulp!

In this article, I’ll describe how we solved the problem, using two helpful principles for systems design:

  • Common user actions must lead to simple server actions. Infrequent user actions can lead to complex server actions.
  • Find a data structure that makes your key operation simple. Then design your system around that data structure.

Often, a seemingly impossible challenge becomes tractable if you can reframe it. These principles can help you find an appropriate reframing for systems engineering problems. (more…)

Read More

Searching 20 GB/sec: Systems Engineering Before Algorithms

TL;DR: Four years ago, I left Google with the idea for a new kind of server monitoring tool. The idea was to combine traditionally separate functions such as log exploration, log aggregation and analysis, metrics gathering, alerting, and dashboard generation into a single service. One tenet was that the service should be fast, giving ops teams a lightweight, interactive, “fun” experience. This would require analyzing multi-gigabyte data sets at subsecond speeds, and doing it on a budget. Existing log management tools were often slow and clunky, so we were facing a challenge, but the good kind — an opportunity to deliver a new user experience through solid engineering.

This article describes how we met that challenge using an “old school”, brute-force approach, by eliminating layers and avoiding complex data structures. There are lessons here that you can apply to your own engineering challenges. (more…)

Read More

A New Look for Server Log Monitoring

As engineers, we take pride in building tools for ourselves and others like us. Since our launch, we’ve been continually expanding and improving the capabilities of the Scalyr devops tools in response to our customers. However, our documentation sometimes lagged behind — until now. (more…)

Read More

One API for All Your Server Logs

Our goal at Scalyr is to provide sysadmins and DevOps engineers with a single log monitoring tool that replaces the hodgepodge of tools they were previously using. We’ve come a long way in doing that. Today, Scalyr is a unified, cloud-based tool that lets you aggregate multiple server logs, monitor and analyze them, set custom alerts, and create custom dashboards. Still, we work hard to continue improving and making it an even more useful tool for you, and we listen closely to users’ feedback. (more…)

Read More

Cloud Cost Calculator

There are many, many options for cloud server hosting nowadays. EC2 pricing alone is so complex that quite a few pages have been built to help sort it out. Even so, while comparing costs for various scenarios — on demand vs. reserved instances, “light utilization” vs. “heavy utilization” reservations, EC2 vs. other cloud providers — we here at Scalyr recently found ourselves building spreadsheets and looking up net-present-value formulas. That seemed a bit silly, so we decided to do something about it. And so we now present, without further ado: the Cloud Cost Calculator. (more…)

Read More

Optimizing AngularJS: 1200ms to 35ms

Edit: Due to the level of interest, we’ve released the source code to the work described here: https://github.com/scalyr/angular. Also, some good discussion at Hacker News.

Here at Scalyr, we recently embarked on a full rewrite of our web client. Our application, Scalyr Logs, is a broad-spectrum monitoring and log analysis tool. Our home-grown log database executes most queries in tens of milliseconds, but each interaction required a page load, taking several seconds for the user. (more…)

Read More

Exploring the Github Events Firehose

Here at Scalyr, we’ve been having a lot of fun building out a high-speed query engine for log data, and a snappy UI using AngularJS. However, we haven’t had a good way to show it off: a data exploration tool is useless without data to explore. This has been a challenge when it comes to giving people a way to play with Scalyr Logs before signing up. We recently learned that Github provides a feed of all actions on public repositories. That sounded like a fun basis for a demo, so we began importing the feed. (To explore the data yourself, see the last paragraph.) (more…)

Read More

Good News: Your Monitoring Is All Wrong

This is the first in a series of articles on server monitoring techniques. If you’re responsible for a production service, this series is for you.

In this post, I’ll describe a technique for writing alerting rules. The idea is deceptively simple: alert when normal, desirable user actions suddenly drop. This may sound mundane, but it’s a surprisingly effective way to increase your alerting coverage (the percentage of problems for which you’re promptly notified), while minimizing false alarms. (more…)

Read More