Summary: Most server problems, once identified, can be quickly solved with a simple compensating action—for instance, rolling back the bad code you just pushed. The worst outages are those where reversing the cause doesn’t undo the effect. Fortunately, this type of issue usually generates some visible markers before developing into a crisis. In this post, I’ll talk about how you can avoid a lot of operational grief by watching for those markers. (more…)
A few years ago, we set out to rebuild server and log monitoring from the ground up. Today marks a new and exciting chapter in the story. To tell it properly, let me take you back to a simpler time: the year 2005.
I had just co-founded Writely — “The Web Word Processor!” — and usage was skyrocketing. We ran the whole thing on four leased servers in Texas. It was the clunkiest setup you’d ever seen, but there were few moving parts and it wasn’t much trouble to manage.
Within a year, we were acquired by Google, merged with a spreadsheet app, renamed “Google Docs”, and relaunched on Google infrastructure. The new system was far more scalable, but also far more complex. We depended on a slew of independent services: load balancing, data storage, user identity, email, spell checking, and more. (more…)
This is a post I’ve been looking forward to writing. We’re entering a new stage at Scalyr, and we’re looking for a few strong engineers — frontend, backend, and devops — to join us as we reinvent system monitoring and log analysis from the ground up, and bring Google Search levels of power and responsiveness to operations visibility.
Here’s why this matters to you: we have a small, tight team (lots of room for personal growth), traction, plenty of runway, a low-stress culture, and meaty problems to tackle. Want to be part of an awesome founding team (and draw a real salary while you’re at it)? We’re aiming high, rethinking everything from how to manage huge data sets to how engineers interact with their tools.
Sure, you’re doing fine in your current job. But if you love building and using great tools, you can do better than “fine”.
If you’d like to have a low-pressure chat about what we’re up to, check out Scalyr Careers to learn more about us, then drop me a line at firstname.lastname@example.org.
Being “on call” is often the most dreaded part of server operations. In the immortal words of Devops Borat, “Devops is intersection of lover of cloud and hater of wake up at 3 in morning.” Building and operating sophisticated systems can be a lot of fun, but it comes with a dark side: being jarred out of a sound sleep by the news that your site is down, sometimes in a new and mysterious way. Keeping your servers stable around the clock tends to clash with a sane work schedule.
At Scalyr, we work hard to combat this. Our product is a server monitoring and log analysis service. It’s internally complex, running on about 20 servers, with mostly custom-built software. But in the last 12 months, with little after-hours attention, we’ve had less than one hour of downtime. There were only 11 pager incidents outside the hours of 9:00 AM to 5:00 PM, and most were quickly identifiable as false alarms, dismissible in less time than it would take for dinner to get cold.
In this article, I explain how we keep things running on a mostly 9-to-5 schedule. (more…)
When your problem is impossible, redefine the problem.
In an earlier article, I described how Scalyr searches logs at tens of gigabytes per second using brute force. This works great for its intended purpose: enabling exploratory analysis of all your logs in realtime. However, we realized early on that some features of Scalyr, such as custom dashboards built on data parsed from server logs, would require searching terabytes per second. Gulp!
In this article, I’ll describe how we solved the problem, using two helpful principles for systems design:
- Common user actions must lead to simple server actions. Infrequent user actions can lead to complex server actions.
- Find a data structure that makes your key operation simple. Then design your system around that data structure.
Often, a seemingly impossible challenge becomes tractable if you can reframe it. These principles can help you find an appropriate reframing for systems engineering problems. (more…)
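To make the two principles above concrete, here is a minimal sketch of one possible reframing. It is not taken from Scalyr's implementation: the `BucketedCounter` class, the 60-second bucket width, and the sample log lines are all assumptions made for illustration. The expensive step (scanning history) runs once, when a dashboard query is defined; the common step (viewing the dashboard) only reads precomputed counts.

```python
from collections import defaultdict

BUCKET_SECONDS = 60  # assumed bucket width, chosen only for this sketch


class BucketedCounter:
    """Hypothetical per-bucket match counter for one dashboard query."""

    def __init__(self, predicate):
        self.predicate = predicate       # decides whether a log line matches
        self.buckets = defaultdict(int)  # bucket start time -> match count

    def backfill(self, events):
        """Infrequent and expensive: scan historical (timestamp, line) pairs once."""
        for timestamp, line in events:
            self.ingest(timestamp, line)

    def ingest(self, timestamp, line):
        """Per-event update: one predicate check and one increment."""
        if self.predicate(line):
            start = int(timestamp) // BUCKET_SECONDS * BUCKET_SECONDS
            self.buckets[start] += 1

    def read(self, start, end):
        """Common and cheap: return (bucket_start, count) pairs covering [start, end)."""
        first = start // BUCKET_SECONDS * BUCKET_SECONDS
        return [(t, self.buckets.get(t, 0)) for t in range(first, end, BUCKET_SECONDS)]


# Made-up sample data: (seconds since some epoch, log line).
history = [(0, "GET / 200"), (30, "GET /x 500"), (65, "GET /y 500"), (130, "GET / 200")]

counter = BucketedCounter(lambda line: " 500" in line)
counter.backfill(history)             # done once, when the dashboard is created
counter.ingest(170, "GET /z 500")     # done as each new event arrives
series = counter.read(0, 180)         # done on every dashboard view
print(series)
```

The data structure here (a map from bucket start time to a count) is what makes the read path trivial. Once that choice is made, the rest of the design follows from keeping the map up to date.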
TL;DR: Four years ago, I left Google with the idea for a new kind of server monitoring tool. The idea was to combine traditionally separate functions such as log exploration, log aggregation and analysis, metrics gathering, alerting, and dashboard generation into a single service. One tenet was that the service should be fast, giving ops teams a lightweight, interactive, “fun” experience. This would require analyzing multi-gigabyte data sets at subsecond speeds, and doing it on a budget. Existing log management tools were often slow and clunky, so we were facing a challenge, but the good kind — an opportunity to deliver a new user experience through solid engineering.
This article describes how we met that challenge using an “old school”, brute-force approach, by eliminating layers and avoiding complex data structures. There are lessons here that you can apply to your own engineering challenges. (more…)
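As a rough illustration of what "brute force" can mean here, the sketch below counts matching log lines with a plain linear scan over a byte buffer, with no index and no per-line objects. It is an assumption-laden toy, not Scalyr's code: the function name `count_matching_lines` and the sample data are hypothetical, and it assumes the search term contains no newline.

```python
def count_matching_lines(data: bytes, needle: bytes) -> int:
    """Count lines in `data` that contain `needle`, using only a linear scan."""
    if not needle:
        raise ValueError("needle must be non-empty")
    count = 0
    pos = data.find(needle)
    while pos != -1:
        count += 1
        # Jump to the end of the current line so each line is counted once.
        line_end = data.find(b"\n", pos)
        if line_end == -1:
            break  # the match was on the last, unterminated line
        pos = data.find(needle, line_end + 1)
    return count


# Made-up sample log data; the second line contains the term twice.
logs = b"GET / 200\nGET /a 500 500\nGET /b 500\nGET /c 200\n"
matches = count_matching_lines(logs, b" 500")
print(matches)
```

The scan touches the raw bytes directly and delegates the inner loop to `bytes.find`, so there is no tokenizing or parsing layer between the data and the search.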
As engineers, we take pride in building tools for ourselves and others like us. Since our launch, we’ve been continually expanding and improving the capabilities of the Scalyr devops tools in response to our customers. However, our documentation sometimes lagged behind — until now. (more…)
Our goal at Scalyr is to provide sysadmins and DevOps engineers with a single log monitoring tool that replaces the hodgepodge of tools they were previously using. We’ve come a long way in doing that. Today, Scalyr is a unified, cloud-based tool that lets you aggregate multiple server logs, monitor and analyze them, set custom log alerts, and create custom dashboards. Still, we work hard to continue improving and making it an even more useful tool for you, and we listen closely to users’ feedback. (more…)