November 2016 Product Updates: All in on the New UI…

Our new UI is now the default

Scalyr's New UI

We flipped the switch, and our new UI is now the default. The original UI will still be available for a while, but to access it, you’ll have to click on the settings menu (upper-right corner of the window) and choose “Flip To Classic UI”.

If there’s a reason you prefer our original UI, please let us know! We’re working hard to make this an easy transition.

The log timeline chart now includes both line and bar graphs

Scalyr's Hybrid Timeline Chart

By popular demand, we’ve superimposed the old “events per second” line graph over the bar chart. This allows you to see fine-grained spikes and changes in message frequency. Note that the Y axis is scaled for the bar chart only.
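
If you’re curious how this kind of hybrid chart fits together, here’s a rough matplotlib sketch of the idea, with made-up data: the volume bars drive the visible Y axis, while the frequency line rides on a hidden scale of its own.

```python
# Illustrative only: log volume as bars, events-per-second as a
# superimposed line on a hidden second scale, so the visible Y axis
# reflects the bar chart alone.
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
minutes = np.arange(60)
volume = rng.poisson(lam=400, size=60)         # log volume per minute (made up)
events_per_sec = rng.poisson(lam=50, size=60)  # message frequency (made up)

fig, ax = plt.subplots()
ax.bar(minutes, volume, color="lightsteelblue")
ax.set_ylabel("log volume")        # the Y axis is scaled for the bars only

line_ax = ax.twinx()               # separate, unlabeled scale for the line
line_ax.plot(minutes, events_per_sec, color="navy", linewidth=1)
line_ax.set_yticks([])             # hide the line's scale entirely

plt.show()
```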

Read More

The Myth of the Root Cause: How Complex Web Systems Fail

Editor’s note: here at Scalyr, robust systems are a topic near and dear to our heart, so we were happy to have the chance to work closely with Mathias on this piece. It’s based on the two-part series “How Complex Web Systems Fail” originally published on his Production Ready mailing list.


Distributed web-based systems are inherently complex. They’re composed of many moving parts — web servers, databases, load balancers, CDNs, and many more — working together to form an intricate whole. This complexity inevitably leads to failure. Understanding how this failure happens (and how we can prevent it) is at the core of our job as operations engineers.

In his influential paper How Complex Systems Fail, Richard Cook shares 18 sharp observations on the nature of failure in complex medical systems. The nice thing about these observations is that most of them hold true for complex systems in general. Our intuitive notions of cause-and-effect, where each outage is attributable to a direct root cause, are a poor fit to the reality of modern systems.

In this post, I’ll translate Cook’s insights into the context of our beloved web systems and explore how they fail, why they fail, how you can prepare for outages, and how you can prevent similar failures from happening in the future…

Read More

So You’ve Been Paged: A Guide to Incident Response (For Those Who Hate Being Paged)

One of the inevitable joys of working in DevOps is “the page” — that dreaded notification from your alerting system that something has gone terribly wrong…and you’re the lucky person who gets to fix it.

Here at Scalyr, we’ve got a few decades of collective DevOps experience and we’ve all been on the receiving end of a page. Even though we do our best to avoid being woken up, it happens.

In this post, we’re going to put some of that experience to use and show you how to handle an incident the right way. You’ll learn not only how to fix the immediate problem, but how to grow from the experience and set your team up for smooth sailing in the future.


Read More

September 2016 Product Updates: The New UI Gets Even More Powerful…

We’ve added a lot of features to the New UI this month, and we’re excited to share them with you. If you haven’t given it a try, now is a great time to join the many Scalyr users who have already switched over.

New UI to become default on October 10th


Last year we began a ground-up rewrite of the Scalyr interface. Our goal was to preserve the speed and power that made people love Scalyr, while making it easier to learn, quicker to use, and – let’s face it – easier on the eyes.

We’ve been testing this new UI (appropriately dubbed… the New UI) over the past few months and steadily making improvements along the way. We’re now ready to jump into the deeper end of the pool:

We’ll be flipping the switch to make the New UI the default choice starting on October 10th. (Note – The original UI will still be available for a while, but you’ll have to enable it on a per-session basis.)

If there’s a reason you prefer our original UI or are hesitant to switch, please let us know! We’re working hard to make this an easy transition.


Try the New UI Now


1) Distributions


A screenshot of the new Distributions view

Distributions (previously known as “Histograms”) show you a breakdown of the values in a numeric field by frequency. A graph only provides basic statistics, such as the average or 90th percentile; the distribution view shows how the values break down in detail.
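
Here’s a tiny illustration of the difference, using made-up response-time data: the summary statistics collapse everything to a couple of numbers, while the bucket-by-bucket frequency breakdown is what a distribution shows.

```python
import numpy as np

# Made-up response times, skewed the way real latencies usually are.
latencies_ms = np.random.default_rng(1).lognormal(mean=3.0, sigma=0.7, size=10_000)

# What a graph's summary statistics give you: two numbers.
print(f"average: {latencies_ms.mean():.1f} ms")
print(f"90th percentile: {np.percentile(latencies_ms, 90):.1f} ms")

# Frequency breakdown by bucket: the essence of a distribution view.
counts, edges = np.histogram(latencies_ms, bins=10)
for count, lo, hi in zip(counts, edges, edges[1:]):
    print(f"{lo:7.1f}-{hi:7.1f} ms  {'#' * int(50 * count / counts.max())}")
```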

Read More

August 2016 Product Updates: New UI Beta, Breakdown Graphs, SNMP support, and more…

Here at Scalyr we’ve been hard at work on some major product improvements, and we’re pleased to share the fruits of those labors.

1) New UI Beta: Logs & Graphs


A screenshot of Scalyr's new UI


Last year we began a ground-up rewrite of the Scalyr interface. Our goal was to preserve the speed and power that made people love Scalyr while making it easier to learn, quicker to use, and – let’s face it – easier on the eyes.

Read More

Irreversible Failures: Lessons from the DynamoDB Outage

Summary: Most server problems, once identified, can be quickly solved with a simple compensating action—for instance, rolling back the bad code you just pushed. The worst outages are those where reversing the cause doesn’t undo the effect. Fortunately, this type of issue usually generates some visible markers before developing into a crisis. In this post, I’ll talk about how you can avoid a lot of operational grief by watching for those markers.
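
To give a taste of what “watching for the markers” can look like, here’s a hypothetical sketch (the data, threshold, and helper function are all invented for illustration): rather than alerting when a disk is already full, alert when its growth trend says it soon will be.

```python
def hours_until_full(samples, capacity):
    """Estimate time to exhaustion from (hour, usage) samples, using the
    average growth rate between the first and last reading."""
    (t0, u0), (t1, u1) = samples[0], samples[-1]
    rate = (u1 - u0) / (t1 - t0)      # usage growth per hour
    if rate <= 0:
        return float("inf")           # not growing; no crisis brewing
    return (capacity - u1) / rate

# Four made-up disk readings (hour, GB used) on a 1000 GB volume.
disk_samples = [(0, 610), (6, 640), (12, 672), (18, 700)]
remaining = hours_until_full(disk_samples, capacity=1000)
if remaining < 72:
    print(f"page someone now: disk full in ~{remaining:.0f} hours")
```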
Read More

Scalyr Announces $2.1M Seed Round To Reinvent System Visibility

A few years ago, we set out to rebuild server and log monitoring from the ground up. Today marks a new and exciting chapter in the story. To tell it properly, let me take you back to a simpler time: the year 2005.

I had just co-founded Writely — “The Web Word Processor!” — and usage was skyrocketing. We ran the whole thing on four leased servers in Texas. It was the clunkiest setup you’d ever seen, but there were few moving parts and it wasn’t much trouble to manage.

Within a year, we were acquired by Google, merged with a spreadsheet app, renamed “Google Docs”, and relaunched on Google infrastructure. The new system was infinitely more scalable, but quite complex. We depended on a slew of independent services: load balancing, data storage, user identity, email, spell checking, and more.

Read More

Scalyr is Hiring!

This is a post I’ve been looking forward to writing. We’re entering a new stage at Scalyr, and we’re looking for a few strong engineers — frontend, backend, and devops — to join us as we reinvent system monitoring and log analysis from the ground up, and bring Google Search levels of power and responsiveness to operations visibility.

Here’s why this matters to you: we have a small, tight team (lots of room for personal growth), traction, plenty of runway, a low-stress culture, and meaty problems to tackle. Want to be part of an awesome founding team (and draw a real salary while you’re at it)? We’re aiming high, rethinking everything from how to manage huge data sets to how engineers interact with their tools.

Sure, you’re doing fine in your current job. But if you love building and using great tools, you can do better than “fine”.

If you’d like to have a low-pressure chat about what we’re up to, check out Scalyr Careers to learn more about us, then drop me a line at steve@scalyr.com.

99.99% uptime on a 9-to-5 schedule

Running a 24/7 Log Monitoring Service

Being “on call” is often the most dreaded part of server operations. In the immortal words of Devops Borat, “Devops is intersection of lover of cloud and hater of wake up at 3 in morning.” Building and operating sophisticated systems is often a lot of fun, but it comes with a dark side: being jarred out of a sound sleep by the news that your site is down — often in some new and mysterious way. Keeping your servers stable around the clock often clashes with a sane work schedule.

At Scalyr, we work hard to combat this. Our product is a server monitoring and log analysis service. It’s internally complex, running on about 20 servers, with mostly custom-built software. But in the last 12 months, with little after-hours attention, we’ve had less than one hour of downtime. There were only 11 pager incidents before 9:00 AM / after 5:00 PM, and most were quickly identifiable as false alarms, dismissible in less time than it would take for dinner to get cold.

In this article, I explain how we keep things running on a mostly 9-to-5 schedule.

Read More

Impossible Engineering Problems Often Aren’t

When your problem is impossible, redefine the problem.

In an earlier article, I described how Scalyr searches logs at tens of gigabytes per second using brute force. This works great for its intended purpose: enabling exploratory analysis of all your logs in real time. However, we realized early on that some features of Scalyr, such as custom dashboards built on data parsed from server logs, would require searching terabytes per second. Gulp!

In this article, I’ll describe how we solved the problem, using two helpful principles for systems design:

  • Common user actions must lead to simple server actions. Infrequent user actions can lead to complex server actions.
  • Find a data structure that makes your key operation simple. Then design your system around that data structure.

Often, a seemingly impossible challenge becomes tractable if you can reframe it. These principles can help you find an appropriate reframing for systems engineering problems.
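
To make that concrete, here’s a minimal sketch in Python. It illustrates the principles rather than describing Scalyr’s actual design: the common action (ingesting a log event) is a single cheap bucket increment, and the dashboard’s “events over time” query becomes a simple read of precomputed buckets instead of a search over raw logs.

```python
from collections import defaultdict

class PerMinuteCounts:
    """A data structure chosen so the key operation (an 'events over
    time' dashboard query) is trivially simple."""

    def __init__(self):
        self.buckets = defaultdict(int)   # minute index -> event count

    def ingest(self, event_ts_seconds):
        # Common user action (a log line arriving) -> simple server action.
        self.buckets[event_ts_seconds // 60] += 1

    def query(self, start_minute, end_minute):
        # The dashboard reads ready-made buckets; no raw-log search needed.
        return [self.buckets[m] for m in range(start_minute, end_minute)]

counts = PerMinuteCounts()
for ts in (5, 42, 61, 65, 178):           # toy event timestamps, in seconds
    counts.ingest(ts)
print(counts.query(0, 3))                 # -> [2, 2, 1]
```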

Read More