Building a Sustainable Startup

Although Scalyr has been around since 2011, it feels like we are really just getting started.

So many tech startups come roaring out of the gate, pursuing growth at all costs. Sometimes this leads to spectacular success, but much more often it leads to burnout and retrenchment, if not outright failure. Many promising startups have failed due to early over-reach.

At Scalyr, we’re taking a different approach. We spent over three years with a small team, literally above my garage, taking our time to build the right product in the right way. Only after we’d built a differentiated product that our early users loved did we set out to grow the team and the business.

We have been absolutely blown away by the results:

  • Customer devotion: since we signed our first customer in mid-2013, we have not had a customer leave us for another solution. The sheer performance of our log management service has been a pronounced and sustained differentiator.
  • Word of mouth: once adopted, Scalyr tends to spread within an organization. We have had multiple instances of customers beginning at five-figure annual revenue and growing to seven figures.
  • Scalability: through multiple orders of magnitude of growth, we’ve been able to maintain the performance and functionality that make Scalyr special.

We are building a real business with real customers, and I’m excited to share some of our recent progress.

Origin story

Rewind to 2006. Amazon EC2 and S3 were still under wraps, Facebook was only available to .edu addresses… and Google had just acquired my startup, Writely – soon to be known as Google Docs. Pretty soon, I was leading a project to build a new storage infrastructure for applications such as Docs, Sheets, Drive, and Picasa. Google has a strong culture of internal tool development, and we soon found ourselves using 17 different operational visibility tools to maintain a reliable service. Seventeen! Together, they provided a lot of functionality, but juggling that many tools was a bit of a nightmare.

It was clear that there was a lot of room for improvement. Around the industry, it’s more common to see teams using four or five visibility tools, rather than 17. But everyone suffers from too many tools, too little insight, and too much time spent investigating issues. In 2011, after leaving Google, I started Scalyr to create a better solution. Our ultimate goal is to revolutionize operational visibility, making it easy to understand the behavior of modern, complex cloud stacks.

The first step on this journey is our log management service. Logs provide the most detailed view of server and application behavior and are a critical piece of the operational puzzle. But existing log management tools were so clunky and slow that people avoided using them. Most of these tools are built on traditional keyword indexes, a technology originally designed to search books. Rethinking the problem from first principles, we built a profoundly more efficient solution, providing blazing-fast search over terabyte-scale aggregated logs.

The early response was beyond what any of us thought possible. We raised $2M in seed funding in 2015 to jumpstart the business.

Building a Sustainable Business

We have a long way to go to fulfill our long-term vision. To accomplish that, we have to build a sustainable business with solid fundamentals. We’ve been thoughtful about our growth to date, building the team in proportion to revenue. In fact, we’ve been hovering around breakeven – some months even profitable – for the past several quarters.

Sustainable growth requires a delicate balance: be aggressive enough to seize opportunities, but conservative enough to maintain company culture and healthy finances. Combining growth with healthy, sustainable practices requires more than simply pacing yourself. It requires efficiency. With VC cash burning a hole in your pocket, it’s tempting to throw money at all your problems; but then you’re quickly looking for your next round, and you’ve fallen off the sustainable path. So we’re always looking for ways to become more efficient.

It helps tremendously that we have a sustainable product differentiator – performance – that’s linked to a fundamental technological advantage.

Powering the next stage

With solid revenue, delighted customers, and a clear path forward, we’re limited only by the speed at which we can execute. So I’m excited to share that we have raised a $20M Series A led by Shasta Ventures, with participation from Heroic Ventures, Susa Ventures, and Bloomberg Beta – bringing our total amount raised to $28M.

By the way, we’re hiring 🙂

This is the 6th startup I’ve [co]founded. People sometimes ask which one is my favorite. Writely was certainly the splashiest. Spectre was pretty cool (and was the last time I got to hack assembly language). But I can say without hesitation, Scalyr is the most satisfying, rewarding project I’ve been lucky enough to work on. I truly believe this is going to be one of those companies where people will look back and say “I wish I’d been there when…” – well, right now is “when”.

There couldn’t be a more exciting time to join! We’re hiring across all teams – especially engineering, sales, marketing, and customer success. Take a look at our Careers Page and become part of the journey.

Support Driven Development: Listen now so you don’t hear it later

Here at Scalyr, we’re big fans of Complaint-Driven Development, which I’ll summarize as “focus engineering effort on fixing the things users actually complain about.” We especially focus on issues that generate support requests, with such success that, as CEO, I’m still able to personally handle the majority of frontline support – even as we head toward eight-digit annual revenue.

An important consideration is that support requests cost money even if they aren’t your (product’s) fault. In this post, I’ll explore five common sources of support requests relating to the first piece of Scalyr software most users touch – our log collection agent – and how we’ve sometimes had to think outside the box to address them. None of these were bugs, exactly. (We’ve had those as well, but you don’t need to read a blog post to know it’s a good idea to fix bugs.)

Arguably, none of these issues were “our fault.” But they generated a significant fraction of our support tickets. By eliminating them, we’ve reduced support costs significantly. Even more important, we’ve increased the probability that a user’s first experience with Scalyr is positive, especially for those users (a majority!) who will bounce off of a new product at the first sign of trouble, without bothering to ask for help.

Irreversible Failures: Lessons from the DynamoDB Outage

Summary: Most server problems, once identified, can be quickly solved with a simple compensating action—for instance, rolling back the bad code you just pushed. The worst outages are those where reversing the cause doesn’t undo the effect. Fortunately, this type of issue usually generates some visible markers before developing into a crisis. In this post, I’ll talk about how you can avoid a lot of operational grief by watching for those markers.

Scalyr Announces $2.1M Seed Round To Reinvent System Visibility

A few years ago, we set out to rebuild server and log monitoring from the ground up. Today marks a new and exciting chapter in the story. To tell it properly, let me take you back to a simpler time: the year 2005.

I had just co-founded Writely — “The Web Word Processor!” — and usage was skyrocketing. We ran the whole thing on four leased servers in Texas. It was the clunkiest setup you’d ever seen, but there were few moving parts and it wasn’t much trouble to manage.

Within a year, we were acquired by Google, merged with a spreadsheet app, renamed “Google Docs”, and relaunched on Google infrastructure. The new system was infinitely more scalable, but quite complex. We depended on a slew of independent services: load balancing, data storage, user identity, email, spell checking, and more.

Scalyr is Hiring!

This is a post I’ve been looking forward to writing. We’re entering a new stage at Scalyr, and we’re looking for a few strong engineers — frontend, backend, and devops — to join us as we reinvent system monitoring and log analysis from the ground up, and bring Google Search levels of power and responsiveness to operations visibility.

Here’s why this matters to you: we have a small, tight team (lots of room for personal growth), traction, plenty of runway, a low-stress culture, and meaty problems to tackle. Want to be part of an awesome founding team (and draw a real salary while you’re at it)? We’re aiming high, rethinking everything from how to manage huge data sets to how engineers interact with their tools.

Sure, you’re doing fine in your current job. But if you love building and using great tools, you can do better than “fine”.

If you’d like to have a low-pressure chat about what we’re up to, check out Scalyr Careers to learn more about us, then drop me a line at steve@scalyr.com.

99.99% uptime on a 9-to-5 schedule

Running a 24/7 Log Monitoring Service

Being “on call” is often the most dreaded part of server operations. In the immortal words of Devops Borat, “Devops is intersection of lover of cloud and hater of wake up at 3 in morning.” Building and operating sophisticated systems is often a lot of fun, but it comes with a dark side: being jarred out of a sound sleep by the news that your site is down — often in some new and mysterious way. Keeping your servers stable around the clock often clashes with a sane work schedule.

At Scalyr, we work hard to combat this. Our product is a server monitoring and log analysis service. It’s internally complex, running on about 20 servers, with mostly custom-built software. But in the last 12 months, with little after-hours attention, we’ve had less than one hour of downtime. There were only 11 pager incidents before 9:00 AM / after 5:00 PM, and most were quickly identifiable as false alarms, dismissible in less time than it would take for dinner to get cold.
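
The arithmetic behind the headline figure can be checked with a short sketch. The function name and the 365-day year are assumptions made for illustration; the point is only that a 99.99% target leaves a little under an hour of downtime per year, which is in line with the "less than one hour" above.

```python
# Downtime permitted by an availability target over a given period.
# Assumption for illustration: a 365-day (non-leap) year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_budget_minutes(availability: float,
                            period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the minutes of downtime allowed by an availability fraction."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - availability) * period_minutes


budget = downtime_budget_minutes(0.9999)
print(f"{budget:.1f} minutes per year")  # prints "52.6 minutes per year"
```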

In this article, I explain how we keep things running on a mostly 9-to-5 schedule.

Impossible Engineering Problems Often Aren’t

When your problem is impossible, redefine the problem.

In an earlier article, I described how Scalyr searches logs at tens of gigabytes per second using brute force. This works great for its intended purpose: enabling exploratory analysis of all your logs in realtime. However, we realized early on that some features of Scalyr – such as custom dashboards built on data parsed from server logs – would require searching terabytes per second. Gulp!

In this article, I’ll describe how we solved the problem, using two helpful principles for systems design:

  • Common user actions must lead to simple server actions. Infrequent user actions can lead to complex server actions.
  • Find a data structure that makes your key operation simple. Then design your system around that data structure.

Often, a seemingly impossible challenge becomes tractable if you can reframe it. These principles can help you find an appropriate reframing for systems engineering problems.
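
As one hypothetical illustration of these principles (this excerpt does not describe the data structure Scalyr actually uses), consider a dashboard that plots how many log events matched a filter per minute. If matching events are counted into fixed-width time buckets as they arrive, then rendering the dashboard, the common action, becomes a handful of lookups rather than a fresh scan of the raw logs. The class and method names below are invented for this sketch.

```python
import math
from collections import defaultdict
from typing import Dict, List


class BucketedCounter:
    """Counts matching events per fixed-width time bucket (hypothetical sketch)."""

    def __init__(self, bucket_seconds: int = 60) -> None:
        self.bucket_seconds = bucket_seconds
        self._counts: Dict[int, int] = defaultdict(int)

    def record(self, timestamp: float) -> None:
        """Ingest path: one dictionary increment per matching event."""
        self._counts[int(timestamp // self.bucket_seconds)] += 1

    def series(self, start: float, end: float) -> List[int]:
        """Dashboard path: one precomputed count per bucket overlapping [start, end)."""
        first = int(start // self.bucket_seconds)
        stop = math.ceil(end / self.bucket_seconds)  # exclusive bucket index
        return [self._counts.get(bucket, 0) for bucket in range(first, stop)]


counter = BucketedCounter(bucket_seconds=60)
for ts in (5, 42, 61, 130, 131, 179):  # made-up event timestamps, in seconds
    counter.record(ts)

print(counter.series(0, 180))  # prints [2, 1, 3]
```

The trade-off in this sketch is that the filter must be known before the data arrives, which suits a saved dashboard but not ad hoc exploration.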

Searching 1TB/sec: Systems Engineering Before Algorithms

TL;DR: Four years ago, I left Google with the idea for a new kind of server monitoring tool. The idea was to combine traditionally separate functions such as log exploration, log aggregation and analysis, metrics gathering, alerting, and dashboard generation into a single service. One tenet was that the service should be fast, giving ops teams a lightweight, interactive, “fun” experience. This would require analyzing multi-gigabyte data sets at subsecond speeds, and doing it on a budget. Existing log management tools were often slow and clunky, so we were facing a challenge, but the good kind — an opportunity to deliver a new user experience through solid engineering.

This article describes how we met that challenge using an “old school”, brute-force approach, by eliminating layers and avoiding complex data structures. There are lessons here that you can apply to your own engineering challenges.
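
To make "brute force" concrete, here is a minimal sketch of the general idea: scan the raw log bytes linearly for a substring, with no index to build or maintain. It is not Scalyr's implementation, and the function name and sample data are invented for illustration.

```python
from typing import Iterator


def brute_force_search(log_data: bytes, needle: bytes) -> Iterator[bytes]:
    """Yield each line of log_data that contains needle, using a linear scan."""
    if not needle or b"\n" in needle:
        raise ValueError("needle must be non-empty and must not contain a newline")
    start = 0
    while True:
        hit = log_data.find(needle, start)
        if hit == -1:
            return
        # Expand the hit to the boundaries of the enclosing line.
        line_start = log_data.rfind(b"\n", 0, hit) + 1
        line_end = log_data.find(b"\n", hit)
        if line_end == -1:
            line_end = len(log_data)
        yield log_data[line_start:line_end]
        # Resume after this line so each matching line is reported once.
        start = line_end + 1


# Made-up sample data for illustration.
sample_logs = (
    b"GET /home 200\n"
    b"GET /api 500\n"
    b"POST /login 200\n"
    b"GET /api 500 retry\n"
)
matches = list(brute_force_search(sample_logs, b" 500"))
print(matches)
```

On the sample data, the scan yields the two lines containing " 500". The appeal of this approach is that the work is a simple pass over contiguous bytes, with no auxiliary data structure to keep in sync with the logs.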
