Application Logging Practices You Should Adopt

When talking about logging practices, you could segment the topic into three main areas of concern. Those would include, in chronological order, application logging, log aggregation, and log consumption. In plain English, you set up logging in your source code, collect the results in production, and then monitor or search those results as needed.

For aggregation and consumption, you will almost certainly rely on third-party tools (an advisable course of action). But the first concern, application logging, is really up to you as an application developer. Sure, logging frameworks exist. But, ultimately, you’re writing the code that uses them, and you’re dictating what goes into the log files.

And not all logging code is created equal. Your logging code can make everyone’s lives easier, or it can really make a mess. Let’s take a look at some techniques for avoiding the mess situation.

What Do You Mean by Application Logging Practices?

Before diving into the practices themselves, let’s briefly clarify what application logging practices are. In the context of this post, I’m talking specifically about things that you do in your application’s source code.

So this means how you go about using the logging API. How do you fit it into your source code? How do you call it? And what kinds of messages do you log?
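
To make those questions concrete, here’s a minimal sketch that uses Python’s standard logging module purely as an illustration; the function and message names are hypothetical, and the same ideas carry over to whichever framework your stack uses.

import logging

# Configure once, at application startup: where log output goes and how it is formatted.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)

# Each module asks for its own named logger rather than writing to the root logger directly.
logger = logging.getLogger(__name__)

def process_order(order_id, items):
    # Hypothetical business function, here only to show the logging calls in context.
    logger.info("Processing order %s with %d item(s)", order_id, len(items))
    try:
        total = sum(item["price"] for item in items)
    except KeyError:
        # Log the failure with enough context to diagnose it later, then let it propagate.
        logger.exception("Order %s contained an item with no price", order_id)
        raise
    logger.debug("Computed total %.2f for order %s", total, order_id)
    return total

The specifics matter less than the habits the sketch shows: configure output in one place, give each module its own named logger, and log messages with enough context (IDs, counts, stack traces) to still be useful once they reach the aggregation and consumption stages.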

Read More

Introduction to Continuous Integration Tools

In a sense, you could call continuous integration the lifeblood of modern software development. So it stands to reason that you’d want to avail yourself of continuous integration tools. But before we get to the continuous integration tools themselves, let me explain why I just made the claim that I did. How do I justify calling continuous integration the lifeblood of software development? Well, the practice of continuous integration has given rise to modern standards for how we collaborate around and deploy software.

What Is Continuous Integration?

If you don’t know what continuous integration is, don’t worry. You’re not alone. For plenty of people, it’s a vague industry buzzword. Even some people who think they know what it means have the definition a little muddled. The reason? Well, the way the industry defines it is a little muddled.

The industry conflates continuous integration itself with the continuous integration tools that enable it. To understand this, imagine if you asked someone to define software testing, and they responded with, “oh, that’s Selenium.” “No,” you’d say, “that’s a tool that helps you with software testing — it isn’t software testing itself.” So it goes with continuous integration.

Continuous integration is conceptually simple. It’s a practice. Specifically, it’s the practice of a development team integrating its code into a shared codebase quite regularly. Several times per day, at least.

Merge Parties: The Continuous Integration Origin Story

To understand why this matters, let me explain something about the bad old days. Twenty years ago, teams used rudimentary source control tools, if any at all. Concepts like branching and merging didn’t really exist in any meaningful sense.

So, here’s how software development would go. First, everyone would start with the same basic codebase. Then, management would assign features to developers. From there, each developer would go back to their desk, code for months, and then declare their features done. And, finally, at the end, you’d have a merge party.

What’s a merge party? It’s what happens when a bunch of software developers all slam months’ worth of changes into the codebase at the same time. You’re probably wondering why anyone would call this a party when it sounds awful. Well, it got this name because it was so awful and time-consuming that teams made it into an event. They’d stay into the evenings, ordering pizza and soda, and work on this for days at a time. The “party” was partially ironic and partially a weird form of team building.

But whatever you called it, party favors or not, it was really, really inefficient. So some forward-thinking teams started to address the problem.

“What if,” they wondered, “instead of doing this all at once in the end, with a big bang, we did it much sooner?” This line of thinking led to an important conclusion. If you integrated changes all of the time, it added a little friction each day, but it saved monumental pain at the end. And it made the whole process more efficient.

Read More

In DevOps Incident Response, Plans Are Worthless, But Planning Is Everything

So said President Dwight D. “Ike” Eisenhower (more or less). His battles were fought in the trenches, not the technology stacks—but for DevOps teams, the principle holds. No plan survives contact with the enemy.

A plan is a set of instructions you can follow when you understand what needs to be done. Handy when the enemy is one you know, and can plan for. But the Enemy You Know isn’t what (literally) keeps DevOps engineers up at night.

Good DevOps teams competently respond to incidents, outages, and just plain weird stuff happening in their technology stack. Great teams go further—they see around corners, developing the instincts and skills to prepare for the unexpected. At Scalyr we start by reducing the risk of the dreaded 3 am call as much as possible. But we’d be foolish to stop there.

Read More

The Myth of the Root Cause: How Complex Web Systems Fail

Editor’s note: here at Scalyr, robust systems are a topic near and dear to our heart, so we were happy to have the chance to work closely with Mathias on this piece. It’s based on the two-part series “How Complex Web Systems Fail” originally published on his Production Ready mailing list.

Distributed web-based systems are inherently complex. They’re composed of many moving parts — web servers, databases, load balancers, CDNs, and many more — working together to form an intricate whole. This complexity inevitably leads to failure. Understanding how this failure happens (and how we can prevent it) is at the core of our job as operations engineers.

In his influential paper How Complex Systems Fail, Richard Cook shares 18 sharp observations on the nature of failure in complex medical systems. The nice thing about these observations is that most of them hold true for complex systems in general. Our intuitive notions of cause-and-effect, where each outage is attributable to a direct root cause, are a poor fit to the reality of modern systems.

In this post, I’ll translate Cook’s insights into the context of our beloved web systems and explore how they fail, why they fail, how you can prepare for outages, and how you can prevent similar failures from happening in the future…

Read More

So You’ve Been Paged: A Guide to Incident Response (For Those Who Hate Being Paged)

One of the inevitable joys of working in DevOps is “the page” — that dreaded notification from your alerting system that something has gone terribly wrong…and you’re the lucky person who gets to fix it.

Here at Scalyr, we’ve got a few decades of collective DevOps experience and we’ve all been on the receiving end of a page. Even though we do our best to avoid being woken up, it happens.

In this post, we’re going to put some of that experience to use and show you how to handle an incident the right way. You’ll learn not only how to fix the immediate problem, but how to grow from the experience and set your team up for smooth sailing in the future.

Read More