Growing a High-Performance DevOps Culture

Culture is one of those things where we all know what it is but can’t explain it. Well, according to Wikipedia, culture is “the social behavior and norms found in human societies.” But in simple words, it’s all about people: how they interact, how they behave, how they talk, and what they practice. And culture is the foundation of a successful implementation of DevOps.

John Willis, an established speaker and writer on the subject of DevOps, coined the term CAMS (culture, automation, measurement, sharing) at a talk where he explained that DevOps culture is about breaking down silos. But what I find most striking about his discussion of culture, as summarized in the DevOps Dictionary, is the observation that “fostering a safe environment for innovation and productivity is a key challenge for leadership and directly opposes our tribal managerial instincts.” So the starting point for your DevOps journey is good leadership. After that, it’s just about how to grow your team to become a high-performing one.

A high-performing team in DevOps, according to recent research, is one that

  • Deploys often, meaning several times a day.
  • Delivers a change with a fast lead time (minutes) after it’s been pushed to a shared repository.
  • Has a short (again, minutes) mean time to recover (MTTR).
  • Has a small change failure rate.
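
To make those four measures concrete, here is a minimal sketch of how they might be computed from a list of deployment records. The `Deployment` record shape and the `summarize` function are hypothetical names invented for illustration; the research mentioned above does not prescribe any particular data model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Deployment:
    """One production deployment (hypothetical record shape)."""
    pushed_at: datetime      # when the change reached the shared repository
    deployed_at: datetime    # when the change went live
    failed: bool = False     # did this change cause a failure in production?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def summarize(deployments: List[Deployment], days: int) -> dict:
    """Compute the four measures over an observation window of `days` days."""
    total = len(deployments)
    failures = [d for d in deployments if d.failed]
    lead_times = [d.deployed_at - d.pushed_at for d in deployments]
    recoveries = [d.restored_at - d.deployed_at for d in failures if d.restored_at]

    def mean(deltas):
        # Average a list of timedeltas; None when there is nothing to average.
        return sum(deltas, timedelta()) / len(deltas) if deltas else None

    return {
        "deploys_per_day": total / days,
        "mean_lead_time": mean(lead_times),
        "mean_time_to_recover": mean(recoveries),
        "change_failure_rate": len(failures) / total if total else 0.0,
    }
```

Feeding it a week of records gives one number per bullet above, which is enough to see whether a team is moving toward "several times a day" and "minutes."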

So how do you grow a high-performance DevOps culture? You create a culture that will produce a team that delivers on time with confidence in a predictable manner. Here are the things that will help you get there.


Java Exceptions and How to Log Them Securely

As a security consultant, I perform assessments across a wide variety of applications. Throughout the applications I’ve tested, I’ve found it’s common for them to suffer from some form of inadequate exception handling and logging. Logging and monitoring are often-overlooked areas, and due to increased threats against web applications, they’ve been added to the OWASP Top 10 as the new number ten issue, under the name “Insufficient Logging and Monitoring.”

So what’s the problem here? Well, let’s take a look.


Getting Started Quickly with Ruby Logging

Time for us to continue with our ongoing series, in which we teach you how to get started logging quickly in a variety of programming languages. We started out the series with C#, we proceeded to cover Java, and then we wrote about Python.

So, what about tipping the scale to the side of dynamically-typed interpreted languages? That’s exactly what we’re doing today by teaching you how to get up and running with logging, using the Ruby programming language.

Today’s post will follow the basic structure that’s been used in the previous articles. It will cover

  • How to implement a very rudimentary logger.
  • A discussion on the fundamentals of logging: why bother logging, which data to log, and where to log.
  • Finally, a very simple yet realistic example of proper logging, with help from the Ruby “Logger” class.

Like the previous installments of the series, we’ll create a very simple toy app in order to demonstrate how to log. As we’ve just said, we’re going to start with a very primitive—though functional—approach, and we’ll then evolve it toward a more sophisticated and realistic solution.


But I’m a Dev, Not a DevOps!

My experience with DevOps began before I even knew there was a name for the approach, when my boss asked me for some help in operations. The company I worked for was small at that time, so I always had the opportunity to get my hands dirty in the release automation process. I knew a few things about servers and Linux, so I was up for the challenge. To my surprise, I loved it. I knew it wasn’t the classic way of doing operations by manually managing physical servers, firewalls, virtual machines, and the like. We were using a cloud vendor. This meant that to spin up a new server, it wasn’t necessary to know which buttons to click.

The cloud vendor had its own API and SDKs for several languages, so I never really felt like I stopped programming. Of course, that was just the tip of the iceberg because systems administration is not just about spinning up new servers, adding more storage, or rebooting servers. I had to take care of the architecture and decide which cloud services were needed for the job. But I was sure I could apply some development skills to operations, and I did. I created some scripts that launched a new environment from scratch, made backups, and restored databases.

Then, I found out about DevOps and all its practices. And because my background was in development, I was able to work with developers and explain in their language how they could be destroying our log files and why it was important.

So if you’re a developer new to this DevOps world, trust me. You’ll like this new way of working.


HTTP Monitor: What It Is and Why You Need It

One day, one of our main web APIs was down, and the first person who knew it was my boss. We were so worried about bringing the API back up that we never paid attention to how he was able to be one step ahead of us. There were times when we even thought he had nothing better to do than constantly refresh the web page. But the truth is that he wasn’t doing that at all. He was using an HTTP monitor that emailed him every time the API was down, slow, or unresponsive.

It was actually lucky for us that he had that monitor: it helped everyone fix things before our clients could notice. But what is an HTTP monitor, anyway? And why else would you need it?
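
To give a rough idea of what such a monitor does on each run, here is a minimal sketch of a single probe that classifies an endpoint as up, slow, or down. The function name `check`, the thresholds, and the injectable `opener` parameter are assumptions made for illustration; this is not how my boss’s monitor (or any particular product) is implemented, and the emailing part is left out.

```python
import time
import urllib.error
import urllib.request


def check(url, timeout=5.0, slow_after=2.0, opener=urllib.request.urlopen):
    """Probe `url` once and classify it as 'up', 'slow', or 'down'."""
    started = time.monotonic()
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or an HTTP error status
        # (urlopen raises HTTPError, a URLError subclass, for 4xx/5xx).
        return "down"
    elapsed = time.monotonic() - started
    if status >= 400:
        # Only reachable with a custom opener that does not raise on errors.
        return "down"
    return "slow" if elapsed > slow_after else "up"
```

A real monitor would run a probe like this on a schedule, from outside your own network, and send an alert whenever the result is anything other than "up."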



Common Ways People Destroy Their Log Files

For this article, I’m going to set up a hypothetical scenario (but based on reality) that needs logging. We’re writing an application that automates part of a steel factory. In our application, we need to calculate the temperature to which the steel must be heated. This is the responsibility of the TemperatureCalculator class.

The class is fed a lot of parameters that come from external sensors (like current temperature of the furnace, weight of the steel, chemical composition of the steel, etc.). The sensors sometimes provide invalid values, forcing us to be creative. The engineers said that, in such a case, we should use the previous value. This isn’t something that crashes our application, but we do want to log such an event.

So the team has set up a simple logging system, and the following line is appended to the log file:

An invalid value was provided. Using previous value.

Let’s explore how this well-meant log message doesn’t actually help. In fact, combined with similar messages in our log file, the log file ends up being a giant, useless mess.
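
One way to see what that message leaves out is to write the version that carries its own context. The following is only a sketch under assumptions: the `read_sensor` function, the sensor name, and the valid range are hypothetical and not taken from the scenario’s actual code. It uses Python’s standard `logging` module.

```python
import logging

logger = logging.getLogger("TemperatureCalculator")


def read_sensor(name, raw_value, previous_value, valid_range):
    """Return a usable reading, falling back to the previous one if invalid."""
    low, high = valid_range
    if raw_value is None or not (low <= raw_value <= high):
        # Record which sensor, what it sent, what was expected,
        # and which value is being used in its place.
        logger.warning(
            "Invalid value %r from sensor %s (expected %s to %s); "
            "using previous value %s",
            raw_value, name, low, high, previous_value,
        )
        return previous_value
    return raw_value
```

With those details in the message, someone reading the log can tell which sensor misbehaved and what the application did about it, without opening the code.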



Get Started Quickly With Python Logging

Picking up from the previous logging articles on how to get started logging with C# and Java, today we’ll be looking at how to get up and running quickly with logging in Python.

Even if you’ve already read the previous articles, this post is worth a read. It will cover new ground, like the basics around application logging in Python and a few other things, such as

  • Configuring the logging module.
  • What to log and why.
  • The security implications of logging.

So what are you waiting for? Keep reading, and let’s get a simple project set up to begin working with.


Sexy But Useless DevOps Trends

What’s sexy but useless? A Ferrari in a traffic jam. It’s beautiful, but all that power means nothing. When trapped in traffic, it can’t live up to its full potential.

Same with DevOps. While there are some critical DevOps functions that you absolutely need, there are some sexy but useless DevOps trends that are good to be aware of. Truth be told, there’s no recipe that will tell you how to succeed in DevOps. Everyone will have different opinions, and what worked for others might not work for you. But you can trust one thing: there are some actions that will guide you directly to frustration with DevOps.

With the amount of information out there about DevOps, you might get overwhelmed and think it’s not for you. You also might think the learning curve is too steep—that you need to change too many things before you get started. Maybe you’ll need a new team, new tools, more metrics, more time… you name it.

My advice is this: don’t get distracted by all the things people say about DevOps. The trends I’m going to talk about here, for instance, are all style and no substance.



5 Critical DevOps Practices

DevOps is like pizza. We can’t think of pizza without considering critical ingredients: dough, sauce, cheese, and your preferred choice for vegetables and proteins. Everyone likes different toppings. In my case, I can’t think about pizza without extra cheese and meat. You might choose differently, but I think we can agree there are some ingredients that are critical for this food to be called pizza. Quality and ingredients will vary, but some things will always remain true.

Well, it’s the same with DevOps practices. There are some critical practices, and you can’t think about DevOps without considering them. Everyone will have preferred choices regarding the tools and the process, but the practices remain, and each one complements the others.

Every critical DevOps practice takes time to get down, but the end result will be magnificent. So, let’s discuss what they are and how to implement them.


Real-World Applications of Increased Visibility

What can change in an organization when you increase visibility? A lot.

Previously, I wrote about how providing visibility to key information is a core enabler of high-functioning, high-speed teams. When put into practice, increasing information visibility can lead to transformative results. In this post, I’ll use a mix of Scalyr customers and others I’ve worked with in my couple of decades here in Silicon Valley to show you concrete examples where companies have realized these benefits.

Common to all of these use cases is the elimination of “middlemen” and a dramatic decrease in latency in the information retrieval process. Giving employees direct, rapid access to the information they need to make effective decisions facilitates decentralized decision-making and chips away at organizational silos. Enhancing knowledge worker productivity using this approach is not new. Harvard Business School analyzed the implications of decentralized decision-making, and GE conceptualized its path to eliminating silos more than 25 years ago. Unsurprisingly, in both cases the benefits far outweighed the costs.

Whether we’re talking about engineers or customer service specialists (and we’ll cover both), remember that Data != Information. Simply having access to data, even if it represents every event happening everywhere in your environment, isn’t enough. Care and effort must be taken to ensure that data is processed and organized to be immediately consumable by the intended audience.

As a general rule of thumb, figure that half of the work will be in gathering, storing, and calculating the raw data. The other half of the work is around the presentation and organization of information.

Engineering and SaaS Use Cases

These next examples walk through the benefits that result from giving engineers increased visibility into production environments. Similar impacts can be seen in Dev/Test environments, visibility into CI/CD pipelines, testing status, and related environments. In short, any situation with multiple teams and a potential “black box” is a candidate to reap the benefits of increased transparency.

Shortening the Product Defect Lifecycle

This is such a common—and important—use case for increased visibility that we wrote an entire post on it. Visibility is the first step in the process: Is the Customer Support team immediately alerted to issues? Can your CS and Dev teams get direct access to logs when troubleshooting? Do all of your teams have clear visibility into the same data? Answer no to any of those and your teams are wasting valuable time because they lack the visibility required to shorten the defect lifecycle.

Our customers report that their internal latency around bug triage, inter-team escalations, and root cause analysis typically decreases by a factor of 5 to 10 when using Scalyr. Interestingly, Scalyr customers have told us that this change matters less over time because increased visibility into log data doesn’t just shorten the product defect lifecycle; it actually decreases the number of product defects. They attribute this decrease to individual engineers’ very high engagement with the log data, which leads them to catch a correspondingly greater percentage of issues earlier in the development process.

Next Generation Deployment Techniques

Imagine, if you will, a traditional code deployment pipeline: the engineering team hands over a release to Ops, Ops deploys it during a specific window within which QA tests, and both Ops and Customer Support stand by post-deployment to verify the health of the running system. But if your goal is to deploy continuously, with multiple releases per week (or per day!) or partial releases via feature flags, blue/green deployments, or similar incremental deployment strategies, the traditional process quickly breaks down.

Why? In traditional environments, engineers monitor releases with prebuilt dashboards and tools (like daily email reports) but cannot access individual server logs or system/application performance metrics for the full stack. As companies move to a more integrated code release pipeline, developers need a more granular and up-to-date view of their code operating in production.

The continuous delivery model can only succeed if engineers have easy access to:

  • The current state of production systems
  • The detailed state of their code (dashboards aren’t enough)
  • All relevant log files (and when in doubt, let them see the data)

Logs as Primary Data

This next use case is slightly different: not only do employees need access to logs, but they need it fast enough to use in their typical decision-making workflow. Once you have that in place, something magic happens: your logs become a primary information source, not one of last resort. The specific implications of this are pretty wide-ranging, but among Scalyr customers, the most common benefits are:

  • Better logging. Once developers know they can get to the logs for real debugging, they start putting more, and cleaner, logging events in their code.
  • Democratized access to logs. When engineers can freely explore how applications are running in production, more eyes are on the lookout for problems, engineers build code for “what is” vs. how things were described to them, and teams operate more asynchronously.
  • Better tools. Knowledge that the data you need is reliably in a central location allows enterprising teams to build specific tools to assist with team-specific issues. This is particularly powerful as over time teams build numerous small tools that would never make the official roadmaps but still provide tangible benefits.

The exact implications for you will depend on how your teams decide to make use of this new power. As the saying goes, “Garbage in, garbage out,” but clean and descriptive logs can transform a business, as I’ll show in the next use cases.

From Engineering and SaaS to Customer Service

Visibility is not just a high-leverage tool for teams reporting to the CIO or VP of Engineering. Any team working to decentralize decision-making or increase organizational efficiency can benefit. The next two examples highlight how non-technical customer-facing teams made transformative changes by enabling employee visibility into operational metrics and data.

Improving Customer Support

Recently Return Path, a leading provider of outbound email services, granted all of their Tier 1 customer support employees direct access to the production application logs. This simple but dramatic shift reduced ticket turnaround times from three business days to about five minutes for customer issues like the following.

Previously, when a support rep received a ticket from a customer complaining that an email wasn’t delivered, the three-day investigation process went something like this:

  1. Work with the customer to verify common email client or other end-user issues weren’t to blame.
  2. Contact Ops to verify that no known issues for the application were to blame.
  3. Create a ticket for the Ops team to pull the relevant logs.
  4. Receive the logs and review the delivery status of the email(s) in question.
  5. Get back to the customer and if required, open a second ticket with Ops or Engineering for any application issues found.

Not the best experience for the customer…

Fast-forward to today, and the same ticket is handled much differently. While on the phone or chat with the customer, the support rep:

  1. Gets the customer’s message ID.
  2. Queries the application logs for the full status of that message (or any other potentially relevant messages) to identify the issue.
  3. Gives the customer an immediate answer and if required, creates a ticket for Ops or Engineering.
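
The lookup in step 2 can be sketched in a few lines. This assumes, purely for illustration, that the application writes one JSON object per log line with `message_id` and `status` fields; that is not Return Path’s actual log format, and `find_message_events` is a hypothetical helper, not part of any Scalyr API.

```python
import json


def find_message_events(log_lines, message_id):
    """Return the parsed log events for one message, in their original order."""
    events = []
    for line in log_lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if isinstance(event, dict) and event.get("message_id") == message_id:
            events.append(event)
    return events
```

The point is less the code than who gets to run it: when the rep can ask this question directly, the answer arrives during the call instead of after two tickets.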

Not only is the customer experience dramatically improved, but both the customer support and Ops teams can spend more time on actual work and less time passing around tickets.

Contact Center Employee Optimization

My last example veers off the standard software development and SaaS path to a very different type of organization: contact centers. For those of you not familiar with the space, contact centers consist of inbound customer support centers, inbound or outbound sales teams, and medium- to large-scale call centers. Contact centers have long had a multitude of metrics used to track their performance. These metrics serve several purposes, most importantly tracking the contact center’s financial and employee performance.

A startup I once worked with, Merced Systems, stepped into the contact center space with a fairly simple proposition. If employees, frontline managers, and company executives had access to key metrics in a timely manner through a user interface that allowed them to understand the raw data, they could use that information to drive more efficient and successful customer engagements. In other words, they built a product that enabled employee visibility into contact center operational metrics and allowed their customers to operate more efficiently.

Customers realized these efficiency gains in several key areas:

  • Employees could self-optimize their actions to meet real-time goals.
  • Managers could evaluate employee performance based on actual vs. perceived performance.
  • Executives could analyze contact center performance along various dimensions.

Net result? Extremely happy customers like T-Mobile, Coca-Cola, Echostar, and many others, and Merced Systems going from idea to a $170M acquisition in less than 10 years. All from the simple idea that granting everyone visibility to key information leads to more efficient operations.

These examples give you some ideas on where, and how, you can apply increased visibility to your environment. If you have a story about how visibility into the right information transformed your environment, we’d love to hear about it in the comments below!

Next time I’ll be talking about the nuts and bolts of enabling visibility in SaaS environments and where we’ve seen the biggest bang for the buck.