Optimizing AngularJS: source code now on Github

At Scalyr, we’ve been rewriting our web client using the AngularJS framework. AngularJS allows us to build our frontend codebase in a modular and testable way, while enabling a single-page application approach that can unlock maximum performance in a web frontend.

Due to the amount of data contained in our log viewer, our first stab at an AngularJS log view had performance problems. We created a set of custom Angular directives to alleviate this. We presented this work in a blog post a few weeks ago, describing how we reduced page update time from 1200 milliseconds to 35 milliseconds.

We have received a flood of requests for the source code to these directives. We’re pleased to announce that the code is now available on Github:

https://github.com/scalyr/angular

We have done some work to clean up and document the code for external use, but please understand that it was developed for our specific use cases. We hope you find it useful — please let us know!


Cloud Cost Calculator

There are many, many options for cloud server hosting nowadays. EC2 pricing alone is so complex that quite a few pages have been built to help sort it out. Even so, while comparing costs for various scenarios — on demand vs. reserved instances, “light utilization” vs. “heavy utilization” reservations, EC2 vs. other cloud providers — we here at Scalyr recently found ourselves building spreadsheets and looking up net-present-value formulas. That seemed a bit silly, so we decided to do something about it. And so we now present, without further ado: the Cloud Cost Calculator.


With this tool, you can:


  • Compare prices across cloud providers (currently Amazon, Digital Ocean, Google, Linode, and Rackspace).
  • Compare hourly and reserved / leased options on an apples-to-apples basis.
  • See your true monthly cost, amortizing any up-front payment across your expected lifetime for a given server.
  • Account for the resale value of EC2 reserved instances.
  • Display and sort by value metrics such as “GB of RAM per dollar”.
  • Restrict by region, provider, lease type, cost, and server size.


Combining the true-monthly-cost computation with “units per dollar” value metrics lets you ask interesting questions with just a few clicks. For instance, you can see which server gives you the most SSD storage per dollar, or find the cheapest way to run a 4-core server 8 hours/day for a year.


To determine monthly costs, you specify how long and how heavily you’ll use the server (e.g. 8 hours/day for 12 months). You also specify your cost of capital, which is used to convert upfront costs into a monthly equivalent. Finally, for EC2, you specify your assumptions regarding the resale value of a partially-used reserved instance on Amazon’s Reserved Instance Marketplace. (We automatically take Amazon’s 10% resale commission into account.)
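
For concreteness, here is a minimal sketch of that amortization step, using hypothetical parameter names (this is not the calculator's actual code): the up-front payment is spread over the server's expected lifetime with a standard annuity formula, then added to the hourly charges for the usage you expect.

function amortizedMonthlyCost(opts) {
  var r = opts.annualCostOfCapital / 12;   // monthly interest rate, e.g. 0.10 / 12
  var n = opts.lifetimeMonths;             // how long you expect to keep the server
  // Standard annuity factor: spreads the up-front payment over n months at rate r.
  var annuityFactor = (r === 0) ? (1 / n) : r / (1 - Math.pow(1 + r, -n));
  return opts.upfrontCost * annuityFactor + opts.hourlyRate * opts.hoursPerMonth;
}

// Example: a $300 one-year reservation at $0.05/hour, run 8 hours/day
// (roughly 243 hours/month), with a 10% annual cost of capital.
amortizedMonthlyCost({
  upfrontCost: 300, hourlyRate: 0.05, hoursPerMonth: 8 * 30.4,
  lifetimeMonths: 12, annualCostOfCapital: 0.10
});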

Fun with data

Exploring this data was kind of fun. Unsurprisingly, Amazon’s specialty monster machines (cc2.8xlarge for CPU, cr1.8xlarge for RAM, hs1.8xlarge for disk, hi1.4xlarge for SSD) turn out to be very competitive on cost-performance for their respective specialties… though for RAM, Amazon’s m2.2xlarge instance is actually somewhat cheaper, per GB, than the cr1.8xlarge. However, Digital Ocean gives Amazon a real run for their money. Digital Ocean’s smallest offerings have outsized SSD and CPU allocations, and come out cheaper than the Amazon monsters on both counts, even for a three-year period that can take advantage of EC2 reserved instances. (Take the CPU comparisons with a grain of salt; see Notes and Caveats.)


If we limit ourselves to mid-sized servers, say 4GB to 32GB of RAM, the picture changes a bit. Digital Ocean and Rackspace both beat Amazon on SSD pricing, and Digital Ocean stays on top for CPU. Amazon still wins for RAM and spinning disk, in part because some competitors are moving entirely to SSD for direct-attached storage.


If you’re not willing to purchase a reserved instance, Amazon’s hs1.8xlarge still beats all comers for spinning disk prices, but Digital Ocean comes out on top everywhere else.

EC2 Reserved instances are under-appreciated

If you’re using Amazon, reserved instances are a great deal. Two things people don’t seem to take into account: a “light utilization” reserved instance gives you most of the benefit of a “heavy utilization” instance, with a much lower up-front cost, even if you’re planning to run the server 24 hours per day. And, the ability to resell instances on Amazon’s marketplace means that a reservation can make good financial sense even when you only need the server for a few months, or aren’t sure how long you need it.


In fact: if you’re going to run a server 24 hours a day, then on paper it’s cheaper to buy a reserved instance even if you only need the server for one month. (You’ll resell the reservation at the end of the month.) Of course, this assumes you have the working capital to purchase the reserved instance, and can risk an Amazon price drop or other marketplace setback. Don’t try this at home.


Let’s take a concrete example. Suppose you need an m3.xlarge server (15GB, 13 ECUs) for one month. An on-demand instance will cost $365 (assuming one month == 1/12 year). If you buy a one-year light-utilization lease, and resell it at the end of the month, your cost comes out to $299, even on mildly conservative assumptions (Amazon’s 10% resale commission, another 10% loss because no one will pay you the full pro-rated price, and 10% annual interest on your capital that was tied up for a month). On the same assumptions, a three-year light-utilization lease comes out even better, at $293.
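
To show how those assumptions combine, here is a rough sketch with hypothetical variable names; it illustrates the approach rather than the exact arithmetic behind the $299 figure.

function costWithResale(upfront, hourlyRate, hoursUsed, monthsHeld, termMonths,
                        annualInterest) {
  var proRatedValue = upfront * (termMonths - monthsHeld) / termMonths;
  var resaleProceeds = proRatedValue
      * 0.90    // Amazon's 10% resale commission
      * 0.90;   // assume buyers pay 10% less than the full pro-rated price
  var interest = upfront * annualInterest * (monthsHeld / 12);  // capital tied up
  return upfront + hourlyRate * hoursUsed + interest - resaleProceeds;
}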


Of course, if you plan on running a server for longer than one month, the benefits of a reservation increase rapidly. At 5 months, you save 50% by using a reservation.

Notes and caveats

We specify the number of CPU cores for each offering, alongside RAM and storage sizes. Most of these figures are fairly objective, but “CPU cores” are a squishy unit at best, and quickly become squishier in a multi-tenant service. We’ve followed this StackOverflow answer and translated EC2 Compute Units (ECUs) and Google GCEUs as 1 core == 2.75 ECUs == 2.75 GCEUs. We arbitrarily treated Google’s “f1-micro” instance as offering 0.5 GCEUs, and translate “1x priority” on Linode as equalling 0.5 cores. Finally, Rackspace advertises “vCPUs” which they say are a “physical CPU thread”; that sounds like a hyperthread, which we count as 0.5 cores. Take all this with a grain of salt.
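
In code, the normalization rules above boil down to something like the following sketch (a hypothetical function, not the calculator's actual source):

function normalizedCores(provider, advertisedUnits) {
  switch (provider) {
    case 'amazon':    return advertisedUnits / 2.75;  // ECUs -> cores
    case 'google':    return advertisedUnits / 2.75;  // GCEUs -> cores (f1-micro counted as 0.5 GCEUs)
    case 'linode':    return advertisedUnits * 0.5;   // "1x priority" == 0.5 cores
    case 'rackspace': return advertisedUnits * 0.5;   // vCPU == hyperthread == 0.5 cores
    default:          return advertisedUnits;         // providers that advertise plain cores
  }
}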


Linode advertises monthly prices, and we compute costs on that basis. In practice, they will apparently pro-rate your bill if you run a server for less than a complete month. The calculator cannot currently take this into account, as there is no field to specify usage for partial months.


Digital Ocean charges for at most 672 hours (28 days * 24 hours/day) per month, which means the hourly rate is a bit lower for full-time usage. Our amortized cost computation handles this correctly.
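
In other words (a sketch with hypothetical names):

function digitalOceanMonthlyCharge(hoursUsedThisMonth, hourlyRate) {
  var billableHours = Math.min(hoursUsedThisMonth, 672);  // cap: 28 days * 24 hours
  return billableHours * hourlyRate;
}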


There are many other differences between service providers. Some will affect your costs directly (several providers include some free bandwidth; GCE instances can be leased in sub-hour increments). More critically, there may be large variations in reliability and network quality, to say nothing of the availability of associated services like S3. This tool is all about price, but price is only one of the metrics you should be considering.


We only report pricing for basic Linux; Windows, RHEL, and other platforms are not covered.


EBS Optimized Instances are not listed.

Contribute

We’d love to have data for more providers, including some traditional non-cloud providers. To contribute data, just put together a JSON file and submit a pull request on our Github repository. The readme file describes the format, and you can see an example here.


Other improvements are welcomed as well. If you have feedback, drop us a line at contact@scalyr.com.


Plugs

If you’re interested in quantitative analysis of cloud performance, you might like our older post, A Systematic Look at EC2 I/O. And you might also be interested in our server monitoring and log analysis service, Scalyr Logs. We’ve built a highly efficient universal storage engine for server logs and metrics, enabling you to manage all your server data in one place with amazing performance — click the link to learn more or try the online demo.


Optimizing AngularJS: 1200ms to 35ms

Edit: Due to the level of interest, we’ve released the source code to the work described here: https://github.com/scalyr/angular. Also, some good discussion at Hacker News.


Here at Scalyr, we recently embarked on a full rewrite of our web client. Our application, Scalyr Logs, is a broad-spectrum monitoring and log analysis tool. Our home-grown log database executes most queries in tens of milliseconds, but each interaction required a page load, taking several seconds for the user.


A single-page application architecture promised to unlock the backend’s blazing performance, so we began searching for an appropriate framework, and identified AngularJS as a promising candidate. Following the “fail fast” principle, we began with our toughest challenge, the log view.


This is a real acid test for an application framework. The user can click on any word to search for related log messages, so there may be thousands of clickable elements on the page; yet we want instantaneous response for paging through the log. We were already prefetching the next page of log data, so the user interface update was the bottleneck. A straightforward AngularJS implementation of the log view took 1.2 seconds to advance to the next page, but with some careful optimizations we were able to reduce that to 35 milliseconds. These optimizations proved to be useful in other parts of the application, and fit in well with the AngularJS philosophy, though we had to break a few rules to implement them. In this article, we’ll discuss the techniques we used.


[Screenshot: a log of Github updates, from our live demo.]

An AngularJS log viewer

At heart, the Log View is simply a list of log messages. Each word is clickable, and so must be placed in its own DOM element. A simple implementation in AngularJS might look like this:

<span class='logLine' ng-repeat='line in logLinesToShow'>
  <span class='logToken' ng-repeat='token in line'>{{token | formatToken}} </span>
  <br>
</span>

One page can easily have several thousand tokens. In our early tests, we found that advancing to the next log page could take several agonizing seconds of JavaScript execution. Worse, unrelated actions (such as clicking on a navigation dropdown) now had noticeable lag. The conventional wisdom for AngularJS says that you should keep the number of data-bound elements below 200. With an element per word, we were far above that level.

Analysis

Using Chrome’s JavaScript profiler, we quickly identified two sources of lag. First, each update spent a lot of time creating and destroying DOM elements. If the new view has a different number of lines, or any line has a different number of words, Angular’s ng-repeat directive will create or destroy DOM elements accordingly. This turned out to be quite expensive.


Second, each word had its own change watcher, which AngularJS would invoke on every mouse click. This was causing the lag on unrelated actions like the navigation dropdown.

Optimization #1: Cache DOM elements

We created a variant of the ng-repeat directive. In our version, when the number of data elements is reduced, the excess DOM elements are hidden but not destroyed. If the number of elements later increases, we re-use these cached elements before creating new ones.
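
Here is a conceptual sketch of the caching idea only (a hypothetical helper; the real sly-repeat directive, available in the scalyr/angular repo, also has to manage child scopes and transclusion). When the collection shrinks, surplus elements are hidden rather than destroyed; when it grows again, hidden elements are reused before any new ones are created.

function syncRepeatedElements(container, cache, collection, render) {
  while (cache.length < collection.length) {
    var el = document.createElement('span');   // create only when the cache runs out
    container.appendChild(el);
    cache.push(el);
  }
  cache.forEach(function(el, i) {
    if (i < collection.length) {
      el.style.display = '';                   // reuse: unhide and re-render
      render(el, collection[i]);
    } else {
      el.style.display = 'none';               // surplus: hide, but keep for later
    }
  });
}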

Optimization #2: Aggregate watchers

All that time spent invoking change watchers was mostly wasted. In our application, the data associated with a particular word can never change unless the overall array of log messages changes. To address this, we created a directive that “hides” the change watchers of its children, allowing them to be invoked only when the value of a specified parent expression changes. With this change, we avoided invoking thousands of per-word change watchers on every mouse click or other minor event.  (To accomplish this, we had to slightly break the AngularJS abstraction layer. We’ll say a bit more about this in the conclusion.)
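
The sketch below illustrates the interception idea with a hypothetical directive name; the real sly-evaluate-only-when directive in scalyr/angular is considerably more careful (deregistration, deep watches, digest ordering). We intercept $watch on a new child scope, queue the registered watchers ourselves, and evaluate them only when a guard expression changes.

angular.module('demo', []).directive('evaluateOnlyWhen', function() {
  return {
    scope: true,
    link: {
      pre: function(scope, element, attrs) {
        var deferred = [];
        var realWatch = scope.$watch;
        // Intercept watcher registration on this scope (and, via prototypal
        // inheritance, on its children): queue the watcher instead of letting
        // Angular evaluate it on every $digest.
        scope.$watch = function(watchExpression, listener) {
          var target = this;   // the (possibly child) scope the watcher was registered on
          deferred.push({
            get: function() { return target.$eval(watchExpression); },
            listener: listener,
            scope: target,
            last: undefined
          });
        };
        // Evaluate the queued watchers only when the guard expression changes.
        realWatch.call(scope, attrs.evaluateOnlyWhen, function() {
          deferred.forEach(function(w) {
            var value = w.get();
            if (value !== w.last) {
              w.listener(value, w.last, w.scope);
              w.last = value;
            }
          });
        });
      }
    }
  };
});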

Optimization #3: Defer element creation

As noted, we create a separate DOM element for each word in the log. We could get the same visual appearance with a single DOM element per line; the extra elements are needed only for mouse interactivity. Therefore, we decided to defer the creation of per-word elements for a particular line until the mouse moves over that line.


To implement this, we create two versions of each line. One is a simple text element, showing the complete log message. The other is a placeholder which will eventually be populated with an element per word. The placeholder is initially hidden. When the mouse moves over that line, the placeholder is shown and the simple version is hidden. Showing the placeholder causes it to be populated with word elements, as described next.

Optimization #4: Bypass watchers for hidden elements

We created one more directive, which prevents watchers from being executed for an element (or its children) when the element is hidden. This supports Optimization #1, eliminating any overhead for extra DOM nodes which have been hidden because we currently have more DOM nodes than data elements. It also supports Optimization #3, making it easy to defer the creation of per-word nodes until the tokenized version of the line is shown.


Here is what the code looks like with all these optimizations applied; our custom directives are the sly- prefixed attributes.

<span class='logLine' sly-repeat='line in logLinesToShow' sly-evaluate-only-when='logLines'>
  <div ng-mouseenter='mouseHasEntered = true'>
    <span ng-show='!mouseHasEntered'>{{line | formatLine}} </span>
    <div ng-show='mouseHasEntered' sly-prevent-evaluation-when-hidden>
      <span class='logToken' sly-repeat='token in line'>{{token | formatToken}}</span>
    </div>
  </div>
  <br>
</span>

sly-repeat is our variant of ng-repeat, which hides extra DOM elements rather than destroying them. sly-evaluate-only-when prevents inner change watchers from executing unless the “logLines” variable changes, indicating that the user has advanced to a new section of the log. And sly-prevent-evaluation-when-hidden prevents the inner repeat clause from executing until the mouse moves over this line and the div is displayed.


This shows the power of AngularJS for encapsulation and separation of concerns. We’ve applied some fairly sophisticated optimizations without much impact on the structure of the template. (This isn’t the exact code we’re using in production, but it captures all of the important elements.)

Results

To evaluate performance, we added code to measure the time from a mouse click until Angular’s $digest cycle finishes (meaning that we are finished updating the DOM). The elapsed time is displayed in a widget on the side of the page. We measured performance of the “Next Page” button while viewing a Tomcat access log, using Chrome on a recent MacBook Pro. Here are the results (each number is the average of 10 trials):

                       Data already cached   Data fetched from server
Simple AngularJS       1190 ms               1300 ms
With Optimizations     35 ms                 201 ms

These figures do not include the time the browser spends in DOM layout and repaint (after JavaScript execution has finished), which is around 30 milliseconds in each implementation. Even so, the difference is dramatic; Next Page time dropped from a “stately” 1.2 seconds, to an imperceptible 35 ms (65 ms with rendering).


The “data fetched from server” figures include time for an AJAX call to our backend to fetch the log data. This is unusual for the Next Page button, because we prefetch the next page of logs, but may be applicable for other UI interactions. But even here, the optimized version updates almost instantly.

Conclusion

This code has been in production for two months, and we’re very happy with the results. You can see it in action at the Scalyr Logs demo site. After entering the demo, click the “Log View” link, and play with the Next / Prev buttons. It’s so fast, you’ll find it hard to believe you’re seeing live data from a real server.


Implementing these optimizations in a clean manner was a fair amount of work. It would have been simpler to create a single custom directive that directly generated all of the HTML for the log view, bypassing ng-repeat. However, this would have been against the spirit of AngularJS, bearing a cost in code maintainability, testability, and other considerations. Since the log view was our test project for AngularJS, we wanted to verify that a clean solution was possible. Also, the new directives we created have already been used in other parts of our application.


We did our best to follow the Angular philosophy, but we did have to bend the AngularJS abstraction layer to implement some of these optimizations. We overrode the Scope’s $watch method to intercept watcher registration, and then had to do some careful manipulation of Scope’s instance variables to control which watchers are evaluated during a $digest.

Next time

This article covered a set of techniques we used to optimize JavaScript execution time in our AngularJS port. We’re big believers in pushing performance to the limit, and these are just some of the tricks we’ve used. In upcoming articles, we’ll describe techniques to reduce network requests, network latency, and server execution time. We may also discuss our general experience with AngularJS and the approach we took to structuring our application code — if you’re interested in this, let us know in the comments.

Obligatory plug

At Scalyr, we’re all about improving the DevOps experience through better technology. If you’ve read this far, you should probably hop on over to scalyr.com and read more about what we’re up to.


Exploring the Github Events Firehose

Here at Scalyr, we’ve been having a lot of fun building out a high-speed query engine for log data, and a snappy UI using AngularJS. However, we haven’t had a good way to show it off: a data exploration tool is useless without data to explore. This has been a challenge when it comes to giving people a way to play with Scalyr Logs before signing up. We recently learned that Github provides a feed of all actions on public repositories. That sounded like a fun basis for a demo, so we began importing the feed. (To explore the data yourself, see the last paragraph.)


We’re collecting data from two sources. One is Github’s official API for retrieving events on public repositories, https://api.github.com/events. The other is https://github.com/timeline.json, an unofficial feed which contains similar data. Each provides some information not included in the other.
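
As a rough illustration of the official feed (a minimal sketch, not our importer), each item in the events API response carries an event type and the repository it applies to:

async function pollGithubEvents() {
  // Github's API requires a User-Agent header; the endpoint returns a JSON
  // array of recent public events.
  const response = await fetch('https://api.github.com/events', {
    headers: { 'User-Agent': 'events-firehose-demo' }
  });
  const events = await response.json();
  for (const event of events) {
    console.log(event.type, event.repo && event.repo.name);
  }
}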


Playing around, we noticed some occasional holes in the data — here’s a graph from Oct. 19:


Looking at the message history on https://status.github.com, it seems that Github reported an outage at that time. We have an internal alert on problems with our import of Github data, but so far every blip has turned out to be at Github’s end. (Note: we love Github, and use it ourselves. Just don’t expect it to be a five-nines service.)


It’s fun to play around with this data. There are around 1 million events per day, which we handle very easily. One nice thing about Scalyr Logs is that it allows you to explore interactively. You can search and filter on the fly and view graphs and value distributions. I found it oddly reassuring that the word “fixed” appears much more often in commit comments than the word “broken”:


Interestingly, “fixed” has a stronger circadian cycle than “broken”. Perhaps robots are breaking things, and humans are fixing them? Graphing checkins by authors in various (self-reported) locations shows that — surprise! — San Franciscans get up later in the day than Europeans:


And the daily cycle seems to be pretty stable — here is a comparison of the last 48 hours, with the same period two weeks earlier:


Side note: on the day I’m writing this, Batavia, Illinois is the most common location, beating out both San Francisco and “none specified”. It turns out that a single user in that town has been quite busy this morning:


Programming languages appear to be well distributed across time zones:


Searching for random keywords can be fun. A quick search for “Unicorn” turns up things like this:


I wonder what it means to “Switch to unicorn from passenger”? This comes from https://github.com/cconstantine/wattle, if you’re curious.


If you’d like to take a look at this data, go to scalyr.com and click the “Try The Demo” link at the bottom of the page. And if you have suggestions for other public data sets that would be fun for us to publish — drop us a line at contact@scalyr.com!


Good News: Your Monitoring Is All Wrong

This is the first in a series of articles on server monitoring techniques. If you’re responsible for a production service, this series is for you.

In this post, I’ll describe a technique for writing alerting rules. The idea is deceptively simple: alert when normal, desirable user actions suddenly drop. This may sound mundane, but it’s a surprisingly effective way to increase your alerting coverage (the percentage of problems for which you’re promptly notified), while minimizing false alarms.

Most alerting rules work on the opposite principle — they trigger on a spike in undesired events, such as server errors. But as we’ll see, the usual approach has serious limitations. The good news: by adding a few alerts based on the new technique, you can greatly improve your alerting coverage with minimal effort.

Why is this important? Ask Microsoft and Oracle.

With almost any monitoring tool, it’s pretty easy to set up some basic alerts. But it’s surprisingly hard to do a good and thorough job, and even the pros can get it wrong. One embarrassing instance: on October 13, 2012, for about an hour, Oracle’s home page consisted entirely of the words “Hello, World”. Oracle hasn’t said anything about this incident, but the fact that it was not corrected for over an hour suggests that it took a while for the ops team to find out that anything was wrong — a pretty serious alerting failure.

A more famous example occurred on February 28th, 2012. Microsoft’s Azure service suffered a prolonged failure during which no new VMs could be launched. We know from Microsoft’s postmortem that it took 75 minutes for the first alert to trigger. For 75 minutes, no Azure VM could launch anywhere in the world, and the Azure operations team had no idea anything was wrong. (The whole incident was quite fascinating. I dissected it in an earlier post; you can find Microsoft’s postmortem linked there.)

This is just the tip of the iceberg; alerting failures happen all the time. Also common are false positives — noisy alerts that disrupt productivity or teach operators the dangerous habit of ignoring “routine” alerts.

If you see something, say something

If you’ve ridden a New York subway in recent years, you’ve seen this slogan. Riders are encouraged to alert the authorities if they see something that looks wrong — say, a suspicious package left unattended. Most monitoring alerts are built on a similar premise: if the monitoring system sees something bad — say, a high rate of server errors, elevated latency, or a disk filling up — it generates a notification.

This is sensible enough, but it’s hard to get it right. Some challenges:

  1. It’s hard to anticipate every possible problem. For each type of problem you want to detect, you have to find a way to measure instances of that problem, and then specify an alert on that measurement.
  2. Users can wake you up at 3:00 AM for no reason. It’s hard to define metrics that distinguish between a problem with your service, and a user (or their equipment) doing something odd. For instance, a single buggy sync client, frantically retrying an invalid operation, can generate a stream of server “errors” and trigger an alert.
  3. Operations that never complete, never complain. If a problem causes operations to never complete at all, those operations may not show up in the log, and your alerts won’t see anything wrong. (Eventually, a timeout or queue limit might kick in and start recording errors that your alerting system can see… but this is chancy, and might not happen until the problem has been going on for a while.) This was a factor in the Azure outage.
  4. Detecting incorrect content. It’s easy to notice when your server trips an exception and returns a 500 status. It’s a lot harder to detect when the server returns a 200 status, but due to a bug, the page is malformed or is missing data. This was presumably why Oracle didn’t spot their “Hello, World” homepage.

Let your users be your guide

The challenges with traditional alerting can be summarized as: servers are complicated, and it’s hard to distinguish desired from undesired behavior. Fortunately, you have a large pool of human experts who can make this determination. They’re called “users”.

Users are great at noticing problems on your site. They’ll notice if it’s slow to load. They’ll notice error pages. They’ll also notice subtler things — nonsensical responses, incomplete or incorrect data, actions that never complete.

Someone might come right out and tell you that you have a problem, but you don’t want to rely on that — it might take a long time for the message to work its way through to your operations team. Fortunately, there’s another approach: watch for a dropoff in normal operations. If users can’t get to your site, or can’t read the page, or their data isn’t showing up, they’ll react by not doing things they normally do — and that’s something you can easily detect.

This might seem simple, but it’s a remarkably robust way of detecting problems. An insufficient-activity alert can detect a broad spectrum of problems, including incorrect content, as well as operations that don’t complete. Furthermore, it won’t be thrown off by a handful of users doing something strange, so there will be few false alarms.

Consider the real-world incidents mentioned above. In Oracle’s case, actions that would normally occur as a result of users clicking through from the home page would have come to a screeching halt. In Microsoft’s case, the rate of successful VM launches dropped straight to zero the moment the incident began.

Red alert: insufficient dollars per second

When I worked at Google, internal lore held that the most important alert in the entire operations system checked for a drop in “advertising dollars earned per second”. This is a great metric to watch, because it rolls up all sorts of behavior. Anything from a data center connectivity problem, to a code bug, to mistuning in the AdWords placement algorithms would show up here. And as a direct measurement of a critical business metric, it’s relatively immune to false alarms. Can you think of a scenario where Google’s incoming cash takes a sudden drop, and the operations team wouldn’t want to know about it?

Concrete advice

Alongside your traditional “too many bad things” alerts, you should have some “not enough good things” alerts. The specifics will depend on your application, but you might look for a dropoff in page loads, or invocations of important actions. It’s a good idea to cover a variety of actions. For instance, if you were in charge of operations for Twitter, you might start by monitoring the rate of new tweets, replies, clickthroughs on search results, accounts created, and successful logins. Think about each important subsystem you’re running, and make sure that you’re monitoring at least one user action which depends on that subsystem.

It’s often best to look for a sudden drop, rather than comparing to a fixed threshold. You might alert if the rate of events over the last 5 minutes is 30% lower than the average over the preceding half hour. This avoids false positives or negatives due to normal variations in usage.
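
As a sketch of that rule (a hypothetical helper, not Scalyr's implementation), given the timestamps of recent events in milliseconds:

function shouldAlert(eventTimestamps, now) {
  var fiveMin = 5 * 60 * 1000, thirtyMin = 30 * 60 * 1000;
  var recent = eventTimestamps.filter(function(t) { return t > now - fiveMin; });
  var baseline = eventTimestamps.filter(function(t) {
    return t > now - fiveMin - thirtyMin && t <= now - fiveMin;
  });
  var recentRate = recent.length / fiveMin;        // events per millisecond
  var baselineRate = baseline.length / thirtyMin;  // average over the preceding half hour
  return recentRate < 0.7 * baselineRate;          // alert on a drop of more than 30%
}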

Note that dropoff alerts are a complement to the usual “too many bad things” alerts, not a replacement. Each approach can find problems that will be missed by the other.

A quick plug

If you liked this article, you’ll probably like Scalyr Logs, our hosted monitoring service. Logs is a comprehensive DevOps tool, combining server monitoring, log analysis, alerts, and dashboards into a single easy-to-use service. Built by experienced devops engineers, it’s designed with the practical, straightforward, get-the-job-done approach shown here.

It’s easy to create usage-based alerts in Logs. Suppose you want to alert if posts to the URL “/upload” drop by 30% in five minutes. The following expression will do the trick:

countPerSecond:5m(POST '/upload') < 0.7 * countPerSecond:30m(POST '/upload')

To check for drops in some other event, just change the query inside the two pairs of parentheses.

Further resources

Last year, I gave a talk on server monitoring which touched on this technique and a variety of others. You can watch it at youtube.com/watch?v=6NVapYun0Xc. Or stay tuned to this blog for more articles in this series.

