<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>Scalyr Blog</title>
	<atom:link href="http://blog.scalyr.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.scalyr.com</link>
	<description></description>
	<lastBuildDate>Wed, 22 May 2013 11:43:03 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='blog.scalyr.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://s2.wp.com/i/buttonw-com.png</url>
		<title>Scalyr Blog</title>
		<link>http://blog.scalyr.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://blog.scalyr.com/osd.xml" title="Scalyr Blog" />
	<atom:link rel='hub' href='http://blog.scalyr.com/?pushpress=hub'/>
		<item>
		<title>&#8220;Benchmarking in the Cloud&#8221; talk online</title>
		<link>http://blog.scalyr.com/2012/12/10/benchmarking-in-the-cloud-talk-online/</link>
		<comments>http://blog.scalyr.com/2012/12/10/benchmarking-in-the-cloud-talk-online/#comments</comments>
		<pubDate>Mon, 10 Dec 2012 18:51:45 +0000</pubDate>
		<dc:creator>scalyr</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.scalyr.com/?p=266</guid>
		<description><![CDATA[Amazon has posted the talks from re:Invent on YouTube. The video from the EBS session is here. My brief presentation on &#8220;Benchmarking in the Cloud&#8221; starts at the 30:16 mark (direct link). You can download my slides here. It was a terrific conference. The pace of development, and just plain enthusiasm and energy, around [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>Amazon has posted the talks from re:Invent on YouTube. The video from the EBS session is <a href='http://www.youtube.com/watch?v=vXkBvuAM7T4&amp;list=PLhr1KZpdzukeEg3ipIKj_DEz0kjlNzAG7&amp;index=6'>here</a>. My brief presentation on &#8220;Benchmarking in the Cloud&#8221; starts at the 30:16 mark (<a href='http://www.youtube.com/watch?v=vXkBvuAM7T4&amp;list=PLhr1KZpdzukeEg3ipIKj_DEz0kjlNzAG7&amp;index=6#t=30m16s'>direct link</a>). You can download my slides <a href='http://scalyr.files.wordpress.com/2012/12/benchmarks_in_the_cloud.pdf'>here</a>.</p>
<p>&nbsp;</p>
<p>It was a terrific conference. The pace of development, and just plain enthusiasm and energy, around cloud services in general and AWS in particular is just amazing. I do recommend checking out some of the talks if you have time.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.scalyr.com/2012/12/10/benchmarking-in-the-cloud-talk-online/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/7b1ce707b69e2718c05e97dff2dc6daf?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">scalyr</media:title>
		</media:content>
	</item>
		<item>
		<title>Server Monitoring Talk Now Online</title>
		<link>http://blog.scalyr.com/2012/11/26/server_monitoring_talk/</link>
		<comments>http://blog.scalyr.com/2012/11/26/server_monitoring_talk/#comments</comments>
		<pubDate>Tue, 27 Nov 2012 03:29:42 +0000</pubDate>
		<dc:creator>scalyr</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.scalyr.com/?p=256</guid>
		<description><![CDATA[The video of my talk on server monitoring (&#8220;Famous Outages, and How To Not Have Them&#8221;) is now available. Thanks to Box for providing the venue and a good crowd, and thanks to the crowd for a great response. The talk is aimed at anyone who is running a production system, large or [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>The video of my talk on server monitoring (&#8220;Famous Outages, and How To Not Have Them&#8221;) is now available:</p>
<p>&nbsp;</p>
<p><span class='embed-youtube' style='text-align:center; display: block;'><iframe class='youtube-player' type='text/html' width='590' height='362' src='http://www.youtube.com/embed/6NVapYun0Xc?version=3&#038;rel=1&#038;fs=1&#038;showsearch=0&#038;showinfo=1&#038;iv_load_policy=1&#038;wmode=transparent' frameborder='0'></iframe></span></p>
<p>&nbsp;</p>
<p>Thanks to Box for providing the venue and a good crowd, and thanks to the crowd for a great response. The talk is aimed at anyone who is running a production system, large or small. The focus is on how to get good monitoring coverage for a reasonable investment in effort, spiced up with plenty of stories about real-world production outages.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.scalyr.com/2012/11/26/server_monitoring_talk/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/7b1ce707b69e2718c05e97dff2dc6daf?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">scalyr</media:title>
		</media:content>
	</item>
		<item>
		<title>Cloud Benchmarks presentation at re:Invent</title>
		<link>http://blog.scalyr.com/2012/11/26/cloud-benchmarks-presentation-at-re-invent/</link>
		<comments>http://blog.scalyr.com/2012/11/26/cloud-benchmarks-presentation-at-re-invent/#comments</comments>
		<pubDate>Mon, 26 Nov 2012 19:14:15 +0000</pubDate>
		<dc:creator>scalyr</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.scalyr.com/?p=246</guid>
		<description><![CDATA[I&#8217;ll be speaking briefly on the subject of Cloud Benchmarks at Amazon&#8217;s re:Invent conference, in Las Vegas this week. This will be a brief presentation during the &#8220;Using Amazon Elastic Block Store&#8221; session, 2:05 Wednesday afternoon in Venetian B. If you happen to be at the conference, come check it out &#8212; if not [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>I&#8217;ll be speaking briefly on the subject of Cloud Benchmarks at Amazon&#8217;s re:Invent conference, in Las Vegas this week. This will be a brief presentation during the &#8220;Using Amazon Elastic Block Store&#8221; session, 2:05 Wednesday afternoon in Venetian B. If you happen to be at the conference, come check it out &#8212; if not for my presentation, then for Scot VanDenPlas, devops lead for the noted Obama for America technology effort, who will be speaking in the same session.</p>
<p>&nbsp;</p>
<p>We&#8217;ll be around the show on Wednesday and Thursday. If you&#8217;re going to be there and would like to chat (about server monitoring, cloud benchmarks, or anything else), drop me a line at <a href='mailto:steve@scalyr.com'>steve@scalyr.com</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.scalyr.com/2012/11/26/cloud-benchmarks-presentation-at-re-invent/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/7b1ce707b69e2718c05e97dff2dc6daf?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">scalyr</media:title>
		</media:content>
	</item>
		<item>
		<title>Tech Talk: Famous Outages, and How To Not Have Them</title>
		<link>http://blog.scalyr.com/2012/11/12/tech-talk-famous-outages-and-how-to-not-have-them/</link>
		<comments>http://blog.scalyr.com/2012/11/12/tech-talk-famous-outages-and-how-to-not-have-them/#comments</comments>
		<pubDate>Mon, 12 Nov 2012 22:03:13 +0000</pubDate>
		<dc:creator>scalyr</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.scalyr.com/?p=236</guid>
		<description><![CDATA[This Wednesday at 6:00 PM, I&#8217;ll be giving a talk on server monitoring at Box headquarters in Los Altos, California. If you&#8217;re in the area, it should be fun. If not, we&#8217;ll be posting the video on YouTube later. Register (it&#8217;s free!) at: http://boxtechtalks.eventbrite.com/ Your company is growing rapidly and becoming more successful [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>This Wednesday at 6:00 PM, I&#8217;ll be giving a talk on server monitoring at Box headquarters in Los Altos, California. If you&#8217;re in the area, it should be fun. If not, we&#8217;ll be posting the video on YouTube later. Register (it&#8217;s free!) at:</p>
<p>&nbsp;</p>
<p><a href="http://boxtechtalks.eventbrite.com/">http://boxtechtalks.eventbrite.com/</a></p>
<p>&nbsp;</p>
<p><em>Your company is growing rapidly and becoming more successful every day. You have a team that actively does server monitoring. Or maybe you are still too small to dedicate resources to it. You think you are prepared for the worst&#8230; and then seemingly out of the blue, your site goes down and it feels like the world has ended. What do you do? What went wrong? How could you have prevented it?</em></p>
<p>&nbsp;</p>
<p><em>Steve Newman knows this pain. In this talk, he will discuss going beyond the basics of server monitoring: to detect subtle problems before your users do, to use forensic techniques for chasing down non-reproducible bugs, to actively do capacity planning, and more.</em></p>
<p>&nbsp;</p>
<p><em>The talk will be built around a series of postmortems of real-world incidents, some of which made the newspapers.</em></p>
<p>&nbsp;</p>
<p><em>Come hear one of the founding fathers of Google Docs talk at Box!</em></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.scalyr.com/2012/11/12/tech-talk-famous-outages-and-how-to-not-have-them/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/7b1ce707b69e2718c05e97dff2dc6daf?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">scalyr</media:title>
		</media:content>
	</item>
		<item>
		<title>EC2 Benchmark Followup (Source + Data)</title>
		<link>http://blog.scalyr.com/2012/10/29/ec2-benchmark-followup-source-data/</link>
		<comments>http://blog.scalyr.com/2012/10/29/ec2-benchmark-followup-source-data/#comments</comments>
		<pubDate>Mon, 29 Oct 2012 23:38:40 +0000</pubDate>
		<dc:creator>scalyr</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.scalyr.com/?p=223</guid>
		<description><![CDATA[Many people have asked for the source code behind our recent post on EC2 I/O performance. After some minimal cleanup, we have now posted the source code on GitHub: https://github.com/scalyr/iobench. We’ve also created a discussion group for this work: https://groups.google.com/forum/#!forum/scalyr-cloud-benchmarks. There were also a few requests for the raw data. We have now posted it, as [&#8230;]]]></description>
				<content:encoded><![CDATA[<p><span>Many people have asked for the source code behind our recent </span><span><a href="http://blog.scalyr.com/2012/10/16/a-systematic-look-at-ec2-io/">post</a></span><span>&nbsp;on EC2 I/O performance. After some minimal cleanup, we have now posted the source code on GitHub: </span><span><a href="https://github.com/scalyr/iobench">https://github.com/scalyr/iobench</a></span><span>. We’ve also created a discussion group for this work: </span><span style="background-color:#ffffff;"><a href="https://groups.google.com/forum/#!forum/scalyr-cloud-benchmarks">https://groups.google.com/forum/#!forum/scalyr-cloud-benchmarks</a></span><span>.</span></p>
<p class="emptyP"><span></span></p>
<p><span>There were also a few requests for </span><span>the raw data. We have now posted it, as two separate archives: </span><span><a href="http://scalyr.files.wordpress.com/2012/10/iobench_1.zip">iobench_1.zip</a></span><span>&nbsp;and </span><span><a href="http://scalyr.files.wordpress.com/2012/10/iobench_2.zip">iobench_2.zip</a></span><span>. These correspond to the two rounds of benchmarks described in the previous post (see the “Methodology” section). The first round measured performance for different thread counts; the second round measured only the “optimal” thread count for each configuration, over a longer period of time. The remainder of this post describes the format of these data archives.</span></p>
<p class="emptyP"><span></span></p>
<p><span>Each archive contains 8 subdirectories (“trial1”, “trial2”, etc.), corresponding to the eight tested configurations: small/ephemeral, small/ebs1, small/ebs4, medium/ephemeral, large/ephemeral, large/ebs, large/ebs4, and xlarge/ephemeral respectively. Within each subdirectory is an “output” directory, which contains many numbered files; the numbers correspond to the EC2 instances being benchmarked.</span></p>
<p class="emptyP"><span></span></p>
<p><span>Of primary interest are the JSON files (json.1, json.2, etc.). These summarize the results of the benchmark runs. Each line corresponds to a single benchmark, and is in JSON format with the following structure:</span></p>
<p class="emptyP"><span></span></p>
<p class="code"><span>{<br /> &nbsp;&quot;fileSize&quot;: 85899345920, &nbsp; &nbsp; &nbsp; // size of data file<br /> &nbsp;&quot;launchTime&quot;: 2367,<br /> &nbsp;&quot;runtime&quot;: 120, &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;// runtime for this benchmark (secs)<br /> &nbsp;&quot;bucketDuration&quot;: 30, &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;// duration of a time bucket (secs)<br /> &nbsp;&quot;operations&quot;: [<br /> &nbsp; &nbsp;{<br /> &nbsp; &nbsp; &nbsp;&quot;signature&quot;: &quot;read,4K,4K&quot;, // operation tested (here, 4K reads)<br /> &nbsp; &nbsp; &nbsp;&quot;threadCount&quot;: 8, &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;// number of I/O threads used<br /> &nbsp; &nbsp; &nbsp;&quot;total&quot;: {HISTOGRAM}, &nbsp; &nbsp; &nbsp;// summarizes all operations<br /> &nbsp; &nbsp; &nbsp;&quot;timeBuckets&quot;: [<br /> &nbsp; &nbsp; &nbsp; &nbsp;{HISTOGRAM}, &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; // see below<br /> &nbsp; &nbsp; &nbsp; &nbsp;{HISTOGRAM},<br /> &nbsp; &nbsp; &nbsp; &nbsp;...<br /> &nbsp; &nbsp; &nbsp;]<br /> &nbsp; &nbsp;}<br /> &nbsp;]<br />}</span></p>
<p class="emptyP"><span></span></p>
<p><span>We divide the benchmark execution period into buckets. In this example, the benchmark ran for 120 seconds, with 30-second buckets. The timeBuckets array contains a histogram per bucket, reporting on the runtime of all operations completed during that bucket. The “total” field contains a histogram for all operations in the entire benchmark (i.e. summing across time buckets). Note that the timeBuckets array generally contains one extra entry, reflecting straggler operations that completed just after the nominal benchmark runtime.</span></p>
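<p><span>As a rough illustration (this is not part of iobench), the sketch below turns one such line into per-bucket throughput. It assumes each histogram carries the “count” field described below, and it drops the extra straggler bucket. The function name bucket_throughput and the sample record are made up for the example, since the sample above contains placeholders.</span></p>

```python
import json

def bucket_throughput(record):
    """Return {signature: [ops/sec for each nominal time bucket]}."""
    duration = record["bucketDuration"]        # seconds per time bucket
    nominal = record["runtime"] // duration    # buckets within the nominal runtime
    rates = {}
    for op in record["operations"]:
        # Skip the trailing straggler bucket; it does not cover a full interval.
        buckets = op["timeBuckets"][:nominal]
        rates[op["signature"]] = [h["count"] / duration for h in buckets]
    return rates

# Synthetic line in the documented shape; the counts are invented.
line = json.dumps({
    "runtime": 120,
    "bucketDuration": 30,
    "operations": [{
        "signature": "read,4K,4K",
        "threadCount": 8,
        "total": {"count": 12030},
        "timeBuckets": [{"count": c} for c in (3000, 3300, 2700, 3000, 30)],
    }],
})

rates = bucket_throughput(json.loads(line))
print(rates["read,4K,4K"])
```

<p><span>Each line of a json.N file could be fed through json.loads in the same way. Dividing the straggler bucket by bucketDuration would understate throughput, which is why the sketch leaves it out.</span></p>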
<p class="emptyP"><span></span></p>
<p><span>Each histogram has the following structure:</span></p>
<p class="emptyP"><span></span></p>
<p class="code"><span>{<br /> &nbsp;&quot;count&quot;: 19516, // total number of operations reported here<br /> &nbsp;&quot;errorCount&quot;: 0, // number of failed operations<br /> &nbsp;&quot;minValue&quot;: 6966, // minimum runtime (nanos) for any operation<br /> &nbsp;&quot;maxValue&quot;: 581407940, // maximum runtime (nanos) for any operation<br /> &nbsp;&quot;totalValue&quot;: 959992052130, // total runtime (nanos) for all ops<br /> &nbsp;&quot;bucketRatio&quot;: 1.1,<br /> &nbsp;&quot;firstBucketStart&quot;: 6727.4999493256,<br /> &nbsp;&quot;buckets&quot;: [...],<br /> &nbsp;&quot;pinMinimum&quot;: 1000,<br /> &nbsp;&quot;pinMaximum&quot;: 10000000000<br />}</span></p>
<p class="emptyCodeP"><span></span></p>
<p><span>Operation runtimes are measured in nanoseconds. Each runtime is pinned to the range [pinMinimum ... pinMaximum], and then placed in a bucket. Each entry in the buckets array indicates the number of operations whose runtime fell in a particular range. The range for buckets[k] is [B * 1.1^k … B * 1.1^(k+1)], where B is firstBucketStart and 1.1 is the bucketRatio. In other words, the largest value falling into a bucket is 1.1 times the smallest value, and the smallest value for the first bucket is firstBucketStart. The code behind all this is in </span><span><a href="https://github.com/scalyr/iobench/blob/master/src/Histogram.java">Histogram.java</a></span><span>.</span></p>
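<p><span>To make the bucket arithmetic concrete, here is a small illustrative sketch (not taken from Histogram.java) that computes the range of buckets[k] and a coarse percentile estimate from the bucket counts. The helper names and the sample histogram are invented, and the sketch assumes that the 1.1 in the formula above is the histogram's bucketRatio field.</span></p>

```python
def bucket_bounds(hist, k):
    """Lower and upper runtime bound (nanos) of buckets[k]."""
    start = hist["firstBucketStart"]
    ratio = hist["bucketRatio"]
    return start * ratio ** k, start * ratio ** (k + 1)

def approx_percentile(hist, fraction):
    """Upper bound (nanos) of the bucket where the cumulative count reaches
    the given fraction of all bucketed operations."""
    target = fraction * sum(hist["buckets"])
    seen = 0
    for k, n in enumerate(hist["buckets"]):
        seen += n
        if seen >= target:
            return bucket_bounds(hist, k)[1]
    return None  # empty buckets array

# Invented histogram with four buckets, for illustration only.
hist = {"firstBucketStart": 1000.0, "bucketRatio": 1.1, "buckets": [10, 60, 20, 10]}

print(bucket_bounds(hist, 0))         # range covered by the first bucket
print(approx_percentile(hist, 0.5))   # coarse median estimate
print(approx_percentile(hist, 0.95))  # coarse 95th-percentile estimate
```

<p><span>Because each bucket spans a factor of 1.1, an estimate obtained this way can overstate the true percentile by up to about 10%.</span></p>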
<p class="emptyP"><span></span></p>
<p><span>Also of conceivable interest are the files run.out.1, run.out.2, etc. These contain the raw stdout from the benchmark tool. The contents are essentially the same as the JSON files, with some additional logging noise.</span></p>
<p class="emptyP"><span></span></p>
<p><span>If you have questions, please post on the </span><span><a href="https://groups.google.com/forum/#!forum/scalyr-cloud-benchmarks">discussion group</a></span><span>.</span></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.scalyr.com/2012/10/29/ec2-benchmark-followup-source-data/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/7b1ce707b69e2718c05e97dff2dc6daf?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">scalyr</media:title>
		</media:content>
	</item>
		<item>
		<title>Even Stranger than Expected: a Systematic Look at EC2 I/O</title>
		<link>http://blog.scalyr.com/2012/10/16/a-systematic-look-at-ec2-io/</link>
		<comments>http://blog.scalyr.com/2012/10/16/a-systematic-look-at-ec2-io/#comments</comments>
		<pubDate>Tue, 16 Oct 2012 20:45:59 +0000</pubDate>
		<dc:creator>scalyr</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.scalyr.com/?p=129</guid>
		<description><![CDATA[At Scalyr, we’re building a large-scale storage system for timeseries and log data (see Introducing Scalyr Logs). To make good design decisions, we need hard data about EC2 I/O performance. Plenty of data has been published on this topic, but we couldn’t really find the answers we needed. Most published data is specific to a particular application or [&#8230;]]]></description>
				<content:encoded><![CDATA[<p><span>At Scalyr, we’re building a large-scale storage system for timeseries and log data (see </span><span><a href="http://blog.scalyr.com/2012/10/09/introducing-scalyr-logs/">Introducing Scalyr Logs</a></span><span>)</span><span>. To make good design decisions, we need hard data about EC2 I/O performance.</span></p>
<p class="emptyP"><span></span></p>
<p><span><a href="http://tech.blog.greplin.com/aws-best-practices-and-benchmarks">Plenty</a></span><span>&nbsp;</span><span><a href="http://blog.cloudharmony.com/2010/06/disk-io-benchmarking-in-cloud.html">of</a></span><span>&nbsp;</span><span><a href="http://techblog.netflix.com/2012/07/benchmarking-high-performance-io-with.html">data</a></span><span>&nbsp;has been published on this topic, but we couldn’t really find the answers we needed. Most published data is specific to a particular application or EC2 configuration, or was collected from a small number of instances and hence is statistically suspect. (More on this below.)</span></p>
<p class="emptyP"><span></span></p>
<p><span>Since the data we wanted wasn’t readily available, we decided to collect it ourselves. For the benefit of the community, we’re presenting our results here. These tests involved over 1000 EC2 instances, $1000 in AWS charges, and billions of I/O operations.</span></p>
<h3><span>Stranger Than You Thought</span></h3>
<p style="text-align:center;"><img height="170" src="http://scalyr.files.wordpress.com/2012/10/iobench_tangle1.png?w=329&#038;h=170" width="329" /></p>
<p><span>This graph plots performance over time, for 45 EC2 instances performing a simple steady-state benchmark. The fluctuations indicate that performance of an individual instance can vary widely over time. If you look carefully, you can see that some instances are much more stable and better-performing than others.</span></p>
<p class="emptyP"><span></span></p>
<p><span>Analyzing the data, we found many patterns. Some things, such as the variations in performance, we&#8217;d anticipated. Others came as a surprise; the relationship between instance size, storage type, and performance is more complex than we&#8217;d previously seen reported. In this post, we present a variety of findings that we hope will be of interest to anyone working with cloud computing, and that may help you to make better design decisions and avoid performance pitfalls.</span></p>
<p class="emptyP"><span></span></p>
<p><span>I&#8217;ll discuss methodology in a later section, but here are the Cliff’s Notes: we tested small reads and writes (“read4K” and “write4K”), large synchronous reads and writes (“read4M” and “write4M”), and small mostly-asynchronous writes (“write4K/64”). We tested a variety of EC2 instance sizes, using instance storage or EBS, on single drives or RAID-0 groups. Each combination was repeated on dozens of EC2 instances.</span></p>
<p class="emptyP"><span></span></p>
<h3><a></a><span>Cost Effectiveness</span></h3>
<p><span>This chart shows which configurations give the best bang for the buck &#8212; operations per dollar:</span></p>
<p class="emptyP"><span></span></p>
<p style="text-align:center;"><img height="316" src="http://scalyr.files.wordpress.com/2012/10/iobench_costeffectiveness.png?w=571&#038;h=316" width="571" /></p>
<p><span>These figures reflect EC2 hourly rates and EBS I/O charges, but not EBS monthly storage fees (which aren’t affected by usage). Rates are for on-demand instances in the us-east region. Reserved or spot instances would reduce EC2 charges substantially, but not EBS charges, meaning that non-EBS instances would look better on the chart. The next chart shows cost effectiveness for bulk transfers:</span></p>
<p class="emptyP"><span></span></p>
<p style="text-align:center;"><img height="278" src="http://scalyr.files.wordpress.com/2012/10/iobench_costeffectivenessbulk.png?w=489&#038;h=278" width="489" /></p>
<p><span>Here, ephemeral storage has a huge advantage, which reserved instances would only amplify.</span></p>
<h3><a></a><span>Impact of RAID</span></h3>
<p><span>It&#8217;s widely held that the best EBS performance is obtained by RAIDing together multiple drives. Here&#8217;s what we found:</span></p>
<p class="emptyP"><span></span></p>
<p style="text-align:center;"><img height="371" src="http://scalyr.files.wordpress.com/2012/10/iobench_raidspeedup.png?w=600&#038;h=371" width="600" /></p>
<p><span>Each bar represents the throughput ratio of ebs4 to ebs on one specific test; in other words, the speedup from moving from one to four EBS volumes. Blue reflects m1.small instances, red is m1.large. Each quantity represents an average across 45 instances. RAID offered a substantial benefit for small operations (especially reads), but &#8212; surprisingly &#8212; not much for bulk transfers. (Note, we did not make much attempt at tuning our filesystem or RAID configuration. See the Methodology section.)</span></p>
<h3><a></a><span>Impact of Instance Size</span></h3>
<p><span>Amazon states that larger EC2 instances have &#8220;higher&#8221; I/O performance, but they don&#8217;t quantify that. Our data:</span></p>
<p style="text-align:center;"><img height="371" src="http://scalyr.files.wordpress.com/2012/10/iobench_instancesizespeedup.png?w=600&#038;h=371" width="600" /></p>
<p><span>For ephemeral storage, m1.medium was hardly better than m1.small, but m1.large and m1.xlarge show a substantial benefit. (The lackluster performance of m1.medium is not surprising: it has the same number of instance drives as m1.small &#8212; one &#8212; and the same advertised I/O performance, “low”.)</span></p>
<p class="emptyP"><span></span></p>
<p><span>For EBS, m1.large shows little benefit over m1.small.</span></p>
<p class="emptyP"><span style="font-style:italic;"></span></p>
<p><span style="font-style:italic;">Shameless plug: if you’ve read this far, you’re probably doing interesting things in the cloud. If you’re doing interesting things, you have “interesting” monitoring challenges. And in that case, you’re just the sort of person we had in mind when we built Scalyr Logs. Check out the </span><span style="font-style:italic;"><a href="http://blog.scalyr.com/2012/10/09/introducing-scalyr-logs/">blog post</a></span><span style="font-style:italic;">&nbsp;and the </span><span style="font-style:italic;"><a href="https://www.scalyr.com/logs">product page</a></span><span style="font-style:italic;">, and </span><span style="font-style:italic;"><a href="https://www.scalyr.com/logSignup">register here</a></span><span style="font-style:italic;">&nbsp;if you’d like to try it out.</span></p>
<h3><a></a><span>Bad Apples</span></h3>
<p><span>You often hear that EC2 I/O is subject to a “bad apple” effect: some instances have markedly poor I/O performance, and you can get big gains by detecting these instances and moving off of them. We found that this effect is real, but applies much more strongly to some use cases than others. Consider the following two charts:</span></p>
<p class="emptyP"><span></span></p>
<p style="text-align:center;"><img height="149" src="http://scalyr.files.wordpress.com/2012/10/iobench_badapples1.png?w=412&#038;h=149" width="412" /></p>
<p style="text-align:center;"><img height="143" src="http://scalyr.files.wordpress.com/2012/10/iobench_badapples2.png?w=295&#038;h=143" width="295" /></p>
<p class="emptyP"><span></span></p>
<p><span>These are performance histograms: the horizontal axis shows operations per second, and the vertical axis shows the number of instances exhibiting a particular level of performance. A tall, narrow histogram indicates performance that is consistent across instances. Note that the horizontal axis uses a log scale.</span></p>
<p class="emptyP"><span></span></p>
<p><span>The first histogram is for bulk writes (write4M) on the small/ebs configuration. 45 instances were tested; the histogram presents the mean throughput for each of those 45 instances. So, the slowest instance sustained roughly 0.3 writes per second (1.2MB/sec write bandwidth), while the fastest sustained a bit over 10 writes/second (40MB/sec) &#8212; a difference of more than 30x! Most instances were clustered around 7 writes/second, but 5 out of 45 managed 0.8 writes/second or less.</span></p>
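<p><span>The bandwidth figures follow directly from the operation size: each write4M operation moves 4 MB. A quick sketch of the arithmetic, using the approximate rates read off the histogram (the function name is made up for the example):</span></p>

```python
MB_PER_WRITE = 4  # each write4M operation transfers 4 MB

def bandwidth_mb_per_sec(writes_per_sec):
    """Convert a write4M operation rate into write bandwidth in MB/sec."""
    return writes_per_sec * MB_PER_WRITE

# Approximate writes/second for the slowest and fastest instance.
slowest, fastest = 0.3, 10.0

print(bandwidth_mb_per_sec(slowest))  # bandwidth of the slowest instance
print(bandwidth_mb_per_sec(fastest))  # bandwidth of the fastest instance
print(fastest / slowest)              # best/worst throughput ratio
```

<p><span>With these rounded inputs the ratio comes out a little above 33, consistent with the “more than 30x” figure.</span></p>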
<p class="emptyP"><span></span></p>
<p><span>The second histogram</span><span>&nbsp;is </span><span style="font-style:italic;">exactly the same</span><span>, but for read4K operations on medium/ephemeral instances. Here, all 45 instances fell in the range of 200 to 300 operations/second; a best/worst ratio of just 1.5 : 1.</span></p>
<p class="emptyP"><span></span></p>
<p><span>So, if you’re doing bulk writes on EBS, you probably need to worry about bad instances. Small reads on EC2 instance storage, not so much. In general, bulk transfers (read4M, write4M) show more variation across instances than small random I/Os (read4K, write4K, write4K/64), and EBS shows more variation than ephemeral storage, but there are exceptions. You’ll find systematic results in a later section (&#8220;Variation Across Instances&#8221;).</span></p>
<h3><a></a><span>Impact of Parallelism</span></h3>
<p><span>This chart shows throughput as a function of thread count. Each graph shows results for a particular operation on a particular storage type. Each line shows a particular EC2 configuration.</span></p>
<p class="emptyP"><span style="font-style:italic;"></span></p>
<p style="text-align:center;"><img height="866" src="http://scalyr.files.wordpress.com/2012/10/iobench_throughputbythreadcount.png?w=732&#038;h=866" width="732" /></p>
<p class="emptyP"><span></span></p>
<p class="emptyP"><span></span></p>
<p><span>We can see that parallelism often improves throughput, but diminishing returns set in quickly. For all operations except read4K, and write4K on ebs4, 5 threads are enough. In fact, a single thread is enough for good throughput in many cases. But in some circumstances, small reads can benefit from as many as 48 threads.</span></p>
<p class="emptyP"><span></span></p>
<p><span>Interestingly, for large reads on m1.medium / ephemeral, throughput </span><span style="text-decoration:underline;">drops</span><span>, quite dramatically, when more than one thread is used. This effect held up consistently throughout multiple test runs, each on its own fresh set of 30 instances, on multiple days. The fact that we observed this effect only on m1.medium serves to highlight the importance of testing the exact configuration you plan to use.</span></p>
<p class="emptyP"><span></span></p>
<p><span>We can also see in this chart that EBS offers inexplicably good performance for small writes. For instance, write4K on large/ebs executes over 800 operations/second with a single thread. This implies a mean latency of roughly one millisecond &#8212; barely enough time for a network roundtrip to an EBS server. Either Amazon is doing something very clever, or </span><span>EBS does not actually wait for durability before acknowledging a flush command.</span></p>
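<p><span>The latency estimate follows from the throughput: with one thread issuing one synchronous operation at a time, mean latency is the reciprocal of the operation rate. A minimal sketch, assuming every thread always has exactly one operation in flight (the function name is made up for the example):</span></p>

```python
def mean_latency_ms(ops_per_sec, threads=1):
    """Mean per-operation latency in milliseconds implied by a throughput,
    assuming each thread always has exactly one operation in flight."""
    return threads / ops_per_sec * 1000.0

# write4K on large/ebs: a bit over 800 operations/second on a single thread.
print(mean_latency_ms(800))
```

<p><span>At 800 operations/second on one thread this gives about 1.25 ms per operation, and since the measured rate was somewhat higher, the true figure is closer to one millisecond. The same formula applies to multi-threaded runs only if the threads really do keep one operation each in flight.</span></p>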
<h3><a></a><span>Variation Across Instances</span></h3>
<p><span>This chart shows how performance varies across 45 nominally identical instances. Each graph presents one benchmark. The vertical axis plots latency, and the horizontal axis plots the individual instances, sorted from best to worst. The five lines represent various latency percentiles, as indicated by the color key. If all instances behaved identically, the graphs would be quite flat, especially as we’re using a log scale.</span></p>
<p class="emptyP"><span></span></p>
<p style="text-align:center;"><img height="1251" src="http://scalyr.files.wordpress.com/2012/10/iobench_latencybyinstance.png?w=703&#038;h=1251" width="703" /></p>
<p class="emptyP"><span></span></p>
<p><span>Remember that the horizontal axis shows EC2 instances, not time. A sloped graph indicates that some instances were faster than others. Continuous slopes indicate gradual variations, while spikes indicate that some instances behaved very differently.</span></p>
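<p><span>To make the construction of these graphs concrete, here is a small sketch of how one might build them. It is an illustration rather than our actual plotting code: the nearest-rank percentile method, the choice of three percentiles, and sorting by median are all assumptions.</span></p>

```python
import math


def percentile(sorted_values, p):
    """Nearest-rank percentile of an ascending list (p between 0 and 100)."""
    rank = max(1, math.ceil(p / 100.0 * len(sorted_values)))
    return sorted_values[rank - 1]


def instance_curves(latencies_by_instance, percentiles=(50, 90, 99)):
    """Return one row per instance, ordered best to worst by median latency."""
    rows = []
    for name, samples in latencies_by_instance.items():
        ordered = sorted(samples)
        rows.append((name, [percentile(ordered, p) for p in percentiles]))
    rows.sort(key=lambda row: row[1][0])  # first entry of each row is the median
    return rows


# Hypothetical latencies in milliseconds, not data from our runs.
demo = {"i-a": [10, 11, 12, 50], "i-b": [8, 9, 9, 10]}
for name, values in instance_curves(demo):
    print(name, values)
```

<p><span>Each percentile then becomes one line on the graph, with the sorted instances along the horizontal axis.</span></p>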
<p class="emptyP"><span></span></p>
<p><span>We can see that ephemeral storage latency is fairly consistent in general, though it’s not uncommon to have a few outliers. (For instance, for read4K on m1.xlarge / ephemeral, one instance appears to have mean latency more than 4x worse than the best instance.) Our sample size is too small to properly judge the prevalence of these outliers. Also note that bulk transfers show more variation than small operations.</span></p>
<p class="emptyP"><span></span></p>
<p><span>EBS shows more variation, except for write4K/64. writeFlush operations (write4K and write4M) are especially bad, with 10:1 variations the norm.</span></p>
<p class="emptyP"><span></span></p>
<p><span>Note that larger instances don’t always appear faster here, because we’re working them harder (more threads), and these graphs show latency rather than throughput.</span></p>
<h3><a></a><span>Performance over time (A Twisty Maze of Instances, All Different)</span></h3>
<p><span>Next, we examine how throughput varies over time. Each graph contains 45 lines, one per EC2 instance. The vertical axis is throughput, and the horizontal axis is time. The total time span is 10 minutes. This is not long enough to show long-term trends, but with 45 test instances, there is some scope for infrequent events to manifest.</span></p>
<p class="emptyP"><span></span></p>
<p><span>The vertical axis on each graph is normalized to the maximum sample for that graph, so you should not attempt to compare values across graphs.</span></p>
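<p><span>The normalization is simple. A sketch, assuming each graph is held as a mapping from a (hypothetical) instance name to its throughput samples:</span></p>

```python
def normalize_to_max(series_by_instance):
    """Scale every sample so the largest sample in the whole graph becomes 1.0."""
    peak = max(max(samples) for samples in series_by_instance.values())
    return {name: [s / peak for s in samples]
            for name, samples in series_by_instance.items()}


# Hypothetical throughput samples, not data from our runs.
print(normalize_to_max({"i-a": [50, 100], "i-b": [25, 75]}))
```

<p><span>Because the peak is taken per graph, a value of 0.5 on one graph and 0.5 on another can correspond to very different absolute throughput.</span></p>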
<p class="emptyP"><span></span></p>
<p><span>To my mind, this is the most remarkable chart in the entire investigation, because no two graphs look alike. If throughput were stable across instances and over time, each graph would be a single thin bar. Instead, we see collections of horizontal lines (indicating performance that varies from instance to instance, but is steady over time); widely spaced horizontal lines (more variance between instances); wiggly lines (performance oscillating within a stable band); wild swings; gradual upward or downward motion; high-performing outliers; low-performing outliers; bimodal distributions; and more.</span></p>
<p class="emptyP"><span></span></p>
<p><span>There do seem to be a few general trends. EBS is noisier than ephemeral (EC2 instance) disk, which is not surprising, since EBS has more moving parts. And on EBS, write performance is more variable than read performance. (Which is not to say that EBS reads are more consistently fast. It&#8217;s more that reads are consistently slow, while EBS writes are usually-but-not-reliably faster.)</span></p>
<p class="emptyP"><span></span></p>
<p><span>We see confirmation here that in some cases there are “good” and “bad” instances, but we don’t have enough data to determine whether this is stable over long periods of time &#8212; whether good instances remain good, and bad instances remain bad.</span></p>
<p class="emptyP"><span style="font-style:italic;"></span></p>
<p><img height="1156" src="http://scalyr.files.wordpress.com/2012/10/iobench_throughputovertime.png?w=711&#038;h=1156" width="711" /></p>
<p class="emptyP"><span></span></p>
<h3><a></a><span>Single-threaded latency</span></h3>
<p><span>This chart shows operation latency. Most of the results we’ve examined use a thread count that optimizes throughput, but here we use a single thread, minimizing latency. The X axis indicates which instance setup was used (see the legend at the bottom of the chart), and the Y axis shows latency in milliseconds. Values are based on aggregate performance across 30 instances. The left column shows median, mean, and 90th percentile latency; the right column shows 99th and 99.9th percentile. I don’t present results for bulk transfers (read4M and write4M), as these are inherently throughput-oriented.</span></p>
<p class="emptyP"><span></span></p>
<p style="text-align:center;"><img height="634" src="http://scalyr.files.wordpress.com/2012/10/iobench_latencybyinstancetype.png?w=713&#038;h=634" width="713" /></p>
<p class="emptyP"><span></span></p>
<p><span>We can see that small reads (read4K) take roughly 10ms on every machine configuration. This makes sense, as the benchmark is designed to force a disk seek for every read, and seek time is likely to dominate other factors such as network latency. (There is a slight decrease in read latency on larger instances. That’s probably a caching artifact &#8212; m1.xlarge instances have enough RAM to cache roughly 20% of our 80GB file.)</span></p>
<p class="emptyP"><span></span></p>
<p><span>The write4K results seem difficult to explain. For instance, why do larger instances show such drastically lower latency? And how can writes possibly be made durable so quickly, especially for EBS where a network hop is involved?</span></p>
<p class="emptyP"><span style="font-style:italic;"></span></p>
<p><span>For write4K/64, the median and 90th percentile latencies hug the floor, which makes sense as over 98% of these operations are not synchronously flushed and hence don’t wait for disk. The mean and higher percentiles follow roughly the same pattern as write4K.</span></p>
<p class="emptyP"><span></span></p>
<p><span>For a final bit of fun, let’s look at a detailed histogram for one benchmark:</span></p>
<p class="emptyP"><span></span></p>
<p style="text-align:center;"><img height="305" src="http://scalyr.files.wordpress.com/2012/10/iobench_histogram.png?w=663&#038;h=305" width="663" /></p>
<p><span>This shows latency for write4K operations on the xlarge/ephemeral configuration. The horizontal axis shows latency (log scale), and the vertical axis shows the number of individual operations with that particular latency. Each spike presumably indicates a distinct scenario &#8212; cache hit; cache miss; I/O contention with other tenants of the physical machine; etc. Clearly, several mechanisms are coming into play, but it’s not obvious to me what they all might be.</span></p>
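<p><span>For readers who want to build a similar histogram from their own latency samples, here is a minimal sketch of log-scale bucketing. The bucket width (four buckets per decade) is an assumption chosen for illustration, not the setting used for the chart above, and all samples must be positive.</span></p>

```python
import math
from collections import Counter


def log_histogram(latencies_ms, buckets_per_decade=4):
    """Count samples in buckets that are evenly spaced on a log10 axis."""
    counts = Counter()
    for value in latencies_ms:
        index = math.floor(math.log10(value) * buckets_per_decade)
        counts[index] += 1
    # Map each bucket index back to its lower edge in milliseconds.
    return {10 ** (i / buckets_per_decade): n for i, n in sorted(counts.items())}


# Hypothetical samples in milliseconds, not data from our runs.
for lower_edge, count in log_histogram([0.5, 2, 3, 9, 20, 50, 200]).items():
    print(f"{lower_edge:8.2f} ms  {'#' * count}")
```

<p><span>On a log axis, mechanisms whose latencies differ by an order of magnitude show up as separate spikes instead of being squashed into the first few linear buckets.</span></p>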
<h3><a></a><span>Methodology</span></h3>
<p><span>“I/O performance” is a complex topic, involving the filesystem, device drivers, disk controllers, physical disk mechanisms, several levels of caching, buffering, and command queuing, etc. Caching aside, the two most important factors are seek time and bandwidth.</span></p>
<p class="emptyP"><span></span></p>
<p><span>It’s important to remember that these are independent. A storage system can have high bandwidth but poor seek time, or the reverse. For an extreme example, consider your DVD collection. (Some of you must remember DVDs.) A “seek” involves walking over to the shelf, grabbing a disk, inserting it into the player, and waiting for it to load &#8212; a long time! But once the disk has loaded, the player can stream data at fairly high bandwidth. At the opposite extreme, early generation digital camera memory cards had fast “seek” times, but limited bandwidth.</span></p>
<p class="emptyP"><span></span></p>
<p><span>With all this in mind, we performed two sets of benchmarks, each structured as follows:</span></p>
<p class="emptyP"><span></span></p>
<p><span>1. Allocate a number of identical EC2 instances. The remaining steps are executed in parallel on each instance.</span></p>
<p class="emptyP"><span></span></p>
<p><span>2. Create a single 80GB disk file, populated with random data. (80GB should be large enough to minimize cache effects, ensuring that we are measuring the performance of the underlying I/O system. Note that AWS may perform caching at a level we can’t control, so filesystem or kernel flags to disable caching are not sufficient. An 80GB file is our “nuke the site from orbit” approach to disabling caches.)</span></p>
<p class="emptyP"><span></span></p>
<p><span>3. Spin up a number of threads (T), each of which runs in a tight loop for a specified duration. For each pass through the loop, we select a random position in the file, synchronously read or write a fixed number of bytes at that position, and record the elapsed time for that operation.</span></p>
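<p><span>The loop in step 3 can be sketched roughly as follows. This is a simplified Python illustration, not the benchmark’s actual source: the function name and the tiny temporary file are assumptions, and a file this small will be served from cache, so the numbers it prints say nothing about real disks.</span></p>

```python
import os
import random
import tempfile
import threading
import time


def run_read_benchmark(path, block_size, threads, duration_s):
    """Random synchronous reads from `threads` workers; returns latencies in seconds."""
    blocks = os.path.getsize(path) // block_size
    latencies = []
    lock = threading.Lock()
    deadline = time.monotonic() + duration_s

    def worker():
        local = []
        # Each worker opens its own handle so seeks do not interfere.
        with open(path, "rb", buffering=0) as f:
            while deadline > time.monotonic():
                offset = random.randrange(blocks) * block_size  # block-aligned
                start = time.perf_counter()
                f.seek(offset)
                f.read(block_size)
                local.append(time.perf_counter() - start)
        with lock:
            latencies.extend(local)

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return latencies


# Tiny demo file; the real tests used an 80GB file to defeat caching.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(64 * 4096))
results = run_read_benchmark(tmp.name, block_size=4096, threads=4, duration_s=0.2)
os.unlink(tmp.name)
print(len(results), "reads completed")
```

<p><span>Throughput for a run is the number of completed operations divided by the duration, and the recorded times feed the latency percentiles.</span></p>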
<p class="emptyP"><span></span></p>
<p><span>Step 3 is repeated multiple times, for various combinations of threadcount and operation. The operation is one of the following:</span></p>
<p class="emptyP"><span></span></p>
<ol start="1" style="list-style-type:disc;margin:0;padding:0;">
<li style="margin-left:36pt;padding-left:0;"><span>read4K: read 4KB, at a 4KB-aligned position.</span></li>
<li style="margin-left:36pt;padding-left:0;"><span>read4M: read 4MB, at a 4MB-aligned position.</span></li>
<li style="margin-left:36pt;padding-left:0;"><span>write4K: write 4KB of random data, at a 4KB-aligned position.</span></li>
<li style="margin-left:36pt;padding-left:0;"><span>write4M: write 4MB of random data, at a 4MB-aligned position.</span></li>
<li style="margin-left:36pt;padding-left:0;"><span>write4K/64: like write4K, but with fewer flushes (see below).</span></li>
</ol>
<p class="emptyP"><span></span></p>
<p><span>For write4K and write4M, the file was opened in writeFlush mode (each write is synchronously flushed to disk). For write4K/64, the file was opened in write mode (no synchronous flush), but after each write, we perform a flush with probability 1/64. In other words, for write4K/64, we allow writes to flow into the buffer cache and then occasionally flush them.</span></p>
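<p><span>Here is a minimal sketch of the two write modes. It uses an explicit fsync after the write to stand in for writeFlush mode, which is an assumption on my part for illustration; the helper name and the temporary file are also hypothetical.</span></p>

```python
import os
import random
import tempfile

FLUSH_PROBABILITY = 1 / 64


def write_block(fd, offset, data, flush_probability, rng=random):
    """Write `data` at `offset`, then fsync with the given probability.

    flush_probability=1.0 approximates writeFlush mode; 1/64 gives write4K/64.
    Returns True if this write was followed by a flush.
    """
    os.lseek(fd, offset, os.SEEK_SET)
    os.write(fd, data)
    if flush_probability > rng.random():
        os.fsync(fd)
        return True
    return False


rng = random.Random(0)
fd, path = tempfile.mkstemp()
flushed = sum(write_block(fd, i * 4096, os.urandom(4096), FLUSH_PROBABILITY, rng)
              for i in range(640))
os.close(fd)
os.unlink(path)
print(flushed, "of 640 writes were followed by a flush")  # about 10 on average
```

<p><span>With a flush probability of 1/64, 63 of every 64 writes (about 98.4%) return without waiting on a flush.</span></p>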
<p class="emptyP"><span></span></p>
<p><span>All of this is repeated for eight different EC2 configurations. The configurations differ in EC2 instance type (m1.small, m1.medium, m1.large, or m1.xlarge), and disk arrangement:</span></p>
<p class="emptyP"><span></span></p>
<ol start="1" style="list-style-type:disc;margin:0;padding:0;">
<li style="margin-left:36pt;padding-left:0;"><span>“Ephemeral” &#8212; drives associated with an EC2 instance. For instance types with multiple instance drives (e.g. m1.large), the drives were joined using RAID0.</span></li>
<li style="margin-left:36pt;padding-left:0;"><span>“EBS” &#8212; a single EBS volume.</span></li>
<li style="margin-left:36pt;padding-left:0;"><span>“EBS4” &#8212; four EBS volumes, joined using RAID0.</span></li>
</ol>
<p class="emptyP"><span></span></p>
<p><span>We tested eight of the twelve possible combinations. I will refer to these using a shorthand, such as “small/ebs4” for an m1.small instance with four EBS volumes in a RAID0 arrangement, or “xlarge/ephemeral” for an m1.xlarge instance with its instance drives also in RAID0.</span></p>
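<p><span>For reference, the shorthand names can be enumerated as below. The list of tested configurations is taken from the threadcount table later in this post; the variable names are just for illustration.</span></p>

```python
from itertools import product

INSTANCE_TYPES = ["small", "medium", "large", "xlarge"]
DISK_ARRANGEMENTS = ["ephemeral", "ebs", "ebs4"]

# The eight combinations that appear in the threadcount table of this post.
TESTED = {
    "small/ephemeral", "small/ebs", "small/ebs4",
    "medium/ephemeral",
    "large/ephemeral", "large/ebs", "large/ebs4",
    "xlarge/ephemeral",
}

all_configs = [f"{size}/{disk}"
               for size, disk in product(INSTANCE_TYPES, DISK_ARRANGEMENTS)]
untested = [c for c in all_configs if c not in TESTED]
print(len(all_configs), len(TESTED), untested)
# 12 8 ['medium/ebs', 'medium/ebs4', 'xlarge/ebs', 'xlarge/ebs4']
```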
<p class="emptyP"><span></span></p>
<p><span>For the first set of benchmarks, 30 instances of each configuration were used &#8212; a total of 240 instances. Each instance performed a series of 42 two-minute benchmark runs:</span></p>
<p class="emptyP"><span></span></p>
<ol start="1" style="list-style-type:disc;margin:0;padding:0;">
<li style="margin-left:36pt;padding-left:0;"><span>read4K: 10 separate runs, one each with T (threadcount) 1, 2, 4, 8, 12, 16, 24, 32, 48, and 64.</span></li>
<li style="margin-left:36pt;padding-left:0;"><span>read4M, write4K, write4M, write4K/64: 8 runs each, with T = 1, 2, 3, 4, 6, 8, 12, and 16.</span></li>
</ol>
<p class="emptyP"><span></span></p>
<p><span>Each instance performed these 42 runs in a different (random) order. This benchmark was primarily intended to explore how performance varies with threadcount.</span></p>
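<p><span>The run schedule for one instance can be sketched like this. The threadcounts come from the list above; the function name and the use of Python’s shuffle are assumptions for illustration.</span></p>

```python
import random

READ4K_THREADS = [1, 2, 4, 8, 12, 16, 24, 32, 48, 64]
OTHER_THREADS = [1, 2, 3, 4, 6, 8, 12, 16]
OTHER_OPS = ["read4M", "write4K", "write4M", "write4K/64"]


def build_schedule(rng):
    """All (operation, threadcount) runs for one instance, in random order."""
    runs = [("read4K", t) for t in READ4K_THREADS]
    runs += [(op, t) for op in OTHER_OPS for t in OTHER_THREADS]
    rng.shuffle(runs)
    return runs


schedule = build_schedule(random.Random())
print(len(schedule), "runs,", len(schedule) * 2, "minutes per instance")
# 42 runs, 84 minutes per instance
```

<p><span>Randomizing the order per instance keeps any time-dependent effect (a noisy neighbor, a warming cache) from lining up with one particular threadcount across the whole fleet.</span></p>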
<p class="emptyP"><span></span></p>
<p><span>For the second set of benchmarks, 45 instances of each configuration were used &#8212; 360 instances in all. Each instance performed a series of 5 ten-minute benchmark runs: one for each of read4K, read4M, etc. Here, we used whatever threadcount was found, in the earlier benchmarks, to yield optimal throughput for that configuration and operation type. This benchmark was intended to provide a lower-variance view of performance across instances and instance types.</span></p>
<p class="emptyP"><span></span></p>
<p><span>Here are the threadcounts used in the second run. (Note that we incorporated a slight bias toward smaller threadcounts: we used the smallest value that yielded throughput within 5% of the maximum.)</span></p>
<table cellpadding="0" cellspacing="0" style="border-collapse:collapse;">
<tbody>
<tr>
<td style="border-color:#000000;border-style:solid;border-width:1pt;padding:5pt;vertical-align:top;width:128.4pt;">
<p class="emptyP"><span style="font-size:11pt;"></span></p>
</td>
<td style="border-color:#000000;border-style:solid;border-width:1pt;padding:5pt;vertical-align:top;width:69.1pt;">
<p><span>read4K</span></p>
</td>
<td style="border-color:#000000;border-style:solid;border-width:1pt;padding:5pt;vertical-align:top;width:71.3pt;">
<p><span>read4M</span></p>
</td>
<td style="border-color:#000000;border-style:solid;border-width:1pt;padding:5pt;vertical-align:top;width:59.6pt;">
<p><span>write4K</span></p>
</td>
<td style="border-color:#000000;border-style:solid;border-width:1pt;padding:5pt;vertical-align:top;width:77.1pt;">
<p><span>write4K/64</span></p>
</td>
<td style="border-color:#000000;border-style:solid;border-width:1pt;padding:5pt;vertical-align:top;width:62.5pt;">
<p><span>write4M</span></p>
</td>
</tr>
<tr>
<td style="border-color:#000000;border-style:solid;border-width:1pt;padding:5pt;vertical-align:top;width:128.4pt;">
<p><span>small/ephemeral</span></p>
</td>
<td style="border-color:#000000;border-style:solid;border-width:1pt;padding:5pt;vertical-align:top;width:69.1pt;">
<p style="text-align:right;"><span>24</span></p>
</td>
<td style="border-color:#000000;border-style:solid;border-width:1pt;padding:5pt;vertical-align:top;width:71.3pt;">
<p style="text-align:right;"><span>1</span></p>
</td>
<td style="border-color:#000000;border-style:solid;border-width:1pt;padding:5pt;vertical-align:top;width:59.6pt;">
<p style="text-align:right;"><span>1</span></p>
</td>
<td style="border-color:#000000;border-style:solid;border-width:1pt;padding:5pt;vertical-align:top;width:77.1pt;">
<p style="text-align:right;"><span>16</span></p>
</td>
<td style="border-color:#000000;border-style:solid;border-width:1pt;padding:5pt;vertical-align:top;width:62.5pt;">
<p style="text-align:right;"><span>2</span></p>
</td>
</tr>
<tr>
<td style="border-color:#000000;border-style:solid;border-width:1pt;padding:5pt;vertical-align:top;width:128.4pt;">
<p><span>small/ebs</span></p>
</td>
<td style="border-color:#000000;border-style:solid;border-width:1pt;padding:5pt;vertical-align:top;width:69.1pt;">
<p style="text-align:right;"><span>48</span></p>
</td>
<td style="border-color:#000000;border-style:solid;border-width:1pt;padding:5pt;vertical-align:top;width:71.3pt;">
<p style="text-align:right;"><span>6</span></p>
</td>
<td style="border-color:#000000;border-style:solid;border-width:1pt;padding:5pt;vertical-align:top;width:59.6pt;">
<p style="text-align:right;"><span>3</span></p>
</td>
<td style="border-color:#000000;border-style:solid;border-width:1pt;padding:5pt;vertical-align:top;width:77.1pt;">
<p style="text-align:right;"><span>2</span></p>
</td>
<td style="border-color:#000000;border-style:solid;border-width:1pt;padding:5pt;vertical-align:top;width:62.5pt;">
<p style="text-align:right;"><span>3</span></p>
</td>
</tr>
<tr>
<td style="border-color:#000000;border-style:solid;border-width:1pt;padding:5pt;vertical-align:top;width:128.4pt;">
<p><span>small/ebs4</span></p>
</td>
<td style="border-color:#000000;border-style:solid;border-width:1pt;padding:5pt;vertical-align:top;width:69.1pt;">
<p style="text-align:right;"><span>32</span></p>
</td>
<td style="border-color:#000000;border-style:solid;border-width:1pt;padding:5pt;vertical-align:top;width:71.3pt;">
<p style="text-align:right;"><span>16</span></p>
</td>
<td style="border-color:#000000;border-style:solid;border-width:1pt;padding:5pt;vertical-align:top;width:59.6pt;">
<p style="text-align:right;"><span>12</span></p>
</td>
<td style="border-color:#000000;border-style:solid;border-width:1pt;padding:5pt;vertical-align:top;width:77.1pt;">
<p style="text-align:right;"><span>2</span></p>
</td>
<td style="border-color:#000000;border-style:solid;border-width:1pt;padding:5pt;vertical-align:top;width:62.5pt;">
<p style="text-align:right;"><span>1</span></p>
</td>
</tr>
<tr>
<td style="border-color:#000000;border-style:solid;border-width:1pt;padding:5pt;vertical-align:top;width:128.4pt;">
<p><span>medium/ephemeral</span></p>
</td>
<td style="border-color:#000000;border-style:solid;border-width:1pt;padding:5pt;vertical-align:top;width:69.1pt;">
<p style="text-align:right;"><span>24</span></p>
</td>
<td style="border-color:#000000;border-style:solid;border-width:1pt;padding:5pt;vertical-align:top;width:71.3pt;">
<p style="text-align:right;"><span>1</span></p>
</td>
<td style="border-color:#000000;border-style:solid;border-width:1pt;padding:5pt;vertical-align:top;width:59.6pt;">
<p style="text-align:right;"><span>1</span></p>
</td>
<td style="border-color:#000000;border-style:solid;border-width:1pt;padding:5pt;vertical-align:top;width:77.1pt;">
<p style="text-align:right;"><span>6</span></p>
</td>
<td style="border-color:#000000;border-style:solid;border-width:1pt;padding:5pt;vertical-align:top;width:62.5pt;">
<p style="text-align:right;"><span>2</span></p>
</td>
</tr>
<tr>
<td style="border-color:#000000;border-style:solid;border-width:1pt;padding:5pt;vertical-align:top;width:128.4pt;">
<p><span>large/ephemeral</span></p>
</td>
<td style="border-color:#000000;border-style:solid;border-width:1pt;padding:5pt;vertical-align:top;width:69.1pt;">
<p style="text-align:right;"><span>48</span></p>
</td>
<td style="border-color:#000000;border-style:solid;border-width:1pt;padding:5pt;vertical-align:top;width:71.3pt;">
<p style="text-align:right;"><span>3</span></p>
</td>
<td style="border-color:#000000;border-style:solid;border-width:1pt;padding:5pt;vertical-align:top;width:59.6pt;">
<p style="text-align:right;"><span>4</span></p>
</td>
<td style="border-color:#000000;border-style:solid;border-width:1pt;padding:5pt;vertical-align:top;width:77.1pt;">
<p style="text-align:right;"><span>1</span></p>
</td>
<td style="border-color:#000000;border-style:solid;border-width:1pt;padding:5pt;vertical-align:top;width:62.5pt;">
<p style="text-align:right;"><span>3</span></p>
</td>
</tr>
<tr>
<td style="border-color:#000000;border-style:solid;border-width:1pt;padding:5pt;vertical-align:top;width:128.4pt;">
<p><span>large/ebs</span></p>
</td>
<td style="border-color:#000000;border-style:solid;border-width:1pt;padding:5pt;vertical-align:top;width:69.1pt;">
<p style="text-align:right;"><span>8</span></p>
</td>
<td style="border-color:#000000;border-style:solid;border-width:1pt;padding:5pt;vertical-align:top;width:71.3pt;">
<p style="text-align:right;"><span>12</span></p>
</td>
<td style="border-color:#000000;border-style:solid;border-width:1pt;padding:5pt;vertical-align:top;width:59.6pt;">
<p style="text-align:right;"><span>4</span></p>
</td>
<td style="border-color:#000000;border-style:solid;border-width:1pt;padding:5pt;vertical-align:top;width:77.1pt;">
<p style="text-align:right;"><span>4</span></p>
</td>
<td style="border-color:#000000;border-style:solid;border-width:1pt;padding:5pt;vertical-align:top;width:62.5pt;">
<p style="text-align:right;"><span>2</span></p>
</td>
</tr>
<tr>
<td style="border-color:#000000;border-style:solid;border-width:1pt;padding:5pt;vertical-align:top;width:128.4pt;">
<p><span>large/ebs4</span></p>
</td>
<td style="border-color:#000000;border-style:solid;border-width:1pt;padding:5pt;vertical-align:top;width:69.1pt;">
<p style="text-align:right;"><span>16</span></p>
</td>
<td style="border-color:#000000;border-style:solid;border-width:1pt;padding:5pt;vertical-align:top;width:71.3pt;">
<p style="text-align:right;"><span>4</span></p>
</td>
<td style="border-color:#000000;border-style:solid;border-width:1pt;padding:5pt;vertical-align:top;width:59.6pt;">
<p style="text-align:right;"><span>16</span></p>
</td>
<td style="border-color:#000000;border-style:solid;border-width:1pt;padding:5pt;vertical-align:top;width:77.1pt;">
<p style="text-align:right;"><span>12</span></p>
</td>
<td style="border-color:#000000;border-style:solid;border-width:1pt;padding:5pt;vertical-align:top;width:62.5pt;">
<p style="text-align:right;"><span>4</span></p>
</td>
</tr>
<tr>
<td style="border-color:#000000;border-style:solid;border-width:1pt;padding:5pt;vertical-align:top;width:128.4pt;">
<p><span>xlarge/ephemeral</span></p>
</td>
<td style="border-color:#000000;border-style:solid;border-width:1pt;padding:5pt;vertical-align:top;width:69.1pt;">
<p style="text-align:right;"><span>32</span></p>
</td>
<td style="border-color:#000000;border-style:solid;border-width:1pt;padding:5pt;vertical-align:top;width:71.3pt;">
<p style="text-align:right;"><span>2</span></p>
</td>
<td style="border-color:#000000;border-style:solid;border-width:1pt;padding:5pt;vertical-align:top;width:59.6pt;">
<p style="text-align:right;"><span>12</span></p>
</td>
<td style="border-color:#000000;border-style:solid;border-width:1pt;padding:5pt;vertical-align:top;width:77.1pt;">
<p style="text-align:right;"><span>1</span></p>
</td>
<td style="border-color:#000000;border-style:solid;border-width:1pt;padding:5pt;vertical-align:top;width:62.5pt;">
<p style="text-align:right;"><span>6</span></p>
</td>
</tr>
</tbody>
</table>
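<p><span>The selection rule behind this table (the smallest threadcount within 5% of the best throughput) can be sketched as follows. The function name and the sample numbers are hypothetical.</span></p>

```python
def pick_threadcount(throughput_by_threads, tolerance=0.05):
    """Smallest threadcount whose throughput is within `tolerance` of the best."""
    best = max(throughput_by_threads.values())
    threshold = best * (1.0 - tolerance)
    return min(t for t, tput in throughput_by_threads.items() if tput >= threshold)


# Hypothetical measurements (operations/second), not data from our runs.
sample = {1: 400.0, 2: 700.0, 4: 960.0, 8: 1000.0, 16: 990.0}
print(pick_threadcount(sample))  # 4
```

<p><span>In this made-up example, 8 threads gives the highest throughput, but 4 threads is within 5% of it, so 4 is chosen.</span></p>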
<p class="emptyP"><span style="font-size:11pt;"></span></p>
<p><span style="color:#222222;">All tests used ext3, with the default block size (4KB) and noatime option. RAID0 configurations used the default 512KB chunk size. No special attempt at filesystem tuning was made. We used the default Amazon Linux AMIs (ami-</span><span>41814f28 and ami-1b814f72).</span></p>
<p class="emptyP"><span></span></p>
<p><span>This work was done shortly before Amazon introduced solid-state storage, EBS-optimized instances, and provisioned IOPS. We may examine these options in a followup post.</span></p>
<h3><a></a><span>Thoughts on Benchmark Quality</span></h3>
<p><span>Since the dawn of time, </span><span style="font-style:italic;">repeatability</span><span>&nbsp;has been a critical topic for benchmarks. If you run the same benchmark twice, you often get different results. Cache state, background tasks, disk hardware glitches, and a thousand other variables come into play. For this reason, it’s always been good practice to run a benchmark several times and take the median result.</span></p>
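<p><span>The traditional practice looks something like the sketch below. The helper name and the default of five repeats are assumptions for illustration.</span></p>

```python
import statistics
import time


def median_runtime(fn, repeats=5):
    """Run `fn` several times and return the median wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


print(median_runtime(lambda: sum(range(100_000))))
```

<p><span>The median discards the occasional slow run, which is fine on one machine but does nothing about differences between machines.</span></p>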
<p class="emptyP"><span></span></p>
<p><span>In the cloud, this tendency is vastly amplified. Remember that histogram of bulk write throughput across EBS instances? The variation from best to worst instance wasn’t a few percent; it was thirty to one. Thirty to one! This means that a single-machine test could easily report that small instances are faster than xlarge instances, or any other sort of nonsense.</span></p>
<p class="emptyP"><span></span></p>
<p><span>Under these circumstances, for a cloud benchmark to have any validity, it <b>must</b> include data from many machine instances. Single-machine data is </span><span style="text-decoration:underline;">worse than useless</span><span>; you simply don’t know whether you’re measuring application behavior, or random cloud background noise. Don’t just take single-machine cloud benchmarks with a grain of salt: ignore them entirely.</span></p>
<p class="emptyP"><span></span></p>
<p><span>It’s also important to be very clear on what your benchmark is measuring. Application-level benchmarks are complex, and hard to generalize. Knowing how long a system takes to build the Linux kernel doesn’t tell me much about how it will handle MySQL. For that matter, MySQL performance on your workload may not say much about MySQL performance on my workload. So the gold standard is always to perform your own tests, using your actual application, under your actual workload. But failing that, the benchmarks with the greatest general applicability are those that measure basic system properties, such as I/O bandwidth.</span></p>
<p class="emptyP"><span></span></p>
<p><span>With all this in mind, I’d like to propose a set of guidelines, which I’ll call “Cloud Truth”, for cloud benchmarks. (By analogy to “ground truth”, which refers to the process of going to a location and taking direct measurements of some property, to calibrate or verify satellite images or other remote sensing. More broadly, “ground truth” represents the most direct possible measurement.) For a result to be considered Cloud Truth, it should:</span></p>
<p class="emptyP"><span></span></p>
<ol start="1" style="list-style-type:disc;margin:0;padding:0;">
<li style="margin-left:36pt;padding-left:0;"><span>Directly measure basic system properties</span></li>
<li style="margin-left:36pt;padding-left:0;"><span>Include measurements from many instances</span></li>
<li style="margin-left:36pt;padding-left:0;"><span>Include measurements taken over a substantial period of time</span></li>
<li style="margin-left:36pt;padding-left:0;"><span>Use clearly explained, reproducible methods (preferably including source code)</span></li>
<li style="margin-left:36pt;padding-left:0;"><span>Clearly explain what was measured, and how</span></li>
<li style="margin-left:36pt;padding-left:0;"><span>Clearly explain how the results were averaged, aggregated, or otherwise processed</span></li>
</ol>
<p class="emptyP"><span></span></p>
<p><span>This is not easy. (The results I’m presenting here fall short on “substantial period of time”.) But when reporting cloud benchmarks, these are the standards we must aspire to.</span></p>
<h3><a></a><span>Limitations / Future Work</span></h3>
<p><span>Considerable effort and expense went into these benchmarks, but there is still plenty of room to explore further. Some areas we haven’t yet touched on:</span></p>
<p class="emptyP"><span></span></p>
<ol start="1" style="list-style-type:disc;margin:0;padding:0;">
<li style="margin-left:36pt;padding-left:0;"><span>How instances behave over long periods of time</span></li>
<li style="margin-left:36pt;padding-left:0;"><span>New AWS I/O options (solid-state storage, EBS-optimized instances, and provisioned IOPS)</span></li>
<li style="margin-left:36pt;padding-left:0;"><span>Tuning filesystem and RAID configuration</span></li>
<li style="margin-left:36pt;padding-left:0;"><span>Impact of EBS snapshots</span></li>
<li style="margin-left:36pt;padding-left:0;"><span>Variations across AWS zones and regions</span></li>
<li style="margin-left:36pt;padding-left:0;"><span>Other cloud providers</span></li>
<li style="margin-left:36pt;padding-left:0;"><span>Single-tenant and/or non-cloud machines</span></li>
</ol>
<h3><a></a><span>One Last Plug</span></h3>
<p><span>If you’ve read this far, you probably take your engineering seriously. And if so, you’d probably like working at Scalyr. Why not find out? </span><span><a href="https://www.scalyr.com/jobs">https://www.scalyr.com/jobs</a></span></p>
<p class="emptyP"><span></span></p>
<p class="emptyP"><span></span></p>
<p><span style="font-style:italic;">If there’s interest, we’ll publish the raw data and source code for these benchmarks. Drop us a line at </span><span style="font-style:italic;"><a href="mailto:contact@scalyr.com">contact@scalyr.com</a></span><span style="font-style:italic;">.</span></p>
<p class="emptyP"><span style="font-style:italic;"></span></p>
<p><span style="font-style:italic;">Thanks to </span><span style="color:#222222;font-style:italic;">Vibhu Mohindra, who did all the heavy lifting to implement and run these benchmarks. Also to Steven Czerwinski and Christian Stucchio for feedback on an early draft of this post.</span></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.scalyr.com/2012/10/16/a-systematic-look-at-ec2-io/feed/</wfw:commentRss>
		<slash:comments>38</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/7b1ce707b69e2718c05e97dff2dc6daf?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">scalyr</media:title>
		</media:content>

		<media:content url="http://scalyr.files.wordpress.com/2012/10/iobench_tangle1.png?w=329" medium="image" />

		<media:content url="http://scalyr.files.wordpress.com/2012/10/iobench_costeffectiveness.png" medium="image" />

		<media:content url="http://scalyr.files.wordpress.com/2012/10/iobench_costeffectivenessbulk.png" medium="image" />

		<media:content url="http://scalyr.files.wordpress.com/2012/10/iobench_raidspeedup.png" medium="image" />

		<media:content url="http://scalyr.files.wordpress.com/2012/10/iobench_instancesizespeedup.png" medium="image" />

		<media:content url="http://scalyr.files.wordpress.com/2012/10/iobench_badapples1.png" medium="image" />

		<media:content url="http://scalyr.files.wordpress.com/2012/10/iobench_badapples2.png" medium="image" />

		<media:content url="http://scalyr.files.wordpress.com/2012/10/iobench_throughputbythreadcount.png" medium="image" />

		<media:content url="http://scalyr.files.wordpress.com/2012/10/iobench_latencybyinstance.png" medium="image" />

		<media:content url="http://scalyr.files.wordpress.com/2012/10/iobench_throughputovertime.png" medium="image" />

		<media:content url="http://scalyr.files.wordpress.com/2012/10/iobench_latencybyinstancetype.png" medium="image" />

		<media:content url="http://scalyr.files.wordpress.com/2012/10/iobench_histogram.png" medium="image" />
	</item>
		<item>
		<title>Introducing Scalyr Logs</title>
		<link>http://blog.scalyr.com/2012/10/09/introducing-scalyr-logs/</link>
		<comments>http://blog.scalyr.com/2012/10/09/introducing-scalyr-logs/#comments</comments>
		<pubDate>Tue, 09 Oct 2012 23:35:29 +0000</pubDate>
		<dc:creator>scalyr</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.scalyr.com/?p=124</guid>
		<description><![CDATA[Today we’re excited to announce a pair of new services from Scalyr: Scalyr Logs&#160;is a new approach to server monitoring and analysis. Traditionally, this has been treated as a series of special-case problems: timeseries/graphing, log search, external monitoring, dashboards, alerting, exception tracking, performance analysis, etc. In my career, I’ve had to juggle too many tools [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.scalyr.com&#038;blog=30473437&#038;post=124&#038;subd=scalyr&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p><span>Today we’re excited to announce a pair of new services from Scalyr:</span></p>
<p class="emptyP"><span></span></p>
<p><span><a href="https://www.scalyr.com/logs">Scalyr Logs</a></span><span>&nbsp;is a new approach to server monitoring and analysis. Traditionally, this has been treated as a series of special-case problems: timeseries/graphing, log search, external monitoring, dashboards, alerting, exception tracking, performance analysis, etc. In my career, I’ve had to juggle too many tools in an attempt to get a complete picture of a system&#8217;s behavior &#8212; and been frustrated at the disconnected, patchwork result. I&#8217;ve spent far too many hours trying to figure out which graph explains why my pager went off, or which logs might help me understand why an error graph just spiked, or taking random peeks into log files because I don&#8217;t have a tool that can analyze them in the way I need.</span></p>
<p class="emptyP"><span></span></p>
<p><span>Scalyr Logs is a </span><span style="font-weight:bold;">unified</span><span>, </span><span style="font-weight:bold;">enlightened</span><span>&nbsp;solution for understanding server behavior. At its heart is a data warehouse for event streams. The warehouse can accept traditional data types such as timeseries data or log files, as well as structured data such as exception reports or custom events. All this can then be searched, graphed, histogrammed, and otherwise analyzed. You can define parsing rules to extract structured data from unstructured logs, and then apply the full tool suite to the result. All of this happens in realtime (incoming events are immediately available for querying) and at interactive speeds. Under the hood, we&#8217;re using ideas borrowed from projects like Google’s </span><span><a href="http://research.google.com/pubs/pub36632.html">Dremel</a></span><span>&nbsp;and </span><span><a href="http://research.google.com/pubs/pub36356.html">Dapper</a></span><span>, and developing new techniques for data management and indexing that adapt to your usage.</span></p>
<p class="emptyP"><span></span></p>
<p><span>Scalyr Logs isn’t just about power and flexibility; it’s also about ease of use. As a service, there’s no backend to install or manage. As a unified system, there’s less to set up and less to learn. As a web site, we’re constantly iterating on our user interface and feature set in response to users.</span></p>
<p class="emptyP"><span></span></p>
<p><span><a href="https://www.scalyr.com/graphs">Scalyr Graphs</a></span><span>&nbsp;is a subset of Scalyr Logs, focused on timeseries graphing, dashboards, and alerts. It can import data from existing tools like Graphite and OpenTSDB, as well as custom events through our API. Emphasizing speed, scalability, and ease of use, Scalyr Graphs is designed to be a quick and easy solution if you&#8217;re outgrowing your existing graphing system, tired of throwing hardware at the problem, tired of waiting for dashboards to load, or unwilling to take on the hassle of running your own graph servers.</span></p>
<p class="emptyP"><span></span></p>
<p><span>Click the links above to learn more. We look forward to changing your view of what server monitoring can be!</span></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.scalyr.com/2012/10/09/introducing-scalyr-logs/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/7b1ce707b69e2718c05e97dff2dc6daf?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">scalyr</media:title>
		</media:content>
	</item>
		<item>
		<title>The Azure Outage: Time Is a SPOF, Leap Day Doubly So</title>
		<link>http://blog.scalyr.com/2012/03/13/the-azure-outage-time-is-a-spof-leap-day-doubly-so/</link>
		<comments>http://blog.scalyr.com/2012/03/13/the-azure-outage-time-is-a-spof-leap-day-doubly-so/#comments</comments>
		<pubDate>Tue, 13 Mar 2012 15:36:37 +0000</pubDate>
		<dc:creator>scalyr</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.scalyr.com/?p=96</guid>
		<description><![CDATA[On the Scalyr blog, we sometimes post on topics relating to cloud computing in general. This is such a post. Microsoft’s Azure service suffered a widely publicized outage on February 28th / 29th. Microsoft recently published an excellent postmortem. For anyone trying to run a high-availability service, this incident can teach several important lessons. The [&#8230;]]]></description>
				<content:encoded><![CDATA[<p><span style="font-style:italic;">On the Scalyr blog, we sometimes post on topics relating to cloud computing in general. This is such a post.</span></p>
<p class="emptyP"><span></span></p>
<p><span>Microsoft’s Azure service suffered a widely publicized outage on February 28th / 29th. Microsoft recently published an excellent </span><span><a href="http://blogs.msdn.com/b/windowsazure/archive/2012/03/09/summary-of-windows-azure-service-disruption-on-feb-29th-2012.aspx">postmortem</a></span><span>. For anyone trying to run a high-availability service, this incident can teach several important lessons.</span></p>
<p class="emptyP"><span></span></p>
<p><span>The central lesson is that, no matter how much work you put into redundancy, problems will arise. Murphy is strong and, I might say, </span><span style="font-style:italic;">creative</span><span>; things go wrong. So preventative measures are important, but how you react to problems is just as important. It’s interesting to review the Azure incident in this light.</span></p>
<p class="emptyP"><span></span></p>
<p><span>The postmortem is worth reading in its entirety, but here’s a quick summary: each time Azure launches a new VM, it creates a “transfer certificate” to secure communications with that VM. There was a bug in the code that determines the certificate expiration date, such that all VMs launched on February 29th (Leap Day) were inoperable. Beginning at 4:00 PM PST on February 28th (12:00 AM February 29th GMT), all Azure clusters worldwide were unable to launch new VMs. In the face of repeated VM failures, Azure mistakenly decided that machines were physically broken, and attempted to migrate healthy VMs off of them, compounding the problem. Identifying the bug, fixing it, and pushing the new build required roughly 13 hours.</span></p>
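<p class="emptyP"><span></span></p>
<p><span>To make this class of bug concrete, here is a minimal Python sketch. It assumes, purely for illustration, that the expiration date is computed by adding one to the year of the issue date; the function names are hypothetical and this is not Azure’s code. The naive version fails for exactly one issue date every four years.</span></p>

```python
from datetime import date


def naive_expiry(issued):
    """Buggy: assumes the same month and day exist one year later."""
    # date.replace raises ValueError for Feb 29, because the
    # following year has no Feb 29.
    return issued.replace(year=issued.year + 1)


def safe_expiry(issued):
    """Fall back to Feb 28 when the anniversary date does not exist."""
    try:
        return issued.replace(year=issued.year + 1)
    except ValueError:
        return issued.replace(year=issued.year + 1, day=28)


if __name__ == "__main__":
    leap_day = date(2012, 2, 29)
    print("safe expiry:", safe_expiry(leap_day))
    try:
        naive_expiry(leap_day)
    except ValueError as err:
        print("naive expiry failed:", err)
```

<p><span>The fallback to February 28th is one possible policy among several; what matters is that the Leap Day case is handled deliberately rather than discovered in production.</span></p>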
<p class="emptyP"><span></span></p>
<p><span>The outage was quite embarrassing for Azure, but Microsoft comes off fairly well in the postmortem. The root cause was of the “it could happen to anyone” variety. The calendar bug was dumb, but it’s the sort of subtle, one-off dumbness that can happen even to good engineers at good companies.</span></p>
<h3><a></a><span>Time is a single point of failure</span></h3>
<p><span>It is a commonplace that things will go wrong no matter how careful you are. This is why the mantra of reliable systems design is “no single point of failure” (SPOF) &#8212; use multiple power supplies, multiple copies of data, multiple servers, multiple data centers. Last April’s AWS outage was magnified by the fact that the supposedly-independent “availability zones” in Amazon’s us-east region turned out to share a SPOF in the EBS control servers. But Amazon’s other regions did not share control servers with us-east, and so were protected.</span></p>
<p class="emptyP"><span></span></p>
<p><span>I’m not especially familiar with Windows Azure, but it appears to follow good practice in this regard, using multiple data centers that are independent at both the hardware and software level. Yet the February 29th outage affected all regions. Why? Because the root cause was a bug that only manifests on leap days, and all regions entered Leap Day simultaneously. In other words, </span><span style="text-decoration:underline;">all regions share the same calendar, so the calendar is a SPOF</span><span>.</span></p>
<p class="emptyP"><span></span></p>
<p><span>This is hard to avoid (see: Y2K bug). You can distribute your data centers in space, but not in time; they’re all in the same “now”. (I see Dr. Einstein in the back&nbsp;raising an objection, but he’s out of order.) So how do you prevent time-related bugs from causing a global, correlated outage? There’s no great answer. You could run your servers on local time instead of GMT, but that’s messy, will probably cause more grief than it avoids, and at best it only spreads things out by a few hours. You could run a test cluster using a clock that’s set several days ahead, but that’s a lot of work to maintain.</span></p>
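<p class="emptyP"><span></span></p>
<p><span>A cheaper cousin of the clock-ahead test cluster is to make date-sensitive code take the current date as a parameter, then sweep it across the coming weeks in a test. The sketch below is only an illustration under that assumption; </span><span style="font-style:italic;">find_bad_dates</span><span> and </span><span style="font-style:italic;">one_year_later</span><span> are hypothetical names, and a sweep like this only catches failures that show up as exceptions.</span></p>

```python
from datetime import date, timedelta


def days_ahead(start, count):
    """Yield each date from start through start + (count - 1) days."""
    for offset in range(count):
        yield start + timedelta(days=offset)


def find_bad_dates(func, start, count):
    """Return the dates in the window on which func raises an exception."""
    bad = []
    for day in days_ahead(start, count):
        try:
            func(day)
        except Exception:
            bad.append(day)
    return bad


def one_year_later(day):
    # Hypothetical date-sensitive routine under test.
    return day.replace(year=day.year + 1)


if __name__ == "__main__":
    # Sweep a 60-day window starting February 1st, 2012.
    for day in find_bad_dates(one_year_later, date(2012, 2, 1), 60):
        print("fails on", day)
```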
<p class="emptyP"><span></span></p>
<p><span>You can, at least, avoid making major changes on Leap Day. More generally, avoid rocking the boat on </span><span style="text-decoration:underline;">any</span><span>&nbsp;unusual occasion. Many companies have a policy of not pushing new builds, performing maintenance, etc. near a major holiday. Leap days, daylight saving time transitions, and other calendar events may also be good occasions to leave things alone, as suggested in the </span><span><a href="http://news.ycombinator.com/item?id=3686950">Hacker News discussion</a></span><span>&nbsp;of the outage</span><span>. In this case, Microsoft was in the process of rolling out a new version of their server platform, which complicated the crisis.</span></p>
<h3><a></a><span>Response speed is critical</span></h3>
<p><span>You can’t always prevent problems, so it’s important that you quickly repair the problems that do occur. In this case, it took quite a while for Microsoft to sort things out. A timeline of the key events (all times PST):</span></p>
<p class="emptyP"><span></span></p>
<p><span>4:00 PM &#8212; bug first manifests; no new VMs can be created from this point.</span></p>
<p><span>5:15 PM &#8212; first wave of machines marked bad; alerts trigger.</span></p>
<p><span>6:38 PM &#8212; root cause identified.</span></p>
<p><span>10:00 PM &#8212; remediation plan complete.</span></p>
<p><span>11:20 PM &#8212; bugfix code ready.</span></p>
<p><span>1:50 AM &#8212; bugfix code tested in a test cluster; production rollout begins.</span></p>
<p><span>2:11 AM &#8212; fix completely pushed to one production cluster.</span></p>
<p><span>5:23 AM &#8212; fix pushed to most clusters, Microsoft announces that the majority of clusters are healthy again.</span></p>
<p class="emptyP"><span></span></p>
<p><span>In all, thirteen hours for what sounds like a one-line fix. If Microsoft had been able to respond more quickly, the impact could have been considerably reduced.</span></p>
<p class="emptyP"><span></span></p>
<p><span>From the outside, it’s hard to second-guess the details of Microsoft’s response. But it’s worth asking yourself: in an emergency, how long would it take for </span><span style="text-decoration:underline;">you</span><span>&nbsp;to produce a new build, run some basic tests, and push the fix into production? Protip: if you haven’t actually done it, you don’t know the answer. It’s a good idea to go through the exercise, and clearly document the precise steps involved, bearing in mind that your junior engineer may someday be following those instructions in a 3:00 AM daze.</span></p>
<h3><a></a><span>In a crisis, keep things simple</span></h3>
<p><span>When the crisis hit, Microsoft was almost done rolling out a new release of the server platform, but seven clusters had only just started deploying it. When pushing the fix, Microsoft decided to revert these clusters to the old release. This meant creating a build of the old release with the Leap Day bugfix. This build was done incorrectly, incorporating a mix of old and new components that did not work together. When Microsoft pushed the bad build, all servers in those seven clusters went offline.</span></p>
<p class="emptyP"><span></span></p>
<p><span>Given that most clusters were already running the new release, and even these seven clusters had already started receiving it, it might have been better to use the new release everywhere and avoid the extra work of building and testing a Leap Day fix for the old release. Perhaps some factor not mentioned in the postmortem ruled out this approach. But in general, actions performed during a crisis should be kept as simple as possible. Don’t make a new build if you can muddle through with a configuration tweak; don’t make two new builds if you can get by with one.</span></p>
<h3><a></a><span>Avoid compounding mistakes</span></h3>
<p><span>Because the Leap Day fix to the old release was felt to be safe, and the build had passed some quick tests, Microsoft decided to bypass their normal slow-roll procedure and “blast” it to all servers on all seven clusters simultaneously. The result was a catastrophic outage in those clusters, as well as “a number of servers … in corrupted states as a result of the various transitions”.</span></p>
<p class="emptyP"><span></span></p>
<p><span>In a crisis, there’s always a huge temptation to take shortcuts. And sometimes they’re necessary. But it’s important to keep a careful eye on the tradeoffs involved. In this case, at 2:47 AM, nine and a half hours after the first alerts fired, the team was probably running on caffeine and fumes. That point, with everyone exhausted and the finish line in sight, is when mistakes are most likely to happen. These mistakes can cause problems worse than the original incident. So it’s important not to race ahead too quickly, and to keep an eye on the risks involved in each action you take.</span></p>
<h3><a></a><span>Rate limit dangerous actions</span></h3>
<p><span>When a server crashes repeatedly, Azure marks the machine as bad and migrates VMs to other machines. The Leap Day bug caused this to happen to every machine that tried to launch a new VM, causing healthy VMs to be migrated off of those machines and triggering a failure cascade.</span></p>
<p class="emptyP"><span></span></p>
<p><span>When a certain number of machines were marked bad, Azure entered an emergency mode and stopped attempting to migrate VMs. This is an excellent defensive measure; without it, the entire Azure platform might have gone down. Such a cascade effect was at the heart of the April AWS outage. Kudos to Microsoft for having a cap in place.</span></p>
<p class="emptyP"><span></span></p>
<p><span>There’s a general design principle here: </span><span style="font-weight:bold;">rate limit dangerous actions</span><span>. Marking a machine bad is potentially dangerous, as it reduces the cluster’s capacity and is disruptive to VMs on that server. Some bad servers are to be expected in normal operation, but if many servers are being marked bad then something deeper may be wrong, and it’s best to do nothing and request manual intervention.</span></p>
<p class="emptyP"><span></span></p>
<p><span>Another place this arises is data deletion. I have seen a major production service experience a bug that caused it to begin madly deleting database records. By the time someone noticed the problem, catastrophic damage had been done. The team was able to recover the data, but only through extreme effort and some luck. A cap, or at least an alert, on the rate of record deletion would have caught the problem much sooner.</span></p>
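<p class="emptyP"><span></span></p>
<p><span>As a rough illustration of the principle, the following Python sketch caps how many dangerous actions (marking a machine bad, deleting a record) may happen within a sliding time window. Once the cap is hit, it refuses everything until a human resets it. The class name, parameters, and the latch-until-reset policy are assumptions made for this example, not a description of Azure’s mechanism.</span></p>

```python
import time
from collections import deque


class DangerousActionLimiter:
    """Allow at most max_actions within any window_seconds span."""

    def __init__(self, max_actions, window_seconds, clock=time.monotonic):
        self.max_actions = max_actions
        self.window_seconds = window_seconds
        self.clock = clock
        self.recent = deque()  # timestamps of recently allowed actions
        self.tripped = False   # once True, stays True until reset() is called

    def allow(self):
        """Return True if the action may proceed, False if it must be blocked."""
        if self.tripped:
            return False
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self.recent and now - self.recent[0] >= self.window_seconds:
            self.recent.popleft()
        if len(self.recent) >= self.max_actions:
            # Too many actions too quickly: stop and wait for a human.
            self.tripped = True
            return False
        self.recent.append(now)
        return True

    def reset(self):
        """Manual intervention: clear history and re-enable actions."""
        self.recent.clear()
        self.tripped = False


if __name__ == "__main__":
    limiter = DangerousActionLimiter(max_actions=3, window_seconds=60.0)
    for record_id in range(5):
        if limiter.allow():
            print("deleting record", record_id)
        else:
            print("cap reached; refusing to delete record", record_id)
```

<p><span>A caller would check </span><span style="font-style:italic;">allow()</span><span> before each dangerous action and raise an alert the first time it returns False.</span></p>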
<h3><a></a><span>Milk each crisis for every lesson you can</span></h3>
<p><span>It’s </span><span>obvious that the root cause of a crisis &#8212; in this case, the faulty code for generating expiration dates &#8212; should be fixed. The Microsoft postmortem goes well beyond this, listing a dozen measures they have identified to better detect bugs before they trigger in production, increase the system’s resilience, improve their ability to repair problems quickly, and improve communication with customers during a crisis.</span></p>
<p class="emptyP"><span></span></p>
<p><span>Every crisis carries multiple lessons. Like Microsoft, you should attempt to learn as many as possible. The more you learn from each crisis, the fewer “educational” crises you’ll have to suffer through.</span></p>
<p class="emptyP"><span></span></p>
<p class="emptyP"><span></span></p>
<p><span style="font-style:italic;">If you enjoy no-fluff articles on practical techniques in cloud computing, distributed systems, and high availability, please </span><span style="font-style:italic;"><a href="http://feeds.feedburner.com/scalyr">subscribe to this blog</a></span><span style="font-style:italic;">.</span></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.scalyr.com/2012/03/13/the-azure-outage-time-is-a-spof-leap-day-doubly-so/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/7b1ce707b69e2718c05e97dff2dc6daf?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">scalyr</media:title>
		</media:content>
	</item>
		<item>
		<title>Transparency in Cloud Services</title>
		<link>http://blog.scalyr.com/2012/01/03/transparency-in-cloud-services/</link>
		<comments>http://blog.scalyr.com/2012/01/03/transparency-in-cloud-services/#comments</comments>
		<pubDate>Wed, 04 Jan 2012 01:57:15 +0000</pubDate>
		<dc:creator>scalyr</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.scalyr.com/?p=83</guid>
		<description><![CDATA[37signals recently launched public “Uptime Reports” for their applications (announcement). The reaction&#160;on Hacker News was rather tepid, but I think it’s a positive development, and I applaud 37signals for stepping forward. Reliability of cloud applications is a real concern, and there’s not nearly enough hard data out there. Not all products are equally reliable; even [&#8230;]]]></description>
				<content:encoded><![CDATA[<p><span>37signals recently launched public “</span><span><a href="http://basecamphq.com/uptime">Uptime Reports</a></span><span>” for their applications (</span><span><a href="http://37signals.com/svn/posts/3067-lets-get-honest-about-uptime">announcement</a></span><span>). The </span><span><a href="http://news.ycombinator.com/item?id=3416154">reaction</a></span><span>&nbsp;on Hacker News was rather tepid, but I think it’s a positive development, and I applaud 37signals for stepping forward. Reliability of cloud applications is a real concern, and there’s not nearly enough hard data out there. Not all products are equally reliable; even within 37signals, the new reports show a 3:1 variation in downtime across apps.</span></p>
<p class="emptyP"><span></span></p>
<p><span>That said, this is a fairly small step. For comparison, take a look at our </span><span><a href="https://www.scalyr.com/monitor">public monitoring dashboard</a></span><span>&nbsp;here at </span><span><a href="https://www.scalyr.com">Scalyr</a></span><span>. This is a work in progress, and not nearly as pretty as the 37signals page, but it contains far more information. As a user of a cloud application, I would like to know:</span></p>
<p class="emptyP"><span></span></p>
<ol start="1" style="list-style-type:decimal;margin:0;padding:0;">
<li style="margin-left:36pt;padding-left:0;"><span>Evaluation: how reliable is it?</span></li>
<li style="margin-left:36pt;padding-left:0;"><span>Diagnosis: if I’m experiencing a problem, is it at my end or their end?</span></li>
</ol>
<p class="emptyP"><span></span></p>
<p><span>The 37signals uptime reports begin to address the evaluation goal, by reporting uptime over the last 12 months. If you’re only allowed to ask for one number, that’s probably the right one. It’s certainly a lot better than nothing, which is what you get from most providers. But it leaves many unanswered questions.</span></p>
<p class="emptyP"><span></span></p>
<p><span>The heart of the problem is that “up” is not a binary variable. An application can be working for some users but not others. It might be flaky &#8212; say, failing 5% of requests, or taking an extra 10 seconds to respond &#8212; in a way that drives users nuts, but doesn’t trip an outage detector. Or a critical feature might be broken. These things all happen, and they go to the heart of evaluating the reliability of a cloud application. But they tend to not meet the definition of “downtime”.</span></p>
<p class="emptyP"><span></span></p>
<p><span>Then there’s the question of diagnosis. When an application starts misbehaving, is it because my network connection is flaky? Do I need to restart the browser? Or is it a problem at the provider’s end? An application health dashboard can help answer those questions, but only if it provides real-time data. The 37signals report, with one-day resolution, is not useful for diagnosis.</span></p>
<p class="emptyP"><span></span></p>
<p><span style="font-weight:bold;">Transparency in backend services</span></p>
<p class="emptyP"><span></span></p>
<p><span>At Scalyr, we’re building backend services, not user-facing applications. In a backend service, transparency is doubly important. Problems that a human user would consider a minor annoyance can wreak havoc on a downstream service. For instance, suppose latency jumps from 100ms to 400ms. This could cause the downstream service to have four times as many requests in flight, increasing memory usage by 4x. If the extra memory isn’t available, the server might crash, turning a minor latency hiccup into a complete outage.</span></p>
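<p class="emptyP"><span></span></p>
<p><span>The arithmetic behind that 4x figure is Little’s law: the average number of requests in flight equals the arrival rate multiplied by the average latency. A small Python sketch, with a request rate of 1,000 per second chosen purely as an assumption for illustration:</span></p>

```python
def requests_in_flight(arrival_rate_per_sec, latency_ms):
    """Little's law: average concurrency = arrival rate * average latency."""
    return arrival_rate_per_sec * latency_ms / 1000


if __name__ == "__main__":
    rate = 1000  # assumed steady request rate, in requests per second
    before = requests_in_flight(rate, 100)
    after = requests_in_flight(rate, 400)
    print("in flight at 100 ms:", before)
    print("in flight at 400 ms:", after)
    print("growth factor:", after / before)
```

<p><span>The growth factor does not depend on the assumed rate: at a fixed arrival rate, concurrency scales linearly with latency, and so does any per-request memory.</span></p>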
<p class="emptyP"><span></span></p>
<p><span>This may sound extreme, but such things happen. So as a user of a backend service, I want a lot more than just an annual uptime statistic. I want to see latency histograms and error rates over time, at fine granularity, for each operation the service provides. This is the kind of information you’ll find on the Scalyr dashboard.</span></p>
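<p class="emptyP"><span></span></p>
<p><span>For readers who have not built one, here is a minimal Python sketch of how a latency histogram and an error rate might be computed from raw request records. The bucket boundaries and function names are assumptions chosen for the example; this is not the code behind the Scalyr dashboard.</span></p>

```python
from bisect import bisect_left
from collections import Counter

# Assumed bucket upper bounds, in milliseconds (inclusive).
BUCKET_BOUNDS_MS = [10, 50, 100, 500, 1000]


def bucket_label(latency_ms):
    """Map a latency to the label of the first bucket that can hold it."""
    index = bisect_left(BUCKET_BOUNDS_MS, latency_ms)
    if index == len(BUCKET_BOUNDS_MS):
        return "over " + str(BUCKET_BOUNDS_MS[-1]) + "ms"
    return "up to " + str(BUCKET_BOUNDS_MS[index]) + "ms"


def summarize(requests):
    """requests: iterable of (latency_ms, succeeded) pairs.

    Returns (histogram, error_rate), where histogram is a Counter
    keyed by bucket label and error_rate is a fraction from 0 to 1.
    """
    histogram = Counter()
    total = 0
    errors = 0
    for latency_ms, succeeded in requests:
        histogram[bucket_label(latency_ms)] += 1
        total += 1
        if not succeeded:
            errors += 1
    error_rate = errors / total if total else 0.0
    return histogram, error_rate


if __name__ == "__main__":
    sample = [(5, True), (90, True), (120, True), (4000, False)]
    histogram, error_rate = summarize(sample)
    for label, count in histogram.items():
        print(label, count)
    print("error rate:", error_rate)
```

<p><span>Running this per operation and per time slice (say, once a minute) gives the fine-grained view described above.</span></p>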
<p class="emptyP"><span></span></p>
<p><span>With backend services, diagnosis is also much more important. If you’re relying on multiple internal and external services, you need a way to quickly narrow down problems. A real-time monitoring dashboard for each service is invaluable. Happily, the data needed is the same &#8212; latency histograms and error rates.</span></p>
<p class="emptyP"><span></span></p>
<p><span style="font-weight:bold;">Transparency is not just about data</span></p>
<p class="emptyP"><span></span></p>
<p><span>Performance data tells you how a service has performed in the past, but that doesn’t always predict the future. A seemingly reliable service may have been </span><span><a href="http://groups.google.com/group/mongodb-user/browse_thread/thread/528a94f287e9d77e">on the edge of disaster</a></span><span>; a previously unreliable service may have recently made improvements. To fully evaluate a service, you should also look for an architectural overview. Is it designed for reliability? Are there scaling limits? Single points of failure? How is failover implemented?</span></p>
<p class="emptyP"><span></span></p>
<p><span>The </span><span><a href="http://highscalability.com/blog/2011/4/25/the-big-list-of-articles-on-the-amazon-outage.html">April 2011 AWS outage</a></span><span>&nbsp;provides a good example. One of Amazon’s claims for AWS is that the “availability zones” in each region are decoupled &#8212; failures in one shouldn’t affect another. As far as I know, this claim had been borne out for several years. However, the April outage affected EBS services across all zones in the US-East-1 region. According to Amazon’s postmortem, this was because an important subsystem &#8212; the “EBS control plane” &#8212; was in fact shared across all zones in the region. Thus, a hardware problem in one zone ultimately affected the other zones.</span></p>
<p class="emptyP"><span></span></p>
<p><span>Quite a few sites went down because they had built their disaster recovery plans around Amazon’s promise of zone independence. Monitoring data prior to April 2011 would not have given any hint of this vulnerability. Only if Amazon had published the architectural details of EBS would customers have known to prepare for the possibility of simultaneous failure of EBS in multiple zones.</span></p>
<p class="emptyP"><span></span></p>
<p><span>I don’t mean to pick on Amazon here; it simply happens that, due to their size, they provide a convenient example. To their credit, they did publish a </span><span><a href="http://aws.amazon.com/message/65648/">detailed postmortem</a></span><span>&nbsp;&#8211; another important form of transparency. By contrast, while 37signals reports that Basecamp was down for roughly 6 hours over the last year, I don’t see any postmortems on their site. Postmortems provide an important window into a service’s inner workings, the professionalism of the team, and the likelihood of repeat problems.</span></p>
<p class="emptyP"><span></span></p>
<p><span style="font-weight:bold;">Collateral benefits of transparency</span></p>
<p class="emptyP"><span></span></p>
<p><span>Transparency is not just about trust; it also helps to set expectations. What performance can I expect from this service? Is the latency my benchmark just reported likely to remain consistent over time? How will it change if I store 100 times more data, or change my query pattern? These questions are critical when designing your downstream application.</span></p>
<p class="emptyP"><span></span></p>
<p><span>Transparency is also of value to the community at large: illustrating what real-world production environments look like, and providing points of comparison.</span></p>
<p class="emptyP"><span></span></p>
<p><span>At Scalyr, we’re striving to raise the bar on service transparency. Our current </span><span><a href="https://www.scalyr.com/monitor">dashboard</a></span><span>&nbsp;is just a start. Over the next few months, look for detailed information regarding our internal architecture, even more detailed (and better documented) monitoring data, and a continuing series of posts on the challenges involved in running reliable services.</span></p>
<p class="emptyP"><span></span></p>
<p><span>If you like these ideas, </span><span><a href="https://www.scalyr.com/">check out our first service</a></span><span>, and stay tuned to this blog for bigger things to come!</span></p>
<p class="emptyP"><span></span></p>
<p><span style="font-weight:bold;">Call to action</span></p>
<p class="emptyP"><span></span></p>
<p><span>If you run a service, and you publish any sort of detailed uptime, performance, or architectural information, I’d love to hear from you &#8212; drop me a line at </span><span><a href="mailto:steve@scalyr.com">steve@scalyr.com</a></span><span>. I’ll collect any interesting examples for a future post.</span></p>
<p class="emptyP"><span></span></p>
<p><span>If you use a service, and you have interesting examples where transparency has come in handy &#8212; or the lack of it has bitten you &#8212; send me those, too.</span></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.scalyr.com/2012/01/03/transparency-in-cloud-services/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/7b1ce707b69e2718c05e97dff2dc6daf?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">scalyr</media:title>
		</media:content>
	</item>
		<item>
		<title>Introducing Scalyr</title>
		<link>http://blog.scalyr.com/2011/12/21/introducing-scalyr/</link>
		<comments>http://blog.scalyr.com/2011/12/21/introducing-scalyr/#comments</comments>
		<pubDate>Thu, 22 Dec 2011 01:15:48 +0000</pubDate>
		<dc:creator>scalyr</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.scalyr.com/?p=73</guid>
		<description><![CDATA[Welcome to the Scalyr blog. Today we’re announcing our first service, Knobs. What’s a Knob, you may ask? Or perhaps, what’s a Scalyr? First, a little background. I’ve spent a good chunk of my career developing “in the cloud”. (Building Writely, for instance &#8212; aka Google Docs.) It can be an amazing experience. With the [&#8230;]]]></description>
				<content:encoded><![CDATA[<p><span>Welcome to the Scalyr blog. Today we’re announcing our first service, </span><span><a href="http://www.scalyr.com/knobs">Knobs</a></span><span>.</span></p>
<p class="emptyP"><span></span></p>
<p><span>What’s a Knob, you may ask? Or perhaps, what’s a Scalyr?</span></p>
<p class="emptyP"><span></span></p>
<p><span>First, a little background. I’ve spent a good chunk of my career developing “in the cloud”. (Building </span><span><a href="http://en.wikipedia.org/wiki/Google_Docs">Writely</a></span><span>, for instance &#8212; aka Google Docs.) It can be an amazing experience. With the variety and sophistication of services available today, I sometimes feel like I’m programming with seven-league boots. One day, you wake up to find that thousands or millions of people are using your work.</span></p>
<p class="emptyP"><span></span></p>
<p><span>However, building on cloud services can also be frustrating. Performance can be unpredictable, error messages unhelpful, protocols confusing. Sometimes they </span><span><a href="http://highscalability.com/blog/2011/4/25/the-big-list-of-articles-on-the-amazon-outage.html">go down</a></span><span>. As you scramble to cope, you can’t help but picture those thousands of people glaring at an error page and silently cursing. Cursing you, probably, even if they don’t know who you are. Sometimes you can work around the problem; sometimes all you can do is glare at the error page and add your own curse to the silent chorus.</span></p>
<p class="emptyP"><span></span></p>
<p><span>At Scalyr, we’re building a new breed of cloud services. Services architected for reliability, so you can depend on them. For transparency, so you know what kind of performance and behavior to expect. For simplicity and practicality, so you can integrate quickly and get on with your work. You’ll hear more about all of these themes in future posts.</span></p>
<p class="emptyP"><span></span></p>
<p><span>On to Knobs. For almost as long as there has been code, there have been knobs to tweak. These take many forms &#8212; configuration files, command-line parameters, constants, “magic cookies”. If you’ve written server code, you’ve wrestled with this. You need to specify a threadpool size, or a server address, or some other little constant. You know it might need tweaking, so you put it in a configuration file. And write code to parse the file. And a little script to copy the file to the server. And another script to restart all your servers so they can pick up the change. Oops &#8212; let’s tweak that script to only restart one server at a time! OK, problem solved&#8230; until all those copied files inevitably get out of sync, or you get tired of waiting for a rolling server restart every time you tweak a parameter.</span></p>
<p class="emptyP"><span></span></p>
<p><span>Knobs is a simple service to address this problem. We store configuration files for you; you edit them in a web page, or via our API. We give you a library that lets you read values with a single call. We take care of the rest, managing files and instantly copying updates to all of your servers.</span></p>
<p class="emptyP"><span></span></p>
<p><span>For reliability, we run servers in multiple facilities (of course). Furthermore, the Knobs library maintains a persistent cache of your configuration files on each server. So even if we were to have an outage, you won’t: the library will use the local cache.</span></p>
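<p class="emptyP"><span></span></p>
<p><span>The general pattern of a persistent local cache can be sketched in a few lines of Python. To be clear, this is not the Knobs library or its API; </span><span style="font-style:italic;">load_config</span><span>, </span><span style="font-style:italic;">fetch_remote</span><span>, and the JSON file format are hypothetical stand-ins used only to show the idea of falling back to the last known good copy.</span></p>

```python
import json
import os
import tempfile


def load_config(fetch_remote, cache_path):
    """Return the config dict, preferring the remote copy.

    fetch_remote is any callable that returns a dict or raises on failure.
    On success the result is saved to cache_path; on failure the last
    saved copy is returned instead.
    """
    try:
        config = fetch_remote()
    except Exception:
        # Remote service unreachable: serve the last known good copy.
        with open(cache_path, "r", encoding="utf-8") as cached:
            return json.load(cached)

    # Write to a temporary file and rename it into place, so that a
    # crash in the middle of a write cannot corrupt the cache.
    directory = os.path.dirname(os.path.abspath(cache_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as tmp:
        json.dump(config, tmp)
    os.replace(tmp_path, cache_path)
    return config


if __name__ == "__main__":
    def healthy():
        return {"threadpool_size": 8}

    def outage():
        raise ConnectionError("config service unreachable")

    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "config.json")
        print(load_config(healthy, path))  # fetched and cached
        print(load_config(outage, path))   # served from the local cache
```

<p><span>If the very first fetch fails there is no cache to fall back on, and this sketch simply lets the file error propagate; a real library would need a policy for that case.</span></p>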
<p class="emptyP"><span></span></p>
<p><span>If this sounds interesting, </span><span><a href="https://www.scalyr.com/knobs">learn more</a></span><span>&nbsp;or just </span><span><a href="https://www.scalyr.com/knobStart">dive in.</a></span><span>&nbsp;And if you love the idea of building services that people can really depend on, </span><span><a href="https://www.scalyr.com/contact">drop us a line</a></span><span>&nbsp;&#8211; we’re hiring!</span></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.scalyr.com/2011/12/21/introducing-scalyr/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/7b1ce707b69e2718c05e97dff2dc6daf?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">scalyr</media:title>
		</media:content>
	</item>
	</channel>
</rss>
