Even Stranger than Expected: a Systematic Look at EC2 I/O

At Scalyr, we’re building a large-scale storage system for timeseries and log data (see Introducing Scalyr Logs). To make good design decisions, we need hard data about EC2 I/O performance.

Plenty of data has been published on this topic, but we couldn’t really find the answers we needed. Most published data is specific to a particular application or EC2 configuration, or was collected from a small number of instances and hence is statistically suspect. (More on this below.)

Since the data we wanted wasn’t readily available, we decided to collect it ourselves. For the benefit of the community, we’re presenting our results here. These tests involved over 1000 EC2 instances, $1000 in AWS charges, and billions of I/O operations.

Stranger Than You Thought

This graph plots performance over time, for 45 EC2 instances performing a simple steady-state benchmark. The fluctuations indicate that performance of an individual instance can vary widely over time. If you look carefully, you can see that some instances are much more stable and better-performing than others.

Analyzing the data, we found many patterns. Some things, such as the variations in performance, we’d anticipated. Others came as a surprise; the relationship between instance size, storage type, and performance is more complex than we’d previously seen reported. In this post, we present a variety of findings that should be of interest to anyone working in cloud computing, and that may help you to make better design decisions and avoid performance pitfalls.

I’ll discuss methodology in a later section, but here are the Cliff’s Notes: we tested small reads and writes (“read4K” and “write4K”), large synchronous reads and writes (“read4M” and “write4M”), and small mostly-asynchronous writes (“write4K/64”). We tested a variety of EC2 instance sizes, using instance storage or EBS, on single drives or RAID-0 groups. Each combination was repeated on dozens of EC2 instances.

Cost Effectiveness

This chart shows which configurations give the best bang for the buck — operations per dollar:

These figures reflect EC2 hourly rates and EBS I/O charges, but not EBS monthly storage fees (which aren’t affected by usage). Rates are for on-demand instances in the us-east region. Reserved or spot instances would reduce EC2 charges substantially, but not EBS charges, meaning that non-EBS instances would look better on the chart. The next chart shows cost effectiveness for bulk transfers:

Here, ephemeral storage has a huge advantage, which reserved instances would only amplify.
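For reference, here is roughly how an operations-per-dollar figure like the ones in these charts can be computed. This is only a sketch with placeholder prices; the hourly rate and the per-million-request EBS I/O charge shown below are stand-in values, not the exact rates used for the charts.

```python
# Rough sketch of an operations-per-dollar calculation.
# The prices below are placeholders; substitute current on-demand and
# EBS I/O pricing for your region. EBS monthly storage fees are excluded.

def ops_per_dollar(ops_per_second, hourly_instance_rate, ebs_io_rate_per_million=0.0):
    """Operations per dollar for a workload sustained over one hour."""
    ops_per_hour = ops_per_second * 3600
    ebs_io_cost = (ops_per_hour / 1_000_000) * ebs_io_rate_per_million
    total_cost = hourly_instance_rate + ebs_io_cost
    return ops_per_hour / total_cost

# Example: a hypothetical instance sustaining 500 small reads/sec on EBS.
print(ops_per_dollar(500, hourly_instance_rate=0.065, ebs_io_rate_per_million=0.10))
```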

Impact of RAID

It’s widely held that the best EBS performance is obtained by RAIDing together multiple drives. Here’s what we found:

Each bar represents the throughput ratio of ebs4 to ebs on one specific test: in other words, the speedup moving from one to four EBS volumes. Blue reflects m1.small instances, red is m1.large. Each quantity represents an average across 45 instances. RAID offered a substantial benefit for small operations (especially reads), but — surprisingly — not much for bulk transfers. (Note that we did not make much of an attempt at tuning our filesystem or RAID configuration. See the Methodology section.)

Impact of Instance Size

Amazon states that larger EC2 instances have “higher” I/O performance, but they don’t quantify that. Our data:

For ephemeral storage, m1.medium was hardly better than m1.small, but m1.large and m1.xlarge show a substantial benefit. (The lackluster performance of m1.medium is not surprising: it has the same number of instance drives as m1.small — one — and the same advertised I/O performance, “low”.)

For EBS, m1.large shows little benefit over m1.small.

Shameless plug: if you’ve read this far, you’re probably doing interesting things in the cloud. If you’re doing interesting things, you have “interesting” monitoring challenges. And in that case, you’re just the sort of person we had in mind when we built Scalyr Logs. Check out the blog post and the product page, and register here if you’d like to try it out.

Bad Apples

You often hear that EC2 I/O is subject to a “bad apple” effect: some instances have markedly poor I/O performance, and you can get big gains by detecting these instances and moving off of them. We found that this effect is real, but applies much more strongly to some use cases than others. Consider the following two charts:

These are performance histograms: the horizontal axis shows operations per second, and the vertical axis shows the number of instances exhibiting a particular level of performance. A tall, narrow histogram indicates performance that is consistent across instances. Note that the horizontal axis uses a log scale.
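If you want to build the same kind of view from your own measurements, a histogram like these can be produced from per-instance mean throughputs using log-spaced bins. The sketch below assumes numpy and matplotlib and uses stand-in data; it is not our plotting code.

```python
# Sketch of a per-instance performance histogram with a log-scale x axis.
# `mean_ops_per_sec` would hold one mean throughput value per tested instance.
import numpy as np
import matplotlib.pyplot as plt

mean_ops_per_sec = np.random.lognormal(mean=2.0, sigma=0.5, size=45)  # stand-in data

# Log-spaced bins so the horizontal (ops/sec) axis can use a log scale.
bins = np.logspace(np.log10(mean_ops_per_sec.min()),
                   np.log10(mean_ops_per_sec.max()), 20)
plt.hist(mean_ops_per_sec, bins=bins)
plt.xscale("log")
plt.xlabel("operations per second (per-instance mean)")
plt.ylabel("number of instances")
plt.show()
```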

The first histogram is for bulk writes (write4M) on the small/ebs configuration. 45 instances were tested; the histogram presents the mean throughput for each of those 45 instances. So, the slowest instance sustained roughly 0.3 writes per second (1.2MB/sec write bandwidth), while the fastest sustained a bit over 10 writes/second (40MB/sec) — a difference of more than 30x! Most instances were clustered around 7 writes/second, but 5 out of 45 managed 0.8 / second or less.

The second histogram is exactly the same, but for read4K operations on medium/ephemeral instances. Here, all 45 instances fell in the range of 200 to 300 operations/second; a best/worst ratio of just 1.5 : 1.

So, if you’re doing bulk writes on EBS, you probably need to worry about bad instances. Small reads on EC2 instance storage, not so much. In general, bulk transfers (read4M, write4M) show more variation across instances than small random I/Os (read4K, write4K, write4K/64), and EBS shows more variation than ephemeral storage, but there are exceptions. You’ll find systematic results in a later section (“Variation Across Instances”).

Impact of Parallelism

This chart shows throughput as a function of thread count. Each graph shows results for a particular operation on a particular storage type. Each line shows a particular EC2 configuration.

We can see that parallelism often improves throughput, but diminishing returns set in quickly. For all operations except read4K (and write4K on ebs4), 5 threads are enough. In fact, a single thread is enough for good throughput in many cases. But in some circumstances, small reads can benefit from as many as 48 threads.

Interestingly, for large reads on m1.medium / ephemeral, throughput drops quite dramatically when more than one thread is used. This effect held up consistently throughout multiple test runs, each on its own fresh set of 30 instances, on multiple days. The fact that we observed this effect only on m1.medium highlights the importance of testing the exact configuration you plan to use.

We can also see in this chart that EBS offers inexplicably good performance for small writes. For instance, write4K on large/ebs executes over 800 operations per second with a single thread. This implies a mean latency of roughly one millisecond — barely enough time for a network roundtrip to an EBS server. Either Amazon is doing something very clever, or EBS does not actually wait for durability before acknowledging a flush command.
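The arithmetic behind that claim is just the usual relation between concurrency, throughput, and latency (mean latency ≈ concurrency / throughput), using the single-thread figure quoted above:

```python
# Sanity check: one thread issuing synchronous writes back-to-back means
# mean latency ~= concurrency / throughput.
threads = 1
throughput_ops_per_sec = 800            # observed for write4K on large/ebs
mean_latency_ms = threads / throughput_ops_per_sec * 1000
print(mean_latency_ms)                  # ~1.25 ms per flushed 4KB write
```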

Variation Across Instances

This chart shows how performance varies across 45 nominally identical instances. Each graph presents one benchmark. The vertical axis plots latency, and the horizontal axis plots the individual instances, sorted from best to worst. The five lines represent various latency percentiles, as indicated by the color key. If all instances behaved identically, the graphs would be quite flat, especially as we’re using a log scale.

Remember that the horizontal axis shows EC2 instances, not time. A sloped graph indicates that some instances were faster than others. Continuous slopes indicate gradual variations, while spikes indicate that some instances behaved very differently.
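To make the construction concrete, here is a sketch of how the per-instance percentile lines can be derived from raw per-operation latencies. It assumes numpy; `latencies_by_instance` is a hypothetical structure mapping an instance id to the list of individual operation latencies recorded on that instance.

```python
# Sketch: compute several latency percentiles per instance, then sort
# instances from best to worst (here, by median latency).
import numpy as np

def percentile_table(latencies_by_instance, pcts=(50, 90, 99, 99.9)):
    rows = []
    for instance_id, samples in latencies_by_instance.items():
        rows.append((instance_id, [np.percentile(samples, p) for p in pcts]))
    rows.sort(key=lambda row: row[1][0])   # order instances by median latency
    return rows
```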

We can see that ephemeral storage latency is fairly consistent in general, though it’s not uncommon to have a few outliers. (For instance, for read4K on m1.xlarge / ephemeral, one instance appears to have mean latency more than 4x worse than the best instance.) Our sample size is too small to properly judge the prevalence of these outliers. Also note that bulk transfers show more variation than small operations.

EBS shows more variation, except for write4K/64. writeFlush operations (write4K and write4M) are especially bad, with 10:1 variations the norm.

Note that larger instances don’t always appear faster here, because we’re working them harder (more threads), and these graphs show latency rather than throughput.

Performance over time (A Twisty Maze of Instances, All Different)

Next, we examine how throughput varies over time. Each graph contains 45 lines, one per EC2 instance. The vertical axis is throughput, and the horizontal axis is time. The total time span is 10 minutes. This is not long enough to show long-term trends, but with 45 test instances, there is some scope for infrequent events to manifest.

The vertical axis on each graph is normalized to the maximum sample for that graph, so you should not attempt to compare values across graphs.

To my mind, this is the most remarkable chart in the entire investigation, because no two graphs look alike. If throughput were consistently stable across instances and over time, each graph would be a single thin bar. Instead, we see collections of horizontal lines (indicating performance that varies from instance to instance, but is steady over time); widely spaced horizontal lines (more variance between instances); wiggly lines (performance oscillating within a stable band); wild swings; gradual upward or downward motion; high-performing outliers; low-performing outliers; bimodal distributions; and more.

There do seem to be a few general trends. EBS is noisier than ephemeral (EC2 instance) disk, which is not surprising — EBS has more moving parts. And on EBS, write performance is more variable than read performance. (Which is not to say that EBS reads are more consistently fast. It’s more that reads are consistently slow, while EBS writes are usually-but-not-reliably faster.)

We see confirmation here that in some cases there are “good” and “bad” instances, but we don’t have enough data to determine whether this is stable over long periods of time — whether good instances remain good, and bad instances remain bad.

Single-threaded latency

This chart shows operation latency. Most of the results we’ve examined used a thread count that optimizes throughput, but here we use a single thread, minimizing latency. The X axis indicates which instance setup was used (see the legend at the bottom of the chart), and the Y axis shows latency in milliseconds. Values are based on aggregate performance across 30 instances. The left column shows median, mean, and 90th percentile latency; the right column shows 99th and 99.9th percentile. I don’t present results for bulk transfers (read4M and write4M), as these are inherently throughput-oriented.

We can see that small reads (read4K) take roughly 10ms on every machine configuration. This makes sense, as the benchmark is designed to force a disk seek for every read, and seek time is likely to dominate other factors such as network latency. (There is a slight decrease in read latency on larger instances. That’s probably a caching artifact — m1.xlarge instances have enough RAM to cache roughly 20% of our 80GB file.)

The write4K results seem difficult to explain. For instance, why do larger instances show such drastically lower latency? And how can writes possibly be made durable so quickly, especially for EBS where a network hop is involved?

For write4K/64, the median and 90th percentile latencies hug the floor, which makes sense, as over 98% of these operations are not synchronously flushed and hence don’t wait for disk. The mean and higher percentiles follow roughly the same pattern as write4K.

For a final bit of fun, let’s look at a detailed histogram for one benchmark:

This shows latency for write4K operations on the xlarge/ephemeral configuration. The horizontal axis shows latency (log scale), and the vertical axis shows the number of individual operations with that particular latency. Each spike presumably indicates a distinct scenario — cache hit; cache miss; I/O contention with other tenants of the physical machine; etc. Clearly, several mechanisms are coming into play, but it’s not obvious to me what they all might be.

Methodology

“I/O performance” is a complex topic, involving the filesystem, device drivers, disk controllers, physical disk mechanisms, several levels of caching, buffering, and command queuing, etc. Caching aside, the two most important factors are seek time and bandwidth.

It’s important to remember that these are independent. A storage system can have high bandwidth but poor seek time, or the reverse. For an extreme example, consider your DVD collection. (Some of you must remember DVDs.) A “seek” involves walking over to the shelf, grabbing a disk, inserting it into the player, and waiting for it to load — a long time! But once the disk has loaded, the player can stream data at fairly high bandwidth. At the opposite extreme, early generation digital camera memory cards had fast “seek” times, but limited bandwidth.

With all this in mind, we performed two sets of benchmarks, each structured as follows:

1. Allocate a number of identical EC2 instances. The remaining steps are executed in parallel on each instance.

2. Create a single 80GB disk file, populated with random data. (80GB should be large enough to minimize cache effects, ensuring that we are measuring the performance of the underlying I/O system. Note that AWS may perform caching at a level we can’t control, so filesystem or kernel flags to disable caching are not sufficient. An 80GB file is our “nuke the site from orbit” approach to disabling caches.)

3. Spin up a number of threads (T), each of which runs in a tight loop for a specified duration. For each pass through the loop, we select a random position in the file, synchronously read or write a fixed number of bytes at that position, and record the elapsed time for that operation. (A minimal code sketch of steps 2 and 3 appears below, after the operation list.)

Step 3 is repeated multiple times, for various combinations of threadcount and operation. The operation is one of the following:

  1. read4K: read 4KB, at a 4KB-aligned position.
  2. read4M: read 4MB, at a 4MB-aligned position.
  3. write4K: write 4KB of random data, at a 4KB-aligned position.
  4. write4M: write 4MB of random data, at a 4MB-aligned position.
  5. write4K/64: like write4K, but with fewer flushes (see below).

For write4K and write4M, the file was opened in writeFlush mode (each write is synchronously flushed to disk). For write4K/64, the file was opened in write mode (no synchronous flush), but after each write, we perform a flush with probability 1/64. In other words, for write4K/64, we allow writes to flow into the buffer cache and then occasionally flush them.
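For concreteness, here is a minimal Python sketch of steps 2 and 3. Our actual harness was not written this way; the file path, chunk sizes, and the use of O_DSYNC/fsync to approximate “writeFlush” mode are illustrative choices, not a description of our code.

```python
import os, random, time

FILE_PATH = "/mnt/benchmark.dat"   # hypothetical path on the volume under test
FILE_SIZE = 80 * 1024**3           # 80 GB, large enough to defeat caching
BLOCK_4K  = 4 * 1024
BLOCK_4M  = 4 * 1024**2

def create_test_file():
    """Step 2: write 80GB of random data in chunks, never holding it all in RAM."""
    with open(FILE_PATH, "wb") as f:
        written = 0
        while written < FILE_SIZE:
            f.write(os.urandom(BLOCK_4M))
            written += BLOCK_4M

def worker(op, duration_sec, latencies):
    """Step 3: one thread's tight loop; appends per-operation latency (seconds)."""
    flags = os.O_RDWR
    if op in ("write4K", "write4M"):
        flags |= os.O_DSYNC            # "writeFlush" mode: each write is flushed
    fd = os.open(FILE_PATH, flags)
    block = BLOCK_4M if op in ("read4M", "write4M") else BLOCK_4K
    payload = os.urandom(block)
    deadline = time.time() + duration_sec
    while time.time() < deadline:
        pos = random.randrange(FILE_SIZE // block) * block   # aligned random offset
        start = time.time()
        if op.startswith("read"):
            os.pread(fd, block, pos)
        else:
            os.pwrite(fd, payload, pos)
            if op == "write4K/64" and random.randrange(64) == 0:
                os.fsync(fd)           # occasional flush of buffered writes
        latencies.append(time.time() - start)
    os.close(fd)
```

Step 3 then amounts to starting T such workers as threads for the run duration and merging their recorded latencies.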

All of this is repeated for eight different EC2 configurations. The configurations differ in EC2 instance type (m1.small, m1.medium, m1.large, or m1.xlarge), and disk arrangement:

  1. “Ephemeral” — drives associated with an EC2 instance. For instance types with multiple instance drives (e.g. m1.large), the drives were joined using RAID0.
  2. “EBS” — a single EBS volume.
  3. “EBS4” — four EBS volumes, joined using RAID0.

We tested eight of the twelve possible combinations. I will refer to these using a shorthand, such as “small/ebs4” for an m1.small instance with four EBS volumes in a RAID0 arrangement, or “xlarge/ephemeral” for an m1.xlarge instance with its instance drives also in RAID0.

For the first set of benchmarks, 30 instances of each configuration were used — a total of 240 instances. Each instance performed a series of 42 two-minute benchmark runs:

  1. read4K: 10 separate runs, one each with T (threadcount) 1, 2, 4, 8, 12, 16, 24, 32, 48, and 64.
  2. read4M, write4K, write4M, write4K/64: 8 runs each, with T = 1, 2, 3, 4, 6, 8, 12, and 16.

Each instance performed these 42 runs in a different (random) order. This benchmark was primarily intended to explore how performance varies with threadcount.
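As a rough illustration, the 42-run schedule for this first benchmark set could be generated like this. This is a sketch; `run_benchmark` is a stand-in name for the actual harness entry point.

```python
import random

# Threadcounts per operation, from the two lists above; 10 + 4*8 = 42 runs total.
THREAD_COUNTS = {
    "read4K":     [1, 2, 4, 8, 12, 16, 24, 32, 48, 64],
    "read4M":     [1, 2, 3, 4, 6, 8, 12, 16],
    "write4K":    [1, 2, 3, 4, 6, 8, 12, 16],
    "write4M":    [1, 2, 3, 4, 6, 8, 12, 16],
    "write4K/64": [1, 2, 3, 4, 6, 8, 12, 16],
}

runs = [(op, t) for op, threads in THREAD_COUNTS.items() for t in threads]
assert len(runs) == 42
random.shuffle(runs)                  # each instance uses its own random order
for op, threads in runs:
    pass  # run_benchmark(op, threads, duration_sec=120)  -- one two-minute run each
```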

For the second set of benchmarks, 45 instances of each configuration were used — 360 instances in all. Each instance performed a series of 5 ten-minute benchmark runs: one for each of read4K, read4M, etc. Here, we used whatever threadcount was found, in the earlier benchmarks, to yield optimal throughput for that configuration and operation type. This benchmark was intended to provide a lower-variance view of performance across instances and instance types.

Here are the threadcounts used in the second run. (Note that we incorporated a slight bias toward smaller threadcounts: we used the smallest value that yielded throughput within 5% of the maximum. A small sketch of this selection rule follows the table.)


                    read4K   read4M   write4K   write4K/64   write4M
small/ephemeral       24        1        1          16          2
small/ebs             48        6        3           2          3
small/ebs4            32       16       12           2          1
medium/ephemeral      24        1        1           6          2
large/ephemeral       48        3        4           1          3
large/ebs              8       12        4           4          2
large/ebs4            16        4       16          12          4
xlarge/ephemeral      32        2       12           1          6
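The selection rule mentioned above can be expressed in a few lines. This is a sketch; the numbers in the example are made up.

```python
# Threadcount selection: take the smallest threadcount whose mean throughput
# is within 5% of the best observed throughput for that configuration/operation.
def pick_threadcount(throughput_by_threads):
    """throughput_by_threads: dict mapping threadcount -> mean ops/sec."""
    best = max(throughput_by_threads.values())
    candidates = [t for t, ops in throughput_by_threads.items() if ops >= 0.95 * best]
    return min(candidates)

# Example (made-up numbers): 8 threads already gets within 5% of the peak.
print(pick_threadcount({1: 90, 2: 160, 4: 270, 8: 300, 16: 310, 32: 312}))  # -> 8
```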

All tests used ext3, with the default block size (4KB) and noatime option. RAID0 configurations used the default 512KB chunk size. No special attempt at filesystem tuning was made. We used the default Amazon Linux AMIs (ami-41814f28 and ami-1b814f72).

This work was done shortly before Amazon introduced solid-state storage, EBS-optimized instances, and provisioned IOPS. We may examine these options in a followup post.

Thoughts on Benchmark Quality

Since the dawn of time, repeatability has been a critical topic for benchmarks. If you run the same benchmark twice, you often get different results. Cache state, background tasks, disk hardware glitches, and a thousand other variables come into play. For this reason, it’s always been good practice to run a benchmark several times and take the median result.

In the cloud, this tendency is vastly amplified. Remember that histogram of bulk write throughput across EBS instances? The variation from best to worst instance wasn’t a few percent, it was thirty to one. Thirty to one! This means that a single-machine test could easily report that small instances are faster than xlarge instances, or any other sort of nonsense.

Under these circumstances, for a cloud benchmark to have any validity, it must include data from many machine instances. Single-machine data is worse than useless; you simply don’t know whether you’re measuring application behavior, or random cloud background noise. Don’t just take single-machine cloud benchmarks with a grain of salt: ignore them entirely.

It’s also important to be very clear on what your benchmark is measuring. Application-level benchmarks are complex, and hard to generalize. Knowing how long a system takes to build the Linux kernel doesn’t tell me much about how it will handle MySQL. For that matter, MySQL performance on your workload may not say much about MySQL performance on my workload. So the gold standard is always to perform your own tests, using your actual application, under your actual workload. But failing that, the benchmarks with the greatest general applicability are those that measure basic system properties, such as I/O bandwidth.

With all this in mind, I’d like to propose a set of guidelines, which I’ll call “Cloud Truth”, for cloud benchmarks. (By analogy to “ground truth”, which refers to the process of going to a location and taking direct measurements of some property, to calibrate or verify satellite images or other remote sensing. More broadly, “ground truth” represents the most direct possible measurement.) For a result to be considered Cloud Truth, it should:

  1. Directly measure basic system properties
  2. Include measurements from many instances
  3. Over a substantial period of time
  4. Use clearly explained, reproducible methods (preferably including source code)
  5. Clearly explain what was measured, and how
  6. Clearly explain how the results were averaged, aggregated, or otherwise processed

This is not easy. (The results I’m presenting here fall short on “substantial period of time”.) But when reporting cloud benchmarks, these are the standards we must aspire to.

Limitations / Future Work

Considerable effort and expense went into these benchmarks, but there is still plenty of room to explore further. Some areas we haven’t yet touched on:

  1. How instances behave over long periods of time
  2. New AWS I/O options (solid-state storage, EBS-optimized instances, and provisioned IOPS)
  3. Tuning filesystem and RAID configuration
  4. Impact of EBS snapshots
  5. Variations across AWS zones and regions
  6. Other cloud providers
  7. Single-tenant and/or non-cloud machines

One Last Plug

If you’ve read this far, you probably take your engineering seriously. And if so, you’d probably like working at Scalyr. Why not find out? https://www.scalyr.com/jobs

If there’s interest, we’ll publish the raw data and source code for these benchmarks. Drop us a line at contact@scalyr.com.

Thanks to Vibhu Mohindra, who did all the heavy lifting to implement and run these benchmarks. Also to Steven Czerwinski and Christian Stucchio for feedback on an early draft of this post.


41 Comments on “Even Stranger than Expected: a Systematic Look at EC2 I/O”

  1. Justin Forder says:

    Please do publish your code, instance configuration and data. Have you discussed methodology and results with CloudHarmony? Has Amazon given any advice regarding your puzzling findings? Do your benchmarks produce the results you would expect when you run them on physical (rather than virtual) machines?

    Thanks for an interesting post, and for spending so much time and money on this investigation – I look forward to hearing more.

    Justin Forder

    • scalyr says:

      Thanks! There has been plenty of interest, so we will assemble the code and raw data and get them posted in a few days — you can check back on the blog.

      We didn’t discuss this work with any cloud providers in advance. Since publication, a few providers (including Amazon) have reached out. It would be great to have similar data from other providers (as well as other configurations, e.g. Provisioned IOPs as many people have mentioned). We’ll see whether we can find the time for that, and hopefully others can get involved once we post the code.

  2. Good to see a thorough analysis of IO performance. Empirical evidence of the bad apples is particularly useful. Be great to run those over a few days – maybe get Amazon to foot your bill for you ;-)

  3. Would love to see numbers from the m2 instances. Not really familiar with I/O dependent applications that make sense to run on an m1.large, much less an m1.small. We run our main database on an m2.4xlarge which I assume is fairly typical. Obviously for a database a big part of that is memory size, but it would be great to know empirically if and how I/O performance varies across the larger instances sizes.

    Also I can tell you right now that EBS Provisioned IOPS *really does work*. If predictable latency and throughput matter to you, you should be using it.

  4. Mat Young says:

    very interesting data. I started in enterprise storage in the ’90s, and it’s amazing to me that we still measure storage performance in ms. I think AWS is a great service based on what it’s built on. However, I think in general planet earth needs to do better. I do still work in the industry but this view is my own.

  5. Wow, really a great read! I’m looking forward to your findings on provisioned IOPS on EBS!

  6. Great post, there is a lack of quality analysis in this area, and as you know, creating such analysis is not an easy endeavor. I’ve been working on some similar analysis with EC2 and other providers, and the conclusions are somewhat similar. I think the big missing point in this post is with regards to EBS optimized instances and provisioned IOPS EBS volumes, where we’ve observed a dramatic improvement in performance consistency. I think AWS has recognized this as being a significant pain point for users and these new EC2/EBS deployment options are a good answer to that. Here are links to a couple early summaries of the analysis I’ve done on EC2, Rackspace and HP (this is still a work in progress – I plan to publish a blog post soon):

    Disk Performance:
    The value column is a percentage relative to a single baremetal 15k SAS baseline drive, where 100% signifies comparable performance. Benchmarks included in this measurement are fio (4k random read/write/rw; 1m sequential read/write/rw), fio – Intel IOMeter pattern, CompileBench, Postmark, TioBench and AIO stress:

    http://dl.dropbox.com/u/20765204/1012-disk-io-analysis/disk-performance.html

    Disk IO Consistency:
    The value column is also a percentage relative to the same baremetal baseline. However, below 100 represents higher consistency and above 100 represents lower consistency. The value is constructed by running multiple tests on a given instance, measuring the standard deviation of IOPS between each, and comparing those standard deviations to the baseline. Testing was conducted over a period of a month on multiple instances in different regions and AZs.

    http://dl.dropbox.com/u/20765204/1012-disk-io-analysis/disk-consistency.html

    • scalyr says:

      Thanks for posting this. It’s good to see comparisons across cloud providers, as well as a greater variety of instance types than we were able to include.

      The measurements here are a composite across all of the benchmarks (fio, CompileBench, etc.)? When you post your full results, it would be nice if you could also include a breakout all the way down to low-level operations (random read, random write, etc.), as well as more details on the variations across instances and over time.

      Some of the comparisons are unexpected / interesting. For instance, cc2.8xlarge shows a much worse consistency value than cc1.4xlarge. Just another data point showing that it’s hard to make generalizations.

    • Interesting comparison, but when you add the hi1.4xlarge ephemeral to the disk-performance table, you are going to have problems rating it, since you already used up “A+” and I think you will see about 10x better performance than the best non-SSD instance type.

  7. One other interesting observation we’ve experienced is that raid on multiple standard EBS volumes often results in greater performance inconsistency compared to just a single volume, likely due to small variations in latency or network paths to those underlying volumes.

  8. Omry says:

    Thank you very much for this, it will help me a lot in my considerations.

  9. Great post – some serious detail and very well thought out. The “variation across instances” chart is mind blowing – such high variance, such high latency.

    Was there an application reason to kill (or nuke) the cache? Scalyr Logs will store a lot of data, but with a large working set size too? … I’ve seen many cloud environments running with FS cache hit ratios of over 99%, one at 99.97% (not on AWS), where nuking the cache would be a divorce from reality. It does simplify the test (no need to worry about working set size and cache hit ratio), and maybe it does resemble Scalyr.

    I like the Cloud Truth list. Here are a couple of extra points:

    - Explain how the results were double/triple-checked (include command output)
    - Explain what you think the limiting factor is (as this post has done)

    Double-checking can be using different system tools to confirm (or at least sanity check) that the workload is doing what it says it’s doing. E.g., iostat(1) for disk I/O benchmarks (DTrace too if you have it). This will shake out a lot of dumb mistakes – like hitting from local cache when you meant to test remote (shouldn’t have happened here, except for the 20% on m1.xlarge, but always a good idea to double-check).

    The detailed histogram is also great. Note that it shows at least a bimodal distribution, if not more. And this came right after mean/median plots. :) This is why I’ve switched to latency heat maps where possible.

    Does anyone know if the bad apples stay bad apples? I wonder if they are always bad (eg, bad disks), or if this is contention with another tenant benchmarking at the same time. ;)

    • scalyr says:

      Correct, the throughput and latency measurements presented here won’t correspond to the observed performance in a real-world application. That wasn’t the intent; this is a microbenchmark meant to directly exercise disk I/O. Every application has a different working set, so a test that includes caching is harder to apply to a broad range of applications.

  12. Is the result a big surprise? You are effectively quantifying the impact of a random assignment of other applications contending for a common IO infrastructure. With lots of variation observed, there is more chance of collisions in big rather than small operations. I’d like to see a table with: instance, mean IOPS/$, std. Also, is it possible to hedge by region?
    Good work

    • scalyr says:

      We’ll work on getting our raw data posted here on the blog soon, so you can slice and dice it in different ways.

      It would be great to have data across regions (and service providers) — just not something we’ve had time for. There are so many factors to test that the number of combinations becomes difficult to manage.

  11. This is really great work. Thanks so much for sharing. This is really valuable.

    I had a question in regards to throughput. We just set up RAID-0 on xLarges ephemeral (so 4x drives). We’re going for max write throughput of relatively small data blocks / random IO. So I’m looking at your Write4K row of your throughput chart.

    Are the scales the same? Is your chart saying I could get better write throughput on a medium than an XL?

    • scalyr says:

      If you’re referring to the “Throughput over time” graphs, then the answer is no — the scales are not the same. (Otherwise the vertical axis would have been too scrunched up to show the individual instances.) There’s a note to that effect buried in the text; I should probably have mentioned it more prominently. What this graph does show us is that m1.medium instances were showing more *consistent* behavior than m1.xlarge for small writes — the variation between instances is smaller, and individual instances are much more stable over the 10-minute test period.

      The bar graph in the “Impact of Instance Size” section shows that we found m1.xlarge instances to have almost a 4x throughput advantage over m1.medium for small writes. That’s a mean across all tested instances; the other graph shows that some m1.xlarges will not do as well.

  12. Nice to see such detailed data about a topic that is often misunderstood, thanks for sharing.

    I’ve been working on benchmarking the IO performance of the AWS Elastic Block Storage for some time and recently published the results. You can read about the methodology and see the charts here: http://iomelt.com/iomelt-aws-benchmark-august-2012/

    Some interesting findings:

    - The same instance type shows different behavior depending on the region it is running in; this is particularly critical if you depend on multiple regions for disaster recovery or geographical load balancing.
    One could argue that now you can use EBS Optimized instances to overcome this “problem”; I’ve not tested these instances yet, and not every instance type has this feature available

    - Generally speaking, performance is better and more consistent in the South America region when compared to Virginia; this is probably due to the fact that the SA region was the latest to be deployed. Maybe the SA datacenter uses new server models or it is just underutilized, but this is a wild guess

    - Write performance for the medium instance type in Virginia abruptly decays, dropping from almost 400 calls/s to something around 300 calls/s; this is not very clear in the scatter plot, but if you draw a time-based chart you can clearly see this pattern. This is the main reason you see two spikes in the density chart. Read performance in the SA small instance shows a similar behavior.

    - Small instances definitely should not be used for disk IO bound applications, since their behavior is rather erratic even for read operations; this is particularly true in the Virginia region

    I’ve done the very same tests on several VPS providers here in Brazil and found some disturbing results. In one case the read and write performance simply plummets at 03:00AM, probably due to backup or maintenance procedures.

    Once again, kudos for the effort you’ve definitely put into this and many thanks for sharing.

    • scalyr says:

      Thanks for sharing! Measurements over a long period of time are very important, and something I wish we’d been able to do in our tests. It would be great if you could do more to boil down the results.

  13. This is a nice analysis, and a lot of good work. My main issue is that it implies (by omission) that it tested all the options for EC2 I/O, whereas it actually tested the low-end subset of the capabilities, and anyone interested in getting good I/O would gravitate to the high-end instances and the high-performance I/O options. Since the SSD-based hi1.4xlarge and the EBS Provisioned IOPS options came out several months ago, there’s no excuse not to mention them in a post on I/O performance, especially since the reason they exist is to address the issues you measured.

    • To clarify, I see that you did mention those new options at the end of the post, but there was a lot of work put into getting these measurements, which is effectively wasted work. I would have concentrated on measuring the new options, and added a few measures of the older ones for comparison.

      • scalyr says:

        Yes, it is unfortunate that we were unable to include the newer options such as Provisioned IOPS. All of this work was actually done before Provisioned IOPs and hi1.4xlarge were announced; it took quite a while for us to get the post cleaned up and posted to the blog. Given the level of attention this post has attracted, we plan to do another round of tests with Provisioned IOPS. Look for the results here on the blog, though it may take a while.

        Including high-end instance types is something of a dilemma. In these tests I especially wanted to focus on variation across instances, which of course requires gathering data from many instances. For high-end instance types, this rapidly becomes expensive. I’ll consider how we might work around this.

  14. Raghuraman says:

    Such a detailed analysis. Kudos for the effort. As mentioned by you and others, the recent PIOPS and EBS-Optimized Instances address the long pending pain points of having consistent IO. A benchmark including those would throw better insights. Were these tests run in multiple AZs? There are other server benchmarks which mention that newly deployed AZs perform better. For example, availability of better Intel processors in new AZs which stand to perform 1.2X-1.5X better than older procs. I am sure that would be the case with I/O as well when we put multiple AZs to test

  15. Quora says:

    What’s the best choice for storage when running HDFS in the cloud: S3, EBS, or Instance Storage?…

    Hi Eric. An investigation recently showed that ephemeral storage is more cost effective as well as faster than EBS volumes (http://blog.scalyr.com/2012/10/16/a-systematic-look-at-ec2-io/) Also, if I deploy a cluster and use an HDFS replication factor o…

  16. Tariq says:

    Great post with some meaningful insights; however, I have a few questions. Why did you choose ext3 as the filesystem, considering ext4 is fairly stable now and delivers reasonably better random/sequential writes? Isn’t EBS already RAIDed hardware? Why RAID it again, and with RAID 0 at that, when RAID 10 is used for all practical purposes for critical data stores by most? Is there a concrete study around RAIDing EBS volumes? If so, please share it.
    Thanks for your effort

    Tariq.

    • scalyr says:

      ext3 vs. ext4: we were primarily concerned with measuring variability across instances and over time, rather than absolute performance. Hence, we did not spend much effort on tuning. ext3 is the default filesystem on the standard Amazon Linux AMI.

      RAID: it is common to use RAID to achieve higher throughput on EBS, so we wanted to measure the effect. Yes EBS already incorporates some level of redundancy, but an individual EBS volume has finite I/O throughput; RAIDing multiple volumes can increase throughput (as shown here and in many other studies).

  17. [...] Even Stranger than Expected: a Systematic Look at EC2 I/O → [...]

  18. Great information, thanks for posting… I too would be interested in performance across cloud providers.
    Not sure if you came across the Ravello announcement yet, but check it out ->

    http://www.ravellosystems.com/

    You could upload the VM(s) that run this test and very easily deploy them to Amazon (east and west) and Rackspace without doing any extra work… it’s quite cool, and it just happens to be free during the beta launch. They even cover your VM cost!

    They plan to add more and more clouds soon, so having this common blueprint repeat and repeat would be a piece of cake… There are a few other variables involved, as today the interface doesn’t allow you to size the backend instance, but results would be interesting…

    Thanks again
    Kyle

  19. katemats says:

    Thank you for sharing this great analysis. It is a must read for anyone using AWS. Thanks!

  20. [...] The specifics of virtualization: EBS is slow. EBS performance is unstable http://blog.scalyr.com/2012/10/16/a-systematic-look-at-ec2-io/ [...]

  21. [...] EBS performance is unstable (see this link for more details), and instance throughput varies quite a bit [...]

  22. Nick says:

    I’m not sure if it was clearly outlined but ephemeral storage I/O is not billed and EBS I/O is.

  23. […] this topic is the subject of an interesting benchmark by scalyr.com, a benchmark that demonstrates how unpredictable the I/O of AWS instances can be, as well as […]

  24. […] this topic is the subject of an interesting benchmark by scalyr.com, a benchmark that demonstrates how unpredictable the I/O of AWS instances can be, as well as […]

  25. […] Even Stranger than Expected: a Systematic Look at EC2 I/O (by Scalyr) […]

  26. […] interested in quantitative analysis of cloud performance, you might like our older post, A Systematic Look at EC2 I/O. And you might also be interested in our server monitoring and log analysis service, Scalyr Logs. […]

  27. […] Even Stranger than Expected: a Systematic Look at EC2 I/O (by Scalyr) […]

