The EC2 EBS First-Write Penalty
April 12, 2010
I wanted to summarize some information on the EC2/EBS “first-write penalty” and give myself a container for future information that arises with respect to it.
Amazon’s docs say:
Due to how Amazon EC2 virtualizes disks, the first write to any location on an instance’s drives performs slower than subsequent writes. For most applications, amortizing this cost over the lifetime of the instance is acceptable. However, if you require high disk performance, we recommend initializing drives by writing once to every drive location before production use.
Although Amazon says that you typically do not need to worry about it, I do prefer to initialize our drives because we have an append-heavy workload (we store a lot of data for analytics). In any case, initialization is very important before you benchmark, because benchmarks may allocate a lot of space in an unrealistically short period of time.
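As a rough illustration of what “writing once to every drive location” could look like, here is a minimal sketch in Python. The helper name `zero_fill` and its default chunk size are my own assumptions, not anything from Amazon's documentation. It overwrites every byte of its target with zeros, so it is only suitable for a brand-new, empty volume.

```python
import os


def zero_fill(path, chunk_size=1024 * 1024):
    """Overwrite every byte of `path` with zeros and return the bytes written.

    DESTRUCTIVE: intended only for a fresh volume that holds no data yet.
    `zero_fill` is a hypothetical helper name used for illustration.
    """
    zeros = bytes(chunk_size)
    written = 0
    # Unbuffered binary mode; "r+b" keeps the existing size instead of truncating.
    with open(path, "r+b", buffering=0) as dev:
        size = dev.seek(0, os.SEEK_END)  # seek() returns the new absolute position
        dev.seek(0)
        while written < size:
            remaining = size - written
            written += dev.write(zeros[:min(chunk_size, remaining)])
        os.fsync(dev.fileno())  # make sure the writes actually reach the volume
    return written
```

Calling `zero_fill("/dev/sdx1")` on an unused device (run as root) would touch every location once. On a volume that already holds data, this is exactly what you must not do.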
Note that this quote is also about instance stores. According to Amazon technical support (though not EBS documentation) a similar issue applies to EBS.
1. There is an extra penalty on both ephemeral local spindles, and EBS volumes, for first write. For EBS there is an identical first-read penalty.
Note the second sentence…EBS also has a first-read penalty, if the first access is a read.
Unfortunately, in one circumstance, we put a database into production before doing the usual initialization step. This led to a bit of a conundrum: did I really want to write to every unused block on the same volume as my database server? Could I do that easily without causing fragmentation, slowdown, or out-of-space risks? This particular database is not running on RAID because it has a requirement for snapshots. The RAID initialization step would have obviated the problem. (In the future we’ll handle snapshots from a real-time replication slave.)
The interesting thing about the “first-write penalty” is that it is actually misnamed, according to Amazon Technical Support. It is also incurred if the first access to a block is a read:
The EBS penalty is best described as a first-use penalty.
So, if you have written a block, you will take a penalty on the first write, but not on the (subsequent) first read of the block.
Of course, real applications (filesystems et al) always write before they ever read, so real applications will always experience the EBS first-use penalty as a first-write penalty.
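To make the first-use rule concrete, here is a toy model of it. This is only a sketch of the behaviour as Amazon support described it, not of how EBS is implemented, and the class name `FirstUseVolume` is made up for illustration.

```python
class FirstUseVolume:
    """Toy model: a block is slow the first time it is touched, by read or write."""

    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        self.touched = set()

    def access(self, block):
        """Return True if this access pays the first-use penalty."""
        if not 0 <= block < self.num_blocks:
            raise IndexError(block)
        first_use = block not in self.touched
        self.touched.add(block)
        return first_use

    # In this model reads and writes are treated identically.
    read = write = access


vol = FirstUseVolume(num_blocks=4)
print(vol.write(0))  # True: first touch of block 0
print(vol.read(0))   # False: block 0 was already written
print(vol.read(1))   # True: first touch of block 1 happens to be a read
```

The last line is the case that matters here: a read of a never-touched block pays the penalty, so reading the whole device once should warm it up just as well as writing to it.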
But in this particular case, it was more convenient for me to do a first-read rather than a first-write. Unix is cool (and was in its day revolutionary!) because it exposes the underlying device to you as a “file”, which you can work with like any other file. In this case, let’s pretend our device was called “/dev/sdx1”.
My first thought was to use “cat /dev/sdx1 > /dev/null”. That would have read every byte in a very elegant Unix-y way. But I didn’t want to interfere with the load on the server. nice and ionice were not doing the job (I didn’t investigate why, exactly).
So I just wrote a Python program that was designed to read a bit of data, pause to let the drive handle other traffic, then read a bit more. It dynamically adjusts its chunk size to approximate a goal of 0.05 seconds of read time for every 0.25 seconds of clock time. Of course, if the server had been working flat out, then even this little bit of workload might have had negative consequences. But if your server is frequently working flat out, then you’ve got bigger problems.
What would have happened if the server had been working flat out is that the chunk size would have shrunk to a single byte every 0.25 seconds. This would have “stolen” 4 of the roughly 100 operations per second that I could expect from the disk, and almost none of the roughly 100MB per second of throughput. It would also have taken approximately forever to finish.
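The arithmetic behind “approximately forever” can be sketched as follows. The 100 GiB volume size is a made-up figure for illustration, and the function name is hypothetical. The 0.25 second pause, the 0.05 second read target and the roughly 100MB per second come from the text above. Each cycle is counted as one read plus one pause.

```python
def pass_duration_seconds(size_bytes, chunk_bytes, read_seconds, pause_seconds=0.25):
    """Seconds for one full pass: each cycle is one read plus one pause."""
    cycles = -(-size_bytes // chunk_bytes)  # ceiling division
    return cycles * (read_seconds + pause_seconds)


GiB = 1024 ** 3
volume = 100 * GiB  # hypothetical volume size, not from the original setup

# Worst case: chunk size collapsed to one byte, read time treated as negligible.
worst = pass_duration_seconds(volume, chunk_bytes=1, read_seconds=0.0)
print("worst case: %.0f years" % (worst / (365.25 * 24 * 3600)))

# On-target case: 0.05 s reads at roughly 100MB per second, i.e. about 5 MB chunks.
typical = pass_duration_seconds(volume, chunk_bytes=5 * 1000 ** 2, read_seconds=0.05)
print("on target: %.1f hours" % (typical / 3600))
```

Under these assumptions the one-byte worst case works out to centuries, while a pass that stays on target finishes in a couple of hours.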
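import os
import sys
import time

MiB = 1024 ** 2
chunk_size = 5 * MiB

# Open the device (or file) in binary mode without Python-level buffering.
f = open(sys.argv[1], "rb", buffering=0)
filesize = f.seek(0, os.SEEK_END)  # seek() returns the new absolute position
f.seek(0)
print("Size", filesize)

count = 0
while True:
    before = time.time()
    data = f.read(chunk_size)
    delta = time.time() - before
    if not data:
        break
    count += len(data)
    print(time.ctime(before), "%5f" % delta, "%5f%%" % (100 * float(count) / filesize),
          count, "of", filesize, "chunk size", chunk_size)
    # adjust read speed dynamically: aim for about 0.05 seconds per read
    if delta < 0.05:
        # grow when reads are fast (factor assumed to mirror the shrink factor)
        chunk_size = max(chunk_size + 1, int(chunk_size * 1.25))
    elif delta > 0.05:
        # shrink when reads are slow, but never below a single byte
        chunk_size = max(1, int(chunk_size / 1.25))
    time.sleep(0.25)
print("done")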
Given that this is a production system with so many other variables, it’s hard for me to say whether initializing the disk really made a measurable difference or not. Although it is only tangentially related, I found this article (also linked above) about EC2 performance modelling to be quite helpful.
Note also that the first-read technique only applies to EBS disks, not ephemeral ones.
2. Unless you first fill an ephemeral local drive with data, you will get ‘free’ reads against it. So, doing a read-only benchmark on a virgin ephemeral drive will give excellent, but inaccurate, results.
