Durability of Writes

I just finished reading some upcoming posts by Ayende. I think they will be out this week, and they make an interesting series. They are about durability of writes, in particular with a journal. We have done a lot of work on this in Event Store. In particular, his discussions look at the differences between appending to a log in batches with an occasional fsync/FlushFileBuffers and other methods such as O_DIRECT/write-through.

If only it were so simple. IO is hard. There are many layers between you and the disk, and getting everything right (pun intended) requires a ton of testing and knowledge. It gets even more fun when you consider the number of subsystems under you. Are you even talking to a local disk? Does that disk have caching? Is the caching safe? Does the controller ignore fsyncs?

By the way (promotion): this is actually the subject of my talk at Build Stuff in Vilnius, Dec 9-11. Hope to see you there 😉

Let’s start with what is good about batch writes + fsync, which is the default in Event Store. It is the most compatible option and the most likely to work on an unknown system. Forget about even locally connected drives: how will direct IO work with 4 KB page-aligned writes when you are talking to a networked device? (Really, your head will start hurting.)

It does, however, have some big problems. The largest is that you are not a good neighbor. When you fsync you do not only flush your own file; you can end up flushing all the writes that are going to that disk. This can cause all sorts of interesting latency issues, where you get spikes because something else is writing to the disk. Your constant fsyncing will also affect the performance of everything else running on the box. On a side note, your constant fsyncing can sometimes make other systems that forget to fsync somewhat less likely to fail!
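To make "batch writes + fsync" concrete, here is a minimal sketch in Python. It is not Event Store's code: the `BatchedJournal` class, the `batch_size` parameter, and the 4-byte length prefix are all made up for illustration, and it only shows the shape of the idea (many appends, one fsync per batch).

```python
import os

class BatchedJournal:
    """Hypothetical append-only log that fsyncs once per batch, not once per record."""

    def __init__(self, path, batch_size=8):
        # O_APPEND makes every write land at the current end of the file.
        # O_BINARY only exists on Windows, so default it to 0 elsewhere.
        flags = os.O_WRONLY | os.O_CREAT | os.O_APPEND | getattr(os, "O_BINARY", 0)
        self.fd = os.open(path, flags, 0o644)
        self.batch_size = batch_size
        self.pending = 0       # records written since the last fsync
        self.fsync_calls = 0   # kept only so the batching is observable

    def append(self, record: bytes) -> None:
        # Length-prefix each record so a reader can find record boundaries.
        data = len(record).to_bytes(4, "little") + record
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested, so loop.
            written = os.write(self.fd, view)
            view = view[written:]
        self.pending += 1
        if self.pending >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        # Records are only considered durable after this returns, and even
        # then only if every layer below actually honors the flush.
        if self.pending:
            os.fsync(self.fd)
            self.fsync_calls += 1
            self.pending = 0

    def close(self) -> None:
        self.flush()
        os.close(self.fd)
```

With `batch_size=8`, appending 20 records issues two fsyncs along the way and a third on `close()`. The trade-off is that up to `batch_size - 1` records may be written but not yet flushed at any moment, so a caller that needs an acknowledgement has to wait for `flush()`.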

A second issue with fsync/FlushFileBuffers is that it also flushes all metadata changes to the disk. This can cause a large number of writes that you may not need, and it is especially bad when you consider that those writes can cause seeks in the process.
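One common way to reduce the metadata traffic on POSIX systems is `fdatasync`, which flushes the file data but skips metadata that is not needed to read the data back (timestamps, for example). The sketch below is an illustration of that general technique, not a description of what Event Store does. The helper names `open_preallocated` and `write_record` are hypothetical, and it falls back to `fsync` on platforms where `os.fdatasync` is missing.

```python
import os

# Assumption: where fdatasync does not exist (e.g. macOS, Windows), use fsync.
sync_data = getattr(os, "fdatasync", os.fsync)

def open_preallocated(path, size):
    """Open a file and set its final length up front (hypothetical helper)."""
    flags = os.O_RDWR | os.O_CREAT | getattr(os, "O_BINARY", 0)
    fd = os.open(path, flags, 0o644)
    if os.fstat(fd).st_size < size:
        # ftruncate only sets the length; the file may be sparse, so the
        # filesystem can still need to allocate blocks on later writes.
        os.ftruncate(fd, size)
        os.fsync(fd)  # one full fsync so the new length itself is persisted
    return fd

def write_record(fd, offset, record):
    """Write record at offset, flush data only, and return the next offset."""
    os.lseek(fd, offset, os.SEEK_SET)
    view = memoryview(record)
    while view:
        view = view[os.write(fd, view):]
    sync_data(fd)  # the file length does not change, so less metadata is dirty
    return offset + len(record)
```

Because the file length is fixed before the writes start, the per-record flush has less metadata to push out. How much this helps depends heavily on the filesystem, so treat it as something to measure rather than a guarantee.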

I have just finished implementing the Windows equivalent of O_DIRECT, a.k.a. unbuffered IO, as an option. I have started working on the POSIX implementation as well. We will be running it through our circles of hell (power-pulling clusters) for a few weeks to validate durability and so on before release. Once that is done it will be available for everyone to use. OSS is great!
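Part of what makes unbuffered IO fiddly is the alignment bookkeeping: the buffer address, the write length, and the file offset generally all have to be multiples of the device's sector size. Here is a small Python sketch of just that bookkeeping. The 4096-byte `SECTOR` value and the helper names `round_up` and `aligned_block` are assumptions for illustration; a real implementation would query the device for its sector size and would not be written in Python.

```python
import mmap

SECTOR = 4096  # assumed alignment; real devices may report 512 or another value

def round_up(n, align=SECTOR):
    """Round n up to the next multiple of align."""
    return (n + align - 1) // align * align

def aligned_block(payload, align=SECTOR):
    """Copy payload into a zero-padded buffer sized to a multiple of align.

    An anonymous mmap starts on a page boundary (mmap.PAGESIZE), which
    satisfies the address alignment whenever the page size is a multiple
    of align.
    """
    size = round_up(max(len(payload), 1), align)
    buf = mmap.mmap(-1, size)        # anonymous mapping, zero-filled
    buf[:len(payload)] = payload     # the remainder stays as zero padding
    return buf
```

A 3-byte payload therefore becomes a full 4096-byte write, and a 4097-byte payload becomes 8192 bytes. On Linux a buffer like this could be passed to `os.write` on a descriptor opened with `os.O_DIRECT`; on Windows the corresponding flag is `FILE_FLAG_NO_BUFFERING`. The padding is also why the log format has to record the real payload length somewhere.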

This kind of validation is one aspect that is often not considered in such decisions. Sure, it may only be 500 LOC, but have you actually made sure it really is durable?

One Comment

  1. Posted November 19, 2013 at 11:41 am

    Having fast, truly persistent storage as a library would be awesome, especially under a permissive license.
    Publish the link as soon as you have something.
