Just finished reading some upcoming posts by Ayende I think they will be out this week and are an interesting series. They are in regard to durability of writes in particular with a journal. In dealing with Event Store we have done a lot of work with this. In particular his discussions get into looking at the differences between appending a log in batches with occasional fsync/flushfilebuffers vs other methods such as o-direct/write-through etc.
If only it were so simple. IO is hard. There are many places between you and the disk and getting everything right (pun intended) requires a ton of testing/knowledge. It gets even more fun when you consider the number of subsystems under you (are you even talking to a local disk? does that disk have caching? is the caching safe? does the controller ignore fsyncs?)
By the way: promotion this is actually the subject of my talk at Build Stuff in Vilnius Dec 9-11 hope to see you there😉
Let’s start with what is good about batch writes + fsync, this is the default in event store. It is the most compatible and likely to work on an unknown system. Forget about even locally connected drives how will directio work with 4kb page aligned writes when you are talking to a networked device (really your head will start hurting). It has however some big problems. The largest problem is that you are not a good neighbor. When you fsync you do not only fsync your file you fsync all writes that are going to that disk. This can cause all sorts of interesting latency issues where you get spikes because something else is writing to the disk. Your constant fsyncing will also affect performance of everything else running on the box. On a side note your constant fsyncing can sometimes make other systems that forget somewhat less likely to fail!
A second issue with fsyncing/flushfilebuffers is that it also flushes all metadata changes to the disk. This can cause a large number of writes that you may not need to be done to be written to the disk. This is especially bad when you consider that it can cause seeks in the process.
I have just finished implementing for windows O-DIRECT aka unbuffered IO for windows as an option. I have started working on the posix implementation as well. We will be running it through our circles of hell (power pulling clusters) to validate durability etc for a few weeks before release. Once done it will be available for everyone to use OSS is great!
This is one aspect that is often not considered in such decisions. Sure it may only be 500loc but have you actually made sure it really is durable?