Why Can’t I Update an Event

Last week on a call, the question came up of why the Event Store does not let you update an event, and how you should handle the cases where you need to. The conversation that came out of this was rich in architectural insight into how the Event Store works, as well as into event sourcing overall, so I thought it would be worth spending a bit of time writing up where the constraint comes from.

An Event Sourcing Perspective

Let’s start with why you would want to update an event. An event is a fact that happened at a point in time, with the understanding of it from that point in time; a new understanding of the fact is a new fact (naturally this does not apply to politicians). Updating a previous event is generally a very bad idea. Many want to go back and update events to new versions, but this is not the best way to handle versioning!

The preferred mechanism from an event sourcing perspective is to write a new fact that supersedes the old fact. As an example, I could write that event 7 was a mistake and this is a correction; I might as well put in a comment “this was due to bug #1342” (similar to a correcting journal entry in accounting). There are a few reasons this is a better way of handling things.
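
As a sketch of what such a correction might look like (the stream shape and field names here are hypothetical illustrations, not the Event Store API):

from datetime import datetime, timezone

def record_correction(stream, superseded_position, corrected_data, reason):
    # Append a new fact that supersedes the old one; the old event
    # stays in the stream exactly as it was written (comparable to a
    # correcting journal entry in accounting).
    stream.append({
        "position": len(stream),
        "type": "EventCorrected",
        "supersedes": superseded_position,   # e.g. event 7 was a mistake
        "data": corrected_data,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "reason": reason,                    # e.g. "this was due to bug #1342"
    })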

The first is my ability to look back at my history. If I were to change the fact and then look back at that point in time, I would have changed what it meant. What about others who made decisions at that point in time? I can no longer see what it was they made decisions from. Beyond this, I might have a perfectly valid query to ask of my event streams: “how long on average does it take us to find bugs in previous events?”
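
That query is answerable precisely because corrections are appended as new facts. A minimal sketch, assuming each correction carries a supersedes pointer and a recorded_at timestamp as in the sketch above:

from datetime import datetime, timedelta

def average_time_to_find_bugs(events):
    # For every correction, measure the gap between when the mistaken
    # fact was recorded and when its correction was recorded.
    by_position = {e["position"]: e for e in events}
    lags = []
    for e in events:
        target = by_position.get(e.get("supersedes"))
        if target is not None:
            lags.append(
                datetime.fromisoformat(e["recorded_at"])
                - datetime.fromisoformat(target["recorded_at"])
            )
    return sum(lags, timedelta()) / len(lags) if lags else None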

The second model also leads us to the two types of queries supported on event streams (as-of vs as-at): as-at asks what we believed at that point in time, while as-of asks what we now know about that point in time.
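
A minimal sketch of the difference under those definitions (the event shape, with position and supersedes fields, is mine):

def replay_as_at(events, position):
    # As-at: the world as we believed it to be at that point in the
    # stream; corrections recorded later are deliberately ignored.
    return [e for e in events if e["position"] <= position]

def replay_as_of(events, position):
    # As-of: our best current knowledge about that point in time;
    # later corrections to earlier facts are applied.
    view = {e["position"]: e for e in events if e["position"] <= position}
    for e in events:
        s = e.get("supersedes")
        if s is not None and s in view:
            view[s] = e  # a later fact replaces our understanding of an old one
    return list(view.values())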

Beyond that, with Event Sourcing the updating of an event is inherently evil. How do you tell any projections that the update occurred? What about other subscribers who may be listening to the streams? An easy answer might be to fully replay everything involved with the stream, but this quickly falls apart.

These are the primary reasons why the Event Store does not support an update operation on an event. There are however some wonderful architectural benefits that come from this particular constraint.

Architectural Goodness

If we prevent an event from ever being updated, what is the cacheability of that event? Yes, it is infinite. The Event Store supports a RESTful API (Atom). All events served from the Event Store are infinitely cacheable. What does that mean?
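
Concretely, at the HTTP level this comes down to which Cache-Control headers each URI gets (a sketch of the policy, not the Event Store’s actual code):

ONE_YEAR = 31536000  # effectively "forever" in HTTP terms

def cache_headers(is_head_uri):
    if is_head_uri:
        # The head of a stream changes as new events are appended,
        # so it must always be revalidated.
        return {"Cache-Control": "max-age=0, no-cache"}
    # An event can never change, so any proxy or browser cache
    # between the reader and the server may keep it forever.
    return {"Cache-Control": "max-age=%d, public" % ONE_YEAR}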

Imagine you have a projection updating a SQL table that has been running for the past eight weeks. You make a change and need to restart it (replaying from event 0). When the replay occurs and it requests events from the Event Store, where do they likely come from? Your hard drive! You don’t make requests to the Event Store for them.

Beyond the events being infinitely cacheable, if you look through our Atom implementation, every single request we serve, with the exception of the head URI (http://somewhere.com/streams/{stream}), is in fact also infinitely cacheable. In other words, when you want to reread $all (say, for 5M events), you will hit exactly one non-cacheable request!
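
A sketch of what a reread then looks like from the client’s side; the dictionary stands in for your HTTP cache or reverse proxy, and the "events"/"older" field names are mine, not the actual Atom feed format:

import json
from urllib.request import urlopen

http_cache = {}  # stands in for the local HTTP cache / reverse proxy

def get(uri, cacheable=True):
    if cacheable and uri in http_cache:
        return http_cache[uri]            # never reaches the Event Store
    body = json.load(urlopen(uri))        # only cache misses hit the network
    if cacheable:
        http_cache[uri] = body
    return body

def read_stream(head_uri):
    # Rereading a stream: the head is the single non-cacheable request;
    # every archived page it links to is served out of the cache.
    page = get(head_uri, cacheable=False)
    events = list(page["events"])
    while page.get("older"):
        page = get(page["older"])         # infinitely cacheable page
        events = page["events"] + events
    return events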

This is very important when we start talking about scalability and performance. The Event Store can fairly easily serve 3-5k Atom requests/second on a laptop (per node in the clustered version), but how many requests will actually reach the Event Store? To scale, you focus on putting commoditized reverse proxies in front of the Event Store, not on scaling the Event Store itself. nginx or varnish can easily saturate your network; just drop them in front and only head calls make it through (and there is even a per-stream setting to allow head links to be cached for x seconds).

This is often a hard lesson for developers to learn. More often than not you should not try to scale your own software but instead prefer to scale commoditized things. Building performant and scalable things is hard, and the smaller the surface area the better. Which is the more complex problem: a basic reverse proxy or your business domain?

This also affects the performance of replays for subscribers, as you can place proxies near the subscribers (a local HTTP cache is a great start!). This is especially true for, say, an occasionally connected system. Gmail uses this trick to provide “offlining out of the box” for web apps. Since much of the data will already be in the HTTP cache, most requests will hit it; in many cases you can build, say, a mobile app with no further support.

Over Atom, if we allowed updates, NO URIs could be cacheable!

This is all cool but I actually need to update!

Occasionally there might be a valid reason why an event actually needs to be updated. I am not sure what those reasons are, but I imagine some must exist. There are a few ways this can actually be done.

The generally accepted way of handling the scenario while staying within the no-update constraint is to create an entirely new stream, copying all the events from the old stream (manipulating them as they are copied), and then delete the old stream. This may seem like a PITA, but remember all of the discussion above (especially about updating subscribers!).
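
A sketch of that copy-replace recipe against a hypothetical client (read/append/delete are stand-ins, not the actual Event Store API):

def copy_replace(store, old_stream, new_stream, transform):
    # The accepted way to "update" under the no-update constraint:
    # write a corrected copy of every event to a new stream, then
    # delete the original stream.
    for event in store.read(old_stream):
        store.append(new_stream, transform(event))
    store.delete(old_stream)
    # Projections and subscribers must then be repointed at new_stream;
    # that is exactly the pain discussed above.

# Example transform: scrub a field that should never have been recorded.
def scrub_ssn(event):
    event["data"].pop("ssn", None)
    return event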

Some, however, may be using the TCP API and are willing to take the pain and complexity that come with subscribers (you can imagine they have none). In this one case, updates would be acceptable and simpler than adding new events. We have been going back and forth on whether or not to support this. It would not be much work for us at all, but I imagine it would be misused 1000 times for every 1 time it was used reasonably. I am reminded of being able to call .NET code from a BizTalk orchestration, or being able to execute .NET code inside my SQL database: both have valid uses but should rarely be used. Perhaps we will make a command line parameter --run-with-scissors-updates, or make people build from source to enable it.

Events and Generic Formats

There was some interesting discussion before I left for Nepal about event stores. The general question is “can you have a generic event log, similar to a transaction log in a database?” A related question is: what is the difference between an event log and a transaction log?

Having an event log is not a new idea; it has been around for decades. Databases do something very similar inside their own transaction logs. The major difference between an event log and a transaction log is that, by definition, an event log also captures intent. Consider the difference between:

INSERT:
RecordType: Customer
Id: 17
Name: Greg
Status: Normal

and

CustomerCreated:
Id: 17
Name: Greg
Status: Normal

There are many semantic and linguistic differences between these two concepts. The first would be a transaction log and the second an event log. With a create these differences can be very subtle. Let’s try something less subtle.

UPDATE:
RecordType: Customer
Status: Gold
Id: 17

CustomerAutomaticallyPromotedToGoldStatus:
Id: 17

Here the intent is quite obviously different between the two messages. There could be a second event, CustomerManuallyOverriddenToGoldStatus, which represents a manual override of our algorithm for dealing with customer promotions. A subscriber may care *why* the customer was promoted. This concept of intent (or “why”) is what is being represented in the event log.

This is an important concept: if two different events can produce the same transition, then you are by definition losing information.
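
One way to see the loss: both events below drive the identical row update, so reduced to the generic form they become indistinguishable (a sketch using the event names from above):

events = [
    {"type": "CustomerAutomaticallyPromotedToGoldStatus", "id": 17},
    {"type": "CustomerManuallyOverriddenToGoldStatus", "id": 17},
]

def to_transaction_record(event):
    # Both facts cause the same state transition...
    return {"UPDATE": "Customer", "Status": "Gold", "Id": event["id"]}

# ...so the records come out identical: the "why" is gone, and a
# subscriber that cares why the customer was promoted can no longer tell.
assert to_transaction_record(events[0]) == to_transaction_record(events[1])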

Things get a bit hairy here though, and this is where the discussion started falling apart. I wish I could have dropped in a longer response, but I was travelling at the time. Can’t we model the first to be equivalent to the second? We see something similar in RESTful APIs:

UPDATE:
RecordType: Customer
Action: AutomaticPromotion
Id: 17

YES, you can do this! This produces a record that captures the intent as well. In fact, this is how my second event store worked. There are lots of reasons you may want to do this (such as the ability to use a generic state box on top in certain HA scenarios with the event store).

We can just consider this a different serialization mechanism. The key is that everything still maps directly back to a single event.
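
A sketch of that mapping; the lookup table is mine, but the point is that each generic record must map back to exactly one intent-carrying event:

# Hypothetical mapping from (RecordType, Action) back to an event type.
ACTION_TO_EVENT = {
    ("Customer", "AutomaticPromotion"): "CustomerAutomaticallyPromotedToGoldStatus",
    ("Customer", "ManualOverride"): "CustomerManuallyOverriddenToGoldStatus",
}

def deserialize(record):
    # Treat the generic update record as just another wire format:
    # intent survives because Action disambiguates the transition.
    event_type = ACTION_TO_EVENT[(record["RecordType"], record["Action"])]
    return {"type": event_type, "id": record["Id"]}

record = {"RecordType": "Customer", "Action": "AutomaticPromotion", "Id": 17}
assert deserialize(record)["type"] == "CustomerAutomaticallyPromotedToGoldStatus"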

Now let’s get back to that original question of “event log vs transaction log”. An event log includes business-level intent; this is not needed in a transaction log. An event log is a stricter definition of a transaction log. I don’t need to store intent in order to have a transaction log, though we could have a really interesting discussion about what the “business events” are in a database domain 🙂.

Is an event log a type of journal or transaction log? Yes. I like to think, though, that even if you use the generic update as in the third example above, it requires that you specify intent. Intent is a valuable thing. Can I build a transaction log that completely captures intent and does not lose information? Sure: think about a database with a “Transaction” table. I would say this is actually just a serialization mechanism with the intent of being an event log.

If I don’t store intent, there is an entire series of questions I can no longer ask of the data.

Case Studies

So there was a question on StackOverflow today that I got linked into. I decided to answer the question here instead.

There’s lots of information out there about CQRS and Event Sourcing. But who’s actually using it in practice / in production? I tried to find references on the Internet, but couldn’t find any.

(This is not really a programming question perhaps, but this seems to be the most appropriate place to ask. I got asked this yesterday when doing a presentation to colleagues on these topics.)

There is a lot wrong here. Event Sourcing in particular, and in conjunction with “CQRS” (note it’s basically a prerequisite if you want to query current state at all), is a very, very old concept. There are many thousands of systems in existence that do things this way. In fact, there are so many, and it is such a core concept, that writing up a case study on us “doing it this way” is frankly a waste of time.

That “transaction file” thing in your database? Event Sourcing. There are countless systems using these ideas. Smalltalk images are built up this way. Research traces the ideas back to the dark ages. You could probably get further back than that, but research time is expensive when you have lots of other things to do.

Deltas + Snapshots or Separating Reads from Writes are in no way new concepts.

Umm.

But before we get into the dysfunctions let’s ask a really basic question. Does a Case Study from “Some Marketing Guy” or Pretentious Ivory Tower Architect #14 at We’re So F#%*ing Large You Can’t Imagine How Big Our Towers Are Corporation have anything at all to do with the success or failure of a project on your team?

Have you ever actually read these so-called Case Studies? Most are terrible. Here, let’s try one: http://www.brainbench.com/pdf/CS_IBM.pdf, from Brain Bench about their software. Go on, read it. Is this a case study or a piece of marketing? Here, let’s try these instead: http://www-01.ibm.com/software/success/cssdb.nsf/topstoriesFM?OpenForm&Site=cognos&cty=en_us. How about from Oracle? http://www.oracle.com/technetwork/database/features/ha-casestudies-098033.html These are not “case studies”; they are marketing.

So people who are managing risk want marketing materials that try to look like research to help them make their decisions? Interesting risk mitigation strategy. Perhaps they are mitigating their-ass-gets-shown-the-door risk by trying to show they did something/anything to mitigate risk.

Beyond that, even if they were awesomely written, well-thought-out discussions (which they aren’t), what applicability do they have to your team and environment? If we are talking about a TOOL then yes, case studies may have merit (a switch/logging tool/etc in a similar organization, as an example), but to write case studies about a CONCEPT? Next thing you know someone will patent it and sue Oracle for violating their patent, and courts will award billions to the originator of the idea even though he’s been dead for 1000 years.

Think for a moment what my “Case Study” might look like for the simplest CQRS system: “We put our commands on this service and our queries on that service …”. Do you think the success or failure of the project was due to this decision? What I really want is a case study on the value of these case studies (with empirical data on how many perfectly good donuts have gone to waste).

The Real Game

Why are we getting these questions? What is the serious risk that people are trying to mitigate? It must be a pretty big risk if they want to do research and prove out this decision before looking at implementing it. If we are willing to spend 2 weeks on this we must be mitigating a pretty large risk later.

Dysfunction

The serious risk is that they want to implement CQRS + Event Sourcing everywhere. They want a cookie-cutter “architecture” (I use the term as loosely as possible) that they will follow everywhere. Yes, if you attempted to do this with CQRS + ES it would be a massive risk. That’s precisely why we don’t do this.

CQRS is not a top level architecture.

CQRS is applied within a BC/component/whatever people want to call things tomorrow. It is not applied globally. When we talk about applying things like CQRS and ES we must leave the tower. If I can rewrite the whole thing from scratch in 9 days, why are we spending two weeks “proving out” our ideas in meetings on whiteboards (the meetings are much more tolerable if you have coffee and donuts … but the best ones have a good fruit juice selection as well, for future reference)? There might be some core places where this kind of risk mitigation is justified, but they are few and far between.

The systems we discuss look at risk in a very different way. They are designed to be responsive to change and to minimize the cost of failure. Instead of spending the next 6 months designing the stuff, welcome to our world of we-actually-do-stuff. Let’s actually build out that thing we were talking about. A week or two later we have it done. It’s not some abstract picture on the wall. We actually did it. You would be amazed how much code you can write on a pair in a week.

I have had so many occasions as a consultant where I had people shut up and code. I have seen abstract ideas that had been discussed for two months implemented in an afternoon.

As Zed would say: It’s CEBTENZZVAT, ZBGURESHPXRE, DO YOU SPEAK IT? We get so far up our own asses drawing pretty pictures and abstractly discussing the most minute details of systems (of course before we generalize them to solve all the problems nobody has) that we forget our job is to actually do stuff.

Now this is not to say that we are without a net. We always manage risk, just in different ways. We reduce the cost of failures instead of doing loads of upfront analysis (sound familiar?). Our risk management lies in preventing high impact due to change (strategic design). I have to admit one thing that has always made me laugh (and this is common) is when people spend 6 months choosing which database/platform they will use, going through all sorts of “vetting” processes, and then have absolutely no concept of strategic design in their software.

“Well we spent all this time picking this stuff because its the decision we use everywhere”

Why is it that I can get management to accept that upfront analysis of what the software should do doesn’t work, but I can’t get them to accept the same concept about our harebrained upfront architectural risk decisions?

Think back to when you first started focusing on minimizing the cost of failure instead of preventing risk (it’s a common theme). Are you dealing with the type that actually wanted “case studies” and weeks of analysis to prevent the risk of adopting the very concept of minimizing cost of failure instead of preventing risk? OK then RUN! Hell, even in waterfall there are backwards-pointing arrows.

Yep!

Disclaimer: this post is deliberately far to one side in order to pull people back to reality. I do not actually believe that we should never do upfront risk mitigation; I believe that we do way too much of it.