The best serialization strategy for Event Sourcing

Andrzej Ludwikowski
SoftwareMill Tech Blog
6 min read · Jun 19, 2019


When implementing an Event Sourced system, at some point you will need to answer two major questions:

  1. Where to store events?
  2. How to store events?

I will address the first question in a separate post. Right now I want to focus on the second one, and try to analyze different serialization strategies for events (and snapshots) to store and read them effectively.

Probably the majority of people would choose JSON, a pretty obvious preference. It’s easy to implement and supported natively by some databases like MongoDB and PostgreSQL. Any technology stack can parse or produce it. JSON is heavily used, has many advantages, and makes a perfectly good default choice.

What if I told you that there are other options? No, no, no, I’m not thinking about XML or YAML. There are plenty of binary serialization options instead. Let’s start with a plain text vs. binary serialization comparison.

The first advantage of plain text serialization is that the format is human-readable. It’s quite convenient to select some events from the DB and analyze them directly in the query result. A binary format requires an additional step, where bytes are transformed into something readable. Is this a real problem? Usually, when you need to debug something in production, you will very quickly create a tool that does this for you automatically. Of course, when choosing a binary format, such a tool should be created upfront, but there’s no better motivation than an issue in production.

Plain text formats, especially JSON, also have some problems with number precision; just follow these links: JSON IEEE 754, DoS.

A binary format is, obviously, more compact, so storage usage will be lower. The actual savings depend on your event model; in my case it was 60% to 70% of disk space. Someone could argue that storage is cheap and this point is irrelevant. In general that’s true: storage prices usually go down, but at the same time the amount of data we gather is growing even faster. Not to mention the data replication required by some databases like Cassandra. Also, if you check the prices of fast and large SSD drives for database purposes, the actual costs are not so small, in my opinion.

All benchmarks, at the time of writing this post, are pretty consistent: binary serialization is much faster than plain text serialization. This shouldn’t be a surprise, since plain text formats require a lot of additional steps when serializing or deserializing something (check here and here for some details).

The last point is schema evolution support. This is an extremely important feature for serializing and deserializing events, because it will allow you to version your events without typical versioning pains (postfixes like _V1, migrations, version adapters, etc.).
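To make that pain concrete, here is a hedged sketch (all class names are hypothetical) of what event versioning tends to look like without schema evolution support:

sealed trait UserEvent
// Every schema change spawns a new class with a version postfix...
case class UserCreated_V1(id: String, email: String) extends UserEvent
case class UserCreated_V2(id: String, email: String, country: Option[String]) extends UserEvent

object UserEventUpcasters {
  // ...plus an adapter that must be kept around forever so old events stay readable.
  def upcast(event: UserEvent): UserCreated_V2 = event match {
    case e: UserCreated_V2         => e
    case UserCreated_V1(id, email) => UserCreated_V2(id, email, country = None)
  }
}

With a serialization format that supports schema evolution, adding the new field with a sensible default is usually all it takes.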

Clearly, binary protocols are a pretty enticing option, but which one should we choose?

  1. Java serialization
  2. Kryo
  3. Thrift
  4. Protocol Buffers
  5. Avro

Java serialization — let’s not waste time on this horrible mistake.

Kryo — very fast, very compact, but it works only on JVM, there is no point in limiting our infrastructure to only JVM applications. Maybe some crazy NodeJS developer would also like to read our events.

Thrift — from Facebook; when it comes to functionality it is almost the same as Google’s Protocol Buffers, but subjectively Protobuf is easier to use. Its documentation is very detailed and extensive, and support and tooling for Java and Scala are on a very good level. That’s why I have chosen Protocol Buffers vs. Avro (from the Hadoop ecosystem) for the final comparison.

The detailed comparison can be found in this great post by Martin Kleppmann, so there is no point in describing these technologies once again; instead, let’s focus on aspects like speed, size and schema evolution.

Speed*

The difference is so small that it can be ignored. Both solutions are really fast.

Size*

Serialized payload size is more or less the same for both technologies.

Schema evolution

A tie again. Both Avro and Protocol Buffers give you full compatibility support. Backward compatibility is necessary for reading old versions of events. Forward compatibility is required for rolling updates: old and new versions of events can be exchanged between microservice instances at the same time.

+--------------+------------------+-------------------+
|              | Protocol Buffers | Avro              |
+--------------+------------------+-------------------+
| Add field    | + (optional)     | + (default value) |
| Remove field | + (optional)     | + (default value) |
| Rename field | +                | + (aliases)       |
+--------------+------------------+-------------------+

Adding, removing or even renaming a field is not a problem. From my perspective it is easier to achieve in Protocol Buffers (aliases and default values in Avro can sometimes be quite challenging), but it is possible with both options.
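For example, on the Avro side adding a field in a backward-compatible way boils down to giving it a default value. A minimal sketch, assuming avro4s (which can pick up case class default values as Avro schema defaults) and a hypothetical event class:

import com.sksamuel.avro4s.AvroSchema

// New shape of an event that previously had only "id" and "email".
// The default lets events written with the old schema still be decoded.
case class UserCreated(id: String, email: String, country: String = "unknown")

val schema = AvroSchema[UserCreated] // the derived schema carries the default for "country"

On the Protocol Buffers side, proto3 fields are optional by default, so a field missing from old data simply gets its type’s default value when read.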

Schema management

What is the difference then? Well, the biggest difference between Avro and Protocol Buffers is how they deal with schema management. In Protobuf you need to create a schema first. Then, using tools like ScalaPB you will compile the schema and generate Scala/Java classes (as well as parsers and serializers for them).
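As a rough sketch of that workflow, assuming a hypothetical UserCreatedEvent message with id and email fields, the generated class can be used roughly like this:

// Hypothetical ScalaPB-generated class for:
//   message UserCreatedEvent { string id = 1; string email = 2; }
import com.example.events.UserCreatedEvent // package name is an assumption

val event = UserCreatedEvent(id = "42", email = "jan@example.com")
val bytes: Array[Byte] = event.toByteArray     // serialize to Protobuf binary
val parsed = UserCreatedEvent.parseFrom(bytes) // parse it back from bytes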

And now the only thing you need to do is to translate your rich domain event to the Protobuf version.

In the end, you will write a lot of methods like:

def toDomain(event: UserCreatedEvent): UserEvent.UserCreated
def toSerializable(event: UserEvent.UserCreated): UserCreatedEvent
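A minimal sketch of such a translation layer, assuming a rich domain event UserEvent.UserCreated with UserId and Email value types and a Protobuf-generated UserCreatedEvent with plain string fields (all of these names are hypothetical):

object UserEventSerialization {

  def toDomain(event: UserCreatedEvent): UserEvent.UserCreated =
    UserEvent.UserCreated(
      id = UserId(event.id),   // wrap raw strings back into rich domain types
      email = Email(event.email)
    )

  def toSerializable(event: UserEvent.UserCreated): UserCreatedEvent =
    UserCreatedEvent(
      id = event.id.value,     // unwrap rich types into Protobuf primitives
      email = event.email.value
    )
}

Tedious, but it keeps the generated Protobuf classes out of the domain model.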

The Avro approach is, at first glance, much simpler. From your rich domain event you can generate an Avro schema thanks to libraries like avro4s.

Of course, some custom mappings for types like UserId, Email, etc. will be necessary, similar to what most JSON mapping frameworks require.
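A minimal sketch of the avro4s side, with hypothetical domain types: by default avro4s derives nested records for UserId and Email, and custom SchemaFor/Encoder/Decoder instances can be provided when you want them stored as plain strings instead.

import com.sksamuel.avro4s.AvroSchema
import org.apache.avro.Schema

// Hypothetical rich domain event.
case class UserId(value: String)
case class Email(value: String)
case class UserCreated(id: UserId, email: Email)

// The Avro schema is derived directly from the case class tree.
val schema: Schema = AvroSchema[UserCreated]
println(schema.toString(true)) // pretty-printed schema as JSON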

Summary?

What should you choose then? Avro, especially at the beginning, seems much easier to use. The cost is that you will need to provide both the reader and the writer schema to deserialize anything.

Sometimes this might be quite problematic. That’s why tools like Schema Registry were developed.
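To make that requirement concrete, here is a minimal sketch using the plain Avro Java API: deserialization needs both the schema the event was written with and the schema the current code expects, and a Schema Registry is simply a place to fetch the former from.

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory

// Both schemas are assumed to come from the event store or a schema registry.
def deserialize(bytes: Array[Byte], writerSchema: Schema, readerSchema: Schema): GenericRecord = {
  // Avro resolves the differences between the two schemas (defaults, aliases, reordered fields).
  val datumReader = new GenericDatumReader[GenericRecord](writerSchema, readerSchema)
  val decoder = DecoderFactory.get().binaryDecoder(bytes, null)
  datumReader.read(null, decoder)
}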

The next problem you might face with Avro is its overall impact on your domain events. At some point, Avro can leak into your domain. Some constructs, e.g. a map with keys that are not strings, are not supported by the Avro model. When the serialization mechanism forces you to change something in your domain model, it’s not a good sign.

With Protocol Buffers, schema management is much simpler: you just need a schema artifact, which can be published like any other artifact to your local repository. Also, your domain can be perfectly separated from the serialization mechanism. The cost is the boilerplate code required for translation between the domain and serialization layers.

Personally, I would use Avro for simple domains with mostly primitive types. For rich domains, with complex types and structures, I’ve been using Protocol Buffers for quite some time. A clean domain with no serialization influence is really worth the boilerplate code price.

* Benchmarks from:

