Comparing Apache Kafka and Apache Pulsar

When to use Pulsar and when to use Kafka, and why.

Jaroslaw Kijanowski
SoftwareMill Tech Blog


We have been incorporating Kafka into software architectures since 2015, when it was at version 0.8 and didn't even have a major release yet. Being a Confluent Partner, at SoftwareMill we rely on Kafka in various commercial projects, and it has proved to be a reliable tool for data streaming. However, there's a new kid on the block promising to be quite competitive, so we decided to evaluate Pulsar and compare it with Kafka. Here are our findings, especially from a ready-for-the-enterprise point of view.

Kafka and Pulsar measuring their strength — photo by Goutham Ganesh Sivanandam on Unsplash

What is Apache Pulsar?

Quoting the home page:

Apache Pulsar is an open-source distributed pub-sub messaging system

Sounds familiar. And there are more similarities — Pulsar Functions (sounds like Kafka Streams), horizontal scalability, strong durability guarantees, geo-replication, authentication, authorisation and quotas, client libraries and operations utilities. One thing that is fundamentally different is the persistent storage. While in Kafka logs are persisted on the brokers, Pulsar uses Apache BookKeeper — more on this later.

For a deeper dive, I can recommend the blog post Understanding How Apache Pulsar Works.

Architecture comparison

At first glance Pulsar’s architecture seems to be very complicated:

  • a Pulsar instance consists of several clusters
  • each cluster consists of several brokers, a Zookeeper cluster and a BookKeeper cluster
  • the instance also needs an additional Zookeeper cluster to maintain global configurations

In practice, most use cases don't need a multi-cluster setup, so things are a bit simpler. Still, compared to Kafka it's more complicated — this is the price for decoupling brokers from the storage.

http://pulsar.apache.org/docs/en/concepts-architecture-overview/

On the other hand, Kafka's architecture is less complex. There is even an ongoing effort to remove ZooKeeper, which will make Kafka's setup even simpler!

Storage model differences

Although both systems are designed to store messages for an unlimited period of time, they differ with respect to the layer where messages are stored. Kafka uses a log that is distributed among the brokers forming a cluster. Pulsar delegates persistence to a separate system, Apache BookKeeper. This turns out to be a real advantage, especially when it comes to scaling, as explained a bit later. Another tempting feature enabled by BookKeeper is tiered storage: old and less frequently used data can be offloaded to slower and more cost-effective storage like Amazon S3, so storage space can be extended as far as your budget allows.

Message consumption flexibility

Kafka is a streaming platform. It has the concept of topic partitioning, which allows parallel consumption at the topic level and guarantees ordering at the partition level.

However, you cannot distribute the load by simply adding more consumers. Their number is limited by the number of partitions, so you should plan ahead and over-partition by default, since increasing the number of partitions later on may be tricky when message keys play a role in ordering or when uneven partitions need to be rebalanced.

Kafka was not designed to be a queue, though. It does not allow acknowledging individual messages, only everything up to a certain point — the committed offset (sometimes referred to as the watermark).
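
A minimal sketch of this behaviour with the Java consumer API — the topic and group id are made up: committing offset N acknowledges everything in that partition before N, and there is no way to acknowledge a single message in the middle:

```java
import java.time.Duration;
import java.util.*;
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.TopicPartition;

public class OffsetCommitSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "invoice-processors");   // hypothetical group
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("invoices"));     // hypothetical topic
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                process(record);
                // Committing offset N acknowledges *all* messages of this partition up to N - 1;
                // a single message in the middle cannot be acknowledged on its own.
                consumer.commitSync(Collections.singletonMap(
                        new TopicPartition(record.topic(), record.partition()),
                        new OffsetAndMetadata(record.offset() + 1)));
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.println(record.value());
    }
}
```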

This is where Pulsar shines. It has four different kinds of message consumption concepts also known as subscriptions.

  • Exclusive and failover subscriptions cover the streaming use cases, where ordering per partition is required.
  • A shared subscription, on the other hand, allows throwing in many so-called competing consumers for a given partition to parallelize the processing of stored messages — a typical use case where a queue is required. This has been explained in a very detailed comparison of Kafka's and Pulsar's messaging models.
  • The last kind of subscription is key_shared, where, as in shared mode, multiple consumers can fetch messages, but messages with a particular key will be processed by only one consumer. This approach does not require the topic to be partitioned. Be aware, though, that there is an open issue making this type of subscription useless. A minimal client sketch follows this list.
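
Here is a minimal client sketch — the topic, subscription names and the localhost URL are made up — showing that the choice between streaming and queueing semantics boils down to a single enum value when building the consumer:

```java
import org.apache.pulsar.client.api.*;

public class SubscriptionSketch {
    public static void main(String[] args) throws PulsarClientException {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")          // assumed local broker
                .build();

        // Streaming semantics: one active consumer, ordering preserved (Failover adds a standby).
        Consumer<byte[]> streamConsumer = client.newConsumer()
                .topic("payments")                              // hypothetical topic
                .subscriptionName("payment-stream")             // hypothetical subscription
                .subscriptionType(SubscriptionType.Exclusive)   // or SubscriptionType.Failover
                .subscribe();

        // Queueing semantics with per-key ordering: many consumers, same key -> same consumer.
        Consumer<byte[]> keySharedConsumer = client.newConsumer()
                .topic("payments")
                .subscriptionName("payment-workers")
                .subscriptionType(SubscriptionType.Key_Shared)  // or plain SubscriptionType.Shared
                .subscribe();

        streamConsumer.close();
        keySharedConsumer.close();
        client.close();
    }
}
```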

How strong is the community?

According to Confluent:

Kafka has over half a million words of official documentation, 13 textbooks, a rich site of tutorials, demos, podcasts, and video tutorials, more than 18,000 questions on Stack Overflow, online courses from Confluent, Udemy and more.

That's true — Confluent has made a huge investment in content marketing. Additionally, the quality of blog posts, videos, demos and online courses by Tim Berglund, Robin Moffatt and Stephane Maarek is top-notch. Until Pulsar finds a backing company willing to invest on this front, the documentation and a rather small community must suffice. One Pulsar Summit has already been held 👍

Enterprise support and ecosystem overview

Speaking of a backing company — in the case of Kafka, you can get managed solutions from Amazon (AWS MSK) and from Confluent itself, just to name a few. This is no surprise, Kafka having been in the field for so many years. As a consequence, there are also several by-products available when employing Kafka in your architecture: the Schema Registry, ensuring only valid and backward/forward compatible messages get into the broker; the REST Proxy, useful when there are no native Kafka clients available for a given platform; Kafka Streams and ksqlDB, extending Kafka to a full-blown streaming platform; Kafka Connect, providing capabilities to ingest and export data; and the Control Center for operations.

That doesn't mean Pulsar has lost this one. Kesque, formerly known as Kafkaesque (pun intended, I assume), looks very promising, especially since they provide a clear pricing table! Another company, StreamNative, recently emerged and is run mainly by Pulsar and BookKeeper developers. Finally, Splunk acquired Streamlio. The question is whether these investors are willing to take Pulsar to the next level, like Confluent already did with Kafka.

There is a major shift towards cloud-based computing, and Confluent is on its way. They already provide Kafka in the cloud, deployed on the major cloud providers, and are now taking the next step with Project Metamorphosis — providing event streaming capabilities to the end user, which requires extending the current solution. Sounds a little bit mysterious, and indeed it is. So far two features have been revealed: elasticity and cost effectiveness.

The ease of scalability

In theory it's possible to scale a Kafka cluster up and down by adding or shutting down nodes. However, if a Kafka cluster loses a node, it expects it to come back online soon and does not perform any re-balancing of existing partitions. The same is true when adding a node: Kafka doesn't move data to new brokers automatically, which results in unbalanced leaders and disk utilisation across brokers. The confluent-rebalancer tool and the commercial Confluent Auto Data Balancer can restore balance in a cluster, and you need to get familiar with them. Downscaling a Kafka cluster deployed on Amazon MSK is not supported at all.
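
To give an impression of what such re-balancing boils down to, here is a rough sketch using Kafka's Admin API (available since Kafka 2.4). The topic, partition and broker ids are invented, and in practice you would let confluent-rebalancer or the kafka-reassign-partitions tool compute the target assignment:

```java
import java.util.*;
import org.apache.kafka.clients.admin.*;
import org.apache.kafka.common.TopicPartition;

public class ManualReassignmentSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // Move partition 0 of a hypothetical topic onto brokers 2 and 3,
            // e.g. after another broker has been decommissioned.
            Map<TopicPartition, Optional<NewPartitionReassignment>> reassignment =
                    Collections.singletonMap(
                            new TopicPartition("invoices", 0),
                            Optional.of(new NewPartitionReassignment(Arrays.asList(2, 3))));

            admin.alterPartitionReassignments(reassignment).all().get();
        }
    }
}
```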

Pulsar brokers on the other hand are stateless — they are not responsible for storing messages on their local disk. Therefore spinning up or tearing down brokers is much less of a hassle.

The separation between brokers and storage allows scaling these two layers independently. If you require more disk space to store more or bigger messages, you scale only BookKeeper. If you have to serve more clients, you just add more Pulsar brokers. With Kafka, adding a broker extends both the storage capacity and the serving capabilities.

Other issues may arise when you have a lot of partitions — in Kafka terms, a lot means around 200,000. The more partitions, the more open file handles, the longer the unavailability when leader election is triggered by an unclean shutdown of a node, and the higher the end-to-end latency due to replication. Pulsar, on the other hand, promises to scale up to millions of topics. It supports partitioned topics as well, implemented as internal topics, one per partition. But since most of the mentioned issues are related to the way data is stored, Pulsar is not affected by them. I wonder, however, how BookKeeper deals with these problems. And this leads us to:

Lies, damn lies and benchmarks

Of course my system is faster because it leverages zero-copy reads and can process a trillion messages a day — mkaaay.

On the other hand benchmark tests have shown that Pulsar delivers higher throughput along with lower and more consistent latency. Faster and more consistent is better. Srsly?

But let's dig into the details. Zero-copy reads in Kafka are nothing more than omitting the application layer when moving data from disk to the network socket: the broker is asked to serve some data, makes a request to the disk, and the response goes straight to the network layer. But there is more — this approach leverages the page cache during reads as well as writes, which means that recently written data is usually already in the cache and doesn't have to be read from disk.

Pulsar has a different approach called tailing reads. When a consumer requests recent data, it is not served from the storage layer (BookKeeper, over the network) at all, but from an in-memory cache local to the broker. Depending on how you look at it, this is almost the same approach (cache is cache) or a totally different one (a page cache maintained by the OS vs. an in-memory cache at the application layer).

Therefore, do your own performance tests for latency or throughput — whatever you want to tune — with real data and production-specific parameters like a higher replication factor. That will tell you the truth. You can either use the OpenMessaging Benchmark or write your own client, preferably deployed in your target environment — be it cloud or on-premise.

And if you find any of these two systems not keeping up with your load — give us a shout — we’d be more than happy to assist you with your challenging project!

What DevOps needs to know — Kubernetes support

Pulsar's documentation provides very good recipes for installing Pulsar on Amazon AWS, bare metal (single- or multi-cluster), DC/OS and Kubernetes. Especially the last option is interesting — Kubernetes is becoming a standard for new systems. If you want to understand how a Pulsar instance is set up, I recommend starting with the bare metal installation and doing everything manually before reaching for tools like Terraform or Ansible (yes, the documentation provides a Git repository with all the magic!).

The Kubernetes installation uses Helm. There is no official binary Helm package, so one has to clone the whole Pulsar repository to install it — the Helm chart is a part of the source. Not a very convenient solution…

By default helm runs 18 pods, 9 services and 11 persistent volume claims consuming 290 GiB. It’s not a toy :)

Contrary to Kafka, there is no such thing as a Pulsar Kubernetes Operator.

Remaining technical differences between Apache Kafka and Apache Pulsar

Links are at the end, but here’s a tl;dr:

Interoperability

One of the most powerful Kafka features is the Connect API, which allows creating source and sink connectors. Basically, one can read data from external systems and write it into Kafka (source), or consume from a Kafka topic and write data to an external system (sink).

This concept exists also in Pulsar and is called Pulsar IO.

The really big advantage of Pulsar is its built-in connectors — it comes with a set of ready-to-run plugins allowing data exchange with many popular systems. Of course one can develop custom connectors, too!

Finally, Pulsar provides a Kafka adapter that allows reusing application code written in Java against the Kafka API in Pulsar applications.
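
To give a rough idea of what the adapter means in practice: you keep the standard Kafka producer code and only swap the kafka-clients dependency for the adapter (published as pulsar-client-kafka), pointing bootstrap.servers at the Pulsar broker. The topic name and URL below are made up, and the exact dependency coordinates may differ between Pulsar versions:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.*;

public class KafkaAdapterSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // With the pulsar-client-kafka adapter on the classpath, the familiar Kafka
        // client API below talks to a Pulsar broker instead of a Kafka cluster.
        props.put("bootstrap.servers", "pulsar://localhost:6650");   // assumed Pulsar service URL
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("invoices", "order-42", "created")); // hypothetical topic
            producer.flush();
        }
    }
}
```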

Pulsar Functions

When running a Kafka connector, one is able to change the processed messages — this is done with Single Message Transforms.

Pulsar goes a step further and introduces Pulsar Functions. The concept is similar, but a function can run as a standalone process, whereas in Kafka you can change the message only within the connector.

Basically, a function reads messages from one or multiple topics, applies some processing logic and writes the result to an output topic. This feature covers some of the functionality provided by Apache Storm, Apache Heron or Apache Flink. Pulsar Functions operate on a single message only, and aggregations are limited to counters passed via a context. As of now, Functions do not support more complex features like joins, and windowing is available only for Java and isn't properly documented yet.
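
A minimal sketch of such a function using the Java SDK interface org.apache.pulsar.functions.api.Function — the counter name below is invented, and counters require function state to be enabled:

```java
import org.apache.pulsar.functions.api.Context;
import org.apache.pulsar.functions.api.Function;

/**
 * Upper-cases every incoming message and counts processed messages via the context.
 * Once deployed, the function reads from its configured input topic(s) and writes
 * the returned value to the configured output topic.
 */
public class UppercaseFunction implements Function<String, String> {

    @Override
    public String process(String input, Context context) {
        // Aggregations are limited to counters kept via the context (requires state to be enabled).
        context.incrCounter("processed-messages", 1);   // hypothetical counter name
        return input.toUpperCase();
    }
}
```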

Ordering

This one may be crucial and depends heavily on your use case. If you require the ordering of events to be preserved, Kafka supports it via partitions. In the case of Pulsar you can either employ partitions as well or use a key_shared subscription.

Retention

Although both Kafka and Pulsar can store messages for a longer period of time, it is Kafka that allows setting up smart compaction as a retention strategy (as opposed to taking snapshots and leaving the original topic as it is), while it is Pulsar that allows deleting messages on consumption. Most probably either hammer will do the job; storage capabilities are a topic to understand and set up properly rather than a basis for the final decision on which platform to choose.
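
For instance, compaction in Kafka is just a topic-level configuration. Here is a hedged sketch using the Admin API — the topic name, partition count and settings below are invented:

```java
import java.util.*;
import org.apache.kafka.clients.admin.*;

public class CompactedTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        Map<String, String> configs = new HashMap<>();
        configs.put("cleanup.policy", "compact");        // keep only the latest value per key
        configs.put("min.cleanable.dirty.ratio", "0.5"); // how eagerly the log cleaner kicks in

        try (Admin admin = Admin.create(props)) {
            admin.createTopics(Collections.singletonList(
                    new NewTopic("customer-state", 6, (short) 3).configs(configs)))
                 .all().get();
        }
    }
}
```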

Costs

Of course the costs depend on how the instance is deployed, but even the simplest production-ready Pulsar setup requires a lot of resources. It's much bigger compared to Kafka, so it's also much more expensive. This is an important factor when considering Pulsar for a rather small solution or for a startup.

Tiered Storage

Available in Pulsar, it lets you store older messages on cloud storage, which is cheaper than the brokers' local SSDs. This also allows scaling as far as your funds allow. In Kafka it's still a work in progress.

Monitoring

All Pulsar components expose metrics in the Prometheus format by default, yay! You don’t need to fight with things like jmx-exporter etc. Since you can consume Prometheus metrics in some other monitoring solutions like DataDog or Google StackDriver, this should meet almost everyone’s needs.

The Pulsar helm chart by default installs the Prometheus server and Grafana, which is a little bit of an overkill in my opinion (it can be disabled of course).

Multi-tenancy

This is a built-in feature at the topic level in Pulsar. Kafka requires setting up tenants yourself, for example by leveraging access control lists with dedicated topics.
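
In Pulsar the tenant and namespace are simply part of the topic name, so isolating tenants doesn't require extra machinery on the client side. A tiny sketch — the tenant, namespace and topic below are made up:

```java
import org.apache.pulsar.client.api.*;

public class TenantTopicSketch {
    public static void main(String[] args) throws PulsarClientException {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")          // assumed local broker
                .build();

        // Full topic name: {persistent|non-persistent}://tenant/namespace/topic
        Producer<byte[]> producer = client.newProducer()
                .topic("persistent://acme/billing/invoices")    // hypothetical tenant and namespace
                .create();

        producer.send("invoice created".getBytes());

        producer.close();
        client.close();
    }
}
```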

Geo-replication

Built into Pulsar, it allows protecting against entire data center failures, as well as increasing responsiveness for clients located in different parts of the world. With Kafka you have to rely on additional tooling like uReplicator, MirrorMaker 2 or Confluent's Multi-Region Clusters and Replicator.

Message acknowledging

In Kafka, messages are acknowledged at the consumer group level for each partition separately. Two consumers from the same consumer group cannot process messages from the same partition at the same time. That's the point of partitioning — it guarantees message ordering. This is by design.

Pulsar, on the other hand, allows throwing in multiple consumers on one topic and fetching messages in parallel, which can then be acknowledged individually. This is one of the features Pulsar was built for: topics as task queues, also known as scheduling.

Specifically, if you need to perform a long-running task on each element, partitioning the Kafka topic makes sense. It allows distributing the work to different consumers running in parallel. The problem this raises is that you need to know up-front how many partitions make sense, since this number is the upper limit of consumers you can attach to a topic within a consumer group.

Pulsar's selective acknowledgement is well suited for this particular use case, as explained in Creating Work Queues with Apache Kafka and Apache Pulsar, and as sketched below.
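
Here is a sketch of such a task queue — the topic, subscription name and timeout are invented: any number of workers can attach to the same shared subscription, and each message is acknowledged, or negatively acknowledged for redelivery, on its own:

```java
import java.util.concurrent.TimeUnit;
import org.apache.pulsar.client.api.*;

public class WorkQueueSketch {
    public static void main(String[] args) throws PulsarClientException {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")      // assumed local broker
                .build();

        // Start as many of these workers as the load requires — there is no partition limit.
        Consumer<byte[]> worker = client.newConsumer()
                .topic("long-running-tasks")                // hypothetical topic
                .subscriptionName("workers")                // hypothetical subscription
                .subscriptionType(SubscriptionType.Shared)
                .ackTimeout(10, TimeUnit.MINUTES)           // redeliver if a worker dies mid-task
                .subscribe();

        while (true) {
            Message<byte[]> task = worker.receive();
            try {
                runTask(task.getData());
                worker.acknowledge(task);                   // only this single message is acknowledged
            } catch (Exception e) {
                worker.negativeAcknowledge(task);           // redeliver this one; others are unaffected
            }
        }
    }

    private static void runTask(byte[] payload) {
        // long-running work goes here
    }
}
```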

For a more detailed comparison go to the lions' dens, but when reading the posts on Kafkaesque, keep the author's words in mind:

I know that it may sometimes sound like I am disrespecting Kafka, but I’m really just excited about Pulsar.

Confluent’s comparison:

Kafkaesque’s comparison:

Final thoughts

Pulsar has a lifetime ahead of it to prove itself a valuable gemstone in the solution architect's pocket, and we're keeping an eye on it.

From a technical point of view, each platform dominates the other in particular areas, but none of the differences seems to be a deal breaker.

Obviously, you should use the right tool for the job, and when you have to separate tenants, store older and less frequently accessed data on cheaper storage, easily replicate clusters across geographical locations, or consolidate queueing and streaming capabilities in one messaging system, then Pulsar has a clear advantage.

If it's a matter of confidence, setup, docs and use cases as well as support, Kafka is your choice. It does have its quirks — it may occasionally eat your data or introduce a change that weakens message ordering guarantees — but if you get to know them, read the release notes carefully and refresh your knowledge continuously, you end up working with a battle-proven platform that won't surprise you.

This post was written together with Grzegorz Kocur, our distributed systems and Kubernetes expert at SoftwareMill.

Get “Start with Apache Kafka eBook”

We’ve gathered our lessons learned while consulting clients and using Kafka in commercial projects.
