Who uses Apache Kafka, and why?

Michał Matłoka
SoftwareMill Tech Blog
6 min read · Jan 8, 2020


Picture by @laughayette

Over the last few years, Apache Kafka has been adopted by a lot of companies, and some claim it is one of the most popular tools of its kind. According to Enlyft, more than 20,500 companies reportedly use Kafka in their tech stacks, including Uber, Shopify, and Spotify.

What developers using Kafka (according to StackShare) value most about this technology is that it enables high-throughput processing, is distributed and scalable, offers high performance, is durable, and supports a publish-subscribe mechanism.

Kafka has a range of applications, from simple message passing, through inter-service communication in microservices architectures, to full stream processing platforms. Today let’s dig into the details of which companies use Kafka and what their use cases for it are.

What is Kafka mainly used for?

Because of its fault tolerance and scalability, Kafka is often used in the big data space as a reliable way to ingest and move large streams of data very quickly. Let’s look into particular use cases where Kafka is the first choice.

1. Stream processing — thanks to Kafka Streams you can build a streaming platform that transforms input Kafka topics into output Kafka topics. It ensures that the application is distributed and fault-tolerant.

2. Website activity tracking — this is the original use case that LinkedIn had in the past and which triggered the invention of Kafka. The company still uses Kafka to track activity data and operational metrics in real-time.

3. Metrics collection and monitoring — Kafka can easily be combined with a real-time monitoring application that reads from Kafka topics.

4. Log aggregation — you can publish logs to Kafka topics and store them in a Kafka cluster. Thanks to that, logs can easily be aggregated or processed using Apache Kafka.

5. Real-time analytics — Kafka can be used for real-time analytics as it processes data as soon as it becomes available. It transmits data from producers to data handlers and on to data storage.

6. Microservices — Kafka can be used in microservices as well. They benefit from Kafka by using it as a central intermediary that lets them communicate with each other, taking advantage of the publish-subscribe model: it’s the receiver that decides asynchronously which events to receive. As a result, Kafka-centric applications are more reliable and scalable compared with architectures without Kafka.
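The publish-subscribe decoupling described in the list above can be sketched in plain Python. This is a toy in-memory stand-in, not the real Kafka API: topics are append-only logs, and each consumer group tracks its own offset, so different consumers read the same data independently.

```python
from collections import defaultdict

class ToyBroker:
    """In-memory stand-in for a Kafka broker: topics are append-only logs."""
    def __init__(self):
        self.topics = defaultdict(list)   # topic -> ordered log of messages
        self.offsets = defaultdict(int)   # (group, topic) -> next offset to read

    def publish(self, topic, message):
        self.topics[topic].append(message)

    def poll(self, group, topic):
        """Each consumer group keeps its own offset, so it decides
        independently which events it has already consumed."""
        key = (group, topic)
        log = self.topics[topic]
        new = log[self.offsets[key]:]
        self.offsets[key] = len(log)
        return new

broker = ToyBroker()
broker.publish("orders", {"id": 1})
broker.publish("orders", {"id": 2})
print(broker.poll("billing", "orders"))   # both messages
print(broker.poll("billing", "orders"))   # [] - billing is caught up
print(broker.poll("shipping", "orders"))  # independent group still sees both
```

The key property mirrored here is that publishing is completely decoupled from consumption: the producer never knows which services read the topic, and a new consumer group can start reading without any change on the producer side.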

Who needs Kafka?

There are a number of benefits that you get with Kafka. First, it’s easy to understand how it works. Secondly, it’s reliable, fault-tolerant, and scales well. Let’s look into the details to see who can benefit from it the most.

Fault tolerance and scaling through clustering

If fault tolerance is critical to your application and you know you might need to scale it in the future, Kafka’s clustered design is the answer. Kafka can automatically rebalance the processing load between consumers, so even if the message load increases drastically you’re safe: you can add more nodes and consumers, provided you planned ahead and defined enough partitions.
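The "enough partitions" caveat can be sketched with two toy functions: one hashing a message key onto a partition (the real Kafka client uses murmur2; md5 here is just a deterministic stand-in), and one assigning partitions round-robin to consumers in a group, which is a simplification of Kafka's actual assignors.

```python
import hashlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Map a message key to a partition. The real Kafka default
    partitioner uses murmur2; md5 is a deterministic stand-in."""
    return int.from_bytes(hashlib.md5(key).digest()[:4], "big") % num_partitions

def assign(partitions: int, consumers: list) -> dict:
    """Round-robin sketch of group assignment: each partition goes to
    exactly one consumer, so consumers beyond the partition count idle."""
    assignment = {c: [] for c in consumers}
    for p in range(partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment

# With 6 partitions, scaling from 2 to 3 consumers rebalances the load...
print(assign(6, ["c1", "c2"]))        # 3 partitions each
print(assign(6, ["c1", "c2", "c3"]))  # 2 partitions each
# ...but a 7th consumer gets nothing: parallelism is capped by partitions.
print(assign(6, ["c%d" % i for i in range(1, 8)]))
```

This is why the partition count has to be chosen generously up front: it is the hard upper bound on how many consumers in one group can work in parallel.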

Part of the open-source ecosystem

Apache Kafka is an open-source distributed stream processing platform that’s been designed for relatively easy connectivity with other open-source systems using Kafka Connect. Thanks to that your architecture can benefit from the whole ecosystem of ready-made connectors.
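To make the connector idea concrete, here is a sketch of a declarative connector configuration. The connector class is the FileStreamSource connector that ships with Kafka; the connector name, file path, and topic are made-up examples. In practice this JSON body would be registered by POSTing it to the Kafka Connect REST API.

```python
import json

# Hypothetical example: stream lines of a local file into a topic using
# the FileStreamSource connector bundled with Kafka. The name, file
# path, and topic are illustrative, not from the article.
connector = {
    "name": "demo-file-source",
    "config": {
        "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
        "tasks.max": "1",
        "file": "/var/log/app/events.log",
        "topic": "app-events",
    },
}

# This is the JSON you would POST to Connect's REST endpoint (/connectors).
print(json.dumps(connector, indent=2))
```

Ready-made connectors for databases, object stores, and search engines follow the same shape: you swap the `connector.class` and its specific settings, with no custom ingestion code.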

Flexibility

You can use Kafka for most types of content (as long as the messages are not too big), as you can easily add different types of producers and consumers to the system. That’s why even if your business grows or changes, you don’t need to rewrite the whole architecture.

Controllable access

What you get with Kafka are built-in authentication and authorization mechanisms. As producers and consumers are set up to write to and read from specific topics, you can manage access to data through one centralized mechanism.

Activision

Do you know the computer game series Call of Duty? Activision is the company that created it. In one of their presentations they show what problems they had with Kafka and how they overcame them.

Summary

Activision has over 1,000 topics in their Kafka cluster and handles between 10k and 100k messages per second. Various information is sent, including gameplay stats like shooting events and death locations. Naming the topics was a challenge, but they came to the conclusion that a name should not express who produces or consumes the data, but rather should indicate the data type. Activision uses various data formats and has its own Schema Registry, written in Python and based on Cassandra. They use message envelopes built with Protobuf.
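Activision's actual Protobuf envelope is not public, but the idea can be sketched with a hypothetical Python equivalent: a generic wrapper carrying a type identifier plus an opaque serialized payload, so one topic can carry many event types and consumers can dispatch on type. JSON stands in for Protobuf here.

```python
import json
from dataclasses import dataclass

@dataclass
class Envelope:
    """Hypothetical envelope in the spirit of Activision's Protobuf one:
    a type identifier plus an opaque payload."""
    event_type: str   # names the data type, not the producer or consumer
    payload: bytes    # serialized event body (Protobuf in their case; JSON here)

def wrap(event_type: str, event: dict) -> Envelope:
    return Envelope(event_type, json.dumps(event).encode())

def unwrap(env: Envelope) -> dict:
    return json.loads(env.payload)

# Hypothetical gameplay event, echoing the stats mentioned above.
env = wrap("gameplay.death", {"player": "p1", "x": 10.5, "y": -3.2})
print(env.event_type, unwrap(env))
```

Note how the envelope encodes the same naming lesson Activision drew: the type string describes the data, which keeps topics reusable as producers and consumers change.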

Video

Slides

Tinder

Tinder, a dating app, leverages Kafka for multiple business purposes. Various processes are based on Kafka Streams. Among them you can find:

  • notifications scheduling for onboarding users (e.g. to upload a profile photo),
  • analytics,
  • content moderation,
  • recommendations,
  • user activation,
  • user timezone update process,
  • notifications,
  • and others.

Tinder sends over 86B events per day, which gives around 40 TB of data per day (info from 2018). Kafka allowed them to save over 90% compared to AWS SQS/Kinesis. For more information, take a look at their presentation from Kafka Summit 2018.
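A quick back-of-envelope check shows what those two figures imply per event and per second (rough arithmetic only, not numbers from the talk):

```python
events_per_day = 86e9          # 86B events (2018 figure)
bytes_per_day = 40e12          # ~40 TB/day
avg_event_size = bytes_per_day / events_per_day
events_per_second = events_per_day / 86_400  # seconds in a day
print(f"~{avg_event_size:.0f} bytes/event, ~{events_per_second / 1e6:.1f}M events/s")
```

So the workload averages out to roughly half a kilobyte per event at about a million events per second, sustained, which is the kind of small-message firehose Kafka's batched, partitioned log handles well.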

Watch the video at Confluent’s website.

Slides

Pinterest

Pinterest is visited by 200M+ users monthly. There are over 100B pins, and 2B+ ideas are searched monthly. Kafka is leveraged for multiple processes: every click, repin, or photo enlargement results in Kafka messages. Kafka Streams is used for content indexing, recommendations, spam detection and, most importantly, for real-time ads budget calculations.

Watch the video at Confluent’s website.

Slides

Uber

Uber requires a lot of real-time processing. They handle over a trillion messages per day (info from 2017!) across tens of thousands of topics, which results in data volumes measured in petabytes. Many processes are modeled using Kafka Streams, including such important ones as customer and driver matching, ETA calculations, and auditing.

From a technical point of view, Uber leverages their own REST Proxy, a fork of the Confluent one with improved performance and reliability. Kafka is used mostly in an at-least-once manner, so no data is lost, and batching capabilities are used to achieve better throughput. Data is divided into regional Kafka clusters and later replicated using their own tool called uReplicator.
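Uber's exact settings aren't public, but the at-least-once and batching behavior described above maps onto well-known Kafka producer configuration keys. The keys below are standard producer configs; the values are illustrative, not Uber's.

```python
# Producer settings trading a little latency for durability and throughput.
# Keys are real Kafka producer configs; values are illustrative only.
producer_config = {
    "acks": "all",                 # wait for all in-sync replicas: acked writes survive
    "retries": 2147483647,         # retry transient errors -> at-least-once delivery
    "enable.idempotence": "true",  # retries don't create duplicates within a session
    "batch.size": 65536,           # accumulate up to 64 KiB per partition batch
    "linger.ms": 10,               # wait up to 10 ms to fill a batch before sending
    "compression.type": "lz4",     # compress whole batches for extra throughput
}
for key, value in sorted(producer_config.items()):
    print(f"{key}={value}")
```

The first three settings implement "mostly at least once, so no data is lost"; the last three are the batching knobs that buy throughput at the cost of a few milliseconds of latency.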

Video

Slides

LinkedIn

Apache Kafka originated at LinkedIn. It was created to solve their challenges with systems related to monitoring, tracing, and user activity tracking. Nowadays LinkedIn handles 7 trillion messages per day, divided into 100,000 topics and 7M partitions, stored on over 4,000 brokers. They leverage REST Proxy for non-Java clients and Schema Registry for schema management. LinkedIn maintains its own patches and releases of Kafka, so that they can get some features earlier, before they are accepted into the official releases.

The latest info about how LinkedIn leverages Kafka can be found in the blog post “How LinkedIn customizes Apache Kafka for 7 trillion messages per day” from October 2019.

Does Netflix still use Kafka?

Netflix leverages multiple Kafka clusters together with Apache Flink for stream processing and handles trillions of messages per day. Interestingly, Netflix has chosen to use two replicas per partition and to additionally enable unclean leader election. This improves availability but can cause data loss, which is one of the reasons Netflix created its own tracing tool, Inca, which can detect lost data. Inca offers related metrics and validates whether pieces of the infrastructure deliver the required processing guarantees (e.g. at least once).
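The availability-over-durability tradeoff corresponds to two standard Kafka settings, sketched below as a plain dict. The config keys are real Kafka ones; that Netflix sets exactly these is inferred from the description above, not confirmed.

```python
# Settings matching the tradeoff described above. Keys are standard
# Kafka configs; attributing these exact values to Netflix is an
# inference from the text, not a confirmed detail.
topic_config = {
    "replication.factor": 2,                   # only two copies of each partition
    "unclean.leader.election.enable": "true",  # out-of-sync replica may take over
}
# Consequence: if the last in-sync replica dies, an out-of-sync one
# becomes leader and the partition stays writable, but messages it
# never received are silently dropped - exactly the loss a tracing
# tool like Inca is built to detect.
for key, value in topic_config.items():
    print(f"{key}={value}")
```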

To learn more about Inca, take a look at the Netflix blog post.

Conclusions

As you can see, Kafka is used by various companies, often in business processes involving large amounts of data. It is able to scale almost linearly, handling billions or even trillions of messages per day. Quite often you can see Kafka Streams or Schema Registry being used together with Kafka. If you’d like to see more of who uses Kafka, take a look at the Powered By section of the Kafka documentation.

Get “Start with Apache Kafka eBook” — the essentials for implementing a production Kafka.


IT Dev (Scala), Consultant, Speaker & Trainer | #ApacheKafka #KafkaStreams #ApacheCassandra #Lightbend Certified | Open Source Contributor | @mmatloka