TMWL April’20 — Kafka consumer lag, monitoring microservices, Hasura

Programmers share what they’ve learned

Maria Kucharczyk
SoftwareMill Tech Blog

--

Every month we share what we’ve learned in our team. In April Krzysiek, Michał and Marcin discovered:

  • how to really understand Kafka consumer lag — a metric mandatory to follow on most production systems.
  • how to handle memory issues when monitoring microservices.
  • how Hasura can take away the pain of repetition in CRUDs. No hand-crafting another collection of these HTTP endpoints — yay!

Kafka consumer lag nuances by Krzysiek Ciesielski

Strange monitoring readings on the production system in my project led me to a very interesting investigation. As a result, I learned some important details about how to really understand Kafka consumer lag and what pitfalls you can expect when monitoring this value.

Consumer lag is one of the most important metrics, mandatory to follow on most production systems. According to the official documentation, the records-lag-max metric is:

“The maximum lag in terms of number of records for any partition in this window. An increasing value over time is your best indication that the consumer group is not keeping up with the producers.”

What does it mean exactly?

  • The lag metric is only reported for active consumers. It should not be confused with the commit lag, which is a separate concept: the difference between the last offset committed by a consumer group and the current offset.
  • The offset lag can be observed with the CLI tool kafka-consumer-groups. It shows output similar to the following:
GROUP  TOPIC  PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
flume  t1     0          1               3               2
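The LAG column is simply the difference between the log-end offset and the last committed offset for a partition. As a quick sanity check, here is that arithmetic as a sketch (not tied to any Kafka client library):

```python
def commit_lag(current_offset: int, log_end_offset: int) -> int:
    """Offset lag for one partition: how far the committed position
    trails the end of the log."""
    return log_end_offset - current_offset

# The partition from the kafka-consumer-groups output above:
print(commit_lag(current_offset=1, log_end_offset=3))  # -> 2
```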

For many applications Kafka consumer group identifiers don’t change dynamically, so tracking the commit lag seems very useful. If a particular consumer group becomes slow or goes down entirely, we should see the lag grow and be able to react to it. This value can be checked via the CLI even when there are no active consumers.

  • However, this value is not available as a metric. What we can monitor is, as I mentioned, the consumer lag, whose value may be completely different.
  • First of all, when a consumer gets disconnected, no lag is reported. Don’t be surprised when you get notified about consumer errors but the lag is actually zero.
  • Another important issue is that many streaming libraries, like fs2-kafka and alpakka-kafka, buffer messages loaded from Kafka, and this buffering can reach 500 messages per stream stage. If your stream has buffered thousands of messages but for some reason processes them slowly, it pauses the consumer, which stops it from reporting the lag! As a consequence, the offset commit lag may grow quite large while the consumer lag metric stays at zero until the consumer is unpaused. This can lead to weird peaks in the lag graphs at quite unexpected timestamps.
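The effect above can be sketched with a toy model (all names here are hypothetical, for illustration only; real clients pause consumption through the consumer protocol, not a boolean flag):

```python
class ToyConsumer:
    """Toy model of a buffering consumer: while paused, the client
    reports zero lag even though unprocessed records pile up."""

    def __init__(self, log_end_offset: int):
        self.log_end_offset = log_end_offset
        self.committed = 0      # nothing processed/committed yet
        self.paused = False     # set when downstream buffers are full

    def reported_consumer_lag(self) -> int:
        # a paused (or disconnected) consumer reports no lag
        if self.paused:
            return 0
        return self.log_end_offset - self.committed

    def commit_lag(self) -> int:
        # the "real" lag, visible via kafka-consumer-groups
        return self.log_end_offset - self.committed


c = ToyConsumer(log_end_offset=10_000)
c.paused = True                   # stream buffer full, processing is slow
print(c.commit_lag())             # 10000 -- growing, but invisible...
print(c.reported_consumer_lag())  # 0     -- ...in the lag metric

c.paused = False                  # consumer unpauses
print(c.reported_consumer_lag())  # 10000 -- a sudden peak in the graph
```

This is exactly the pattern of a flat zero line followed by an unexpected spike once the consumer resumes.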

Make sure you understand the consumer lag metric correctly and how streaming may affect it.

Plus, something I found really inspiring:

I had a conversation with John A. De Goes, recorded live on YouTube as a Tech Talk, where John gave a detailed explanation of what is wrong with Functional Programming and how ZIO can address these pain points.

Kamon http4s and memory issues by Michał Matłoka

We develop a few microservices for one of our customers. Recently we added additional monitoring to one of them and suddenly… its heap started growing. It kept growing until at some point the instances got restarted due to memory issues. But why? We had done the same thing for the other services!

The issue turned out to be kamon-http4s, used in the project with the default settings. As the name says, it integrates http4s services with Kamon monitoring, letting you quickly add monitoring to every HTTP API execution. The problem is that, by default, every unique path creates a new Span and TagSet (see PathOperationNameGenerator), which is stored in one TrieMap… forever.

private val _instruments = TrieMap.empty[TagSet, InstrumentEntry]

If you have a REST API with a lot of unique resource identifiers, and those identifiers are passed in request URLs (as is usually done in REST APIs), then every such unique id causes the creation of a new TagSet. The TrieMap grows and grows.

However, it turns out that kamon-http4s already has a mechanism to prevent such situations. It is possible to define path mappings and mark which parts of paths are actually ids (here). This way there will be only a single entry in the TrieMap per endpoint.

As an alternative, you can define your own HttpOperationNameGenerator (see here), which controls the values used in the keys of the TrieMap.
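The underlying cardinality problem, and the normalization fix, can be sketched in a few lines. This is not Kamon’s actual API, just the idea: a registry keyed by operation name explodes when raw paths are the names, and stays bounded once ids are mapped to a placeholder.

```python
import re

# Hypothetical metrics registry keyed by operation name
# (not Kamon's actual data structure).
registry: dict[str, int] = {}

def record(operation_name: str) -> None:
    registry[operation_name] = registry.get(operation_name, 0) + 1

# Naive: the raw path is the operation name -> one entry per user id.
for uid in range(1000):
    record(f"/users/{uid}")
print(len(registry))  # -> 1000 entries, and counting

# Fix: normalize ids out of the path, as Kamon's path mappings let
# you do, so every request to the same endpoint shares one entry.
registry.clear()

def operation_name(path: str) -> str:
    return re.sub(r"/\d+", "/{id}", path)

for uid in range(1000):
    record(operation_name(f"/users/{uid}"))
print(len(registry))  # -> 1 entry: "/users/{id}"
```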

Quick & easy GraphQL API layer on top of your PostgreSQL DB with Hasura by Marcin Baraniecki

Exposing API endpoints for common CRUD (create, read, update, delete) operations on the data living in your database is a tedious & repetitive task. Moreover, a classic REST approach bakes assumptions about the data model into every read. It is a very common scenario that while all the client application wanted to read was a user’s first and last name, it also got the date of birth, all the user’s pets’ names, their shoe size and a list of friends’ ids.

Hasura takes the pain of repetition away, plus it exposes a GraphQL API on top of your PostgreSQL database.

Anytime I see new, fancy tools and their taglines like “(..) auto-generates (..) instantly” or “go live in seconds”, I’m always hesitant. But with Hasura, it was close to being a silver bullet for the proof-of-concept project I’ve been working on recently.

Not only am I free from hand-crafting another collection of these HTTP endpoints, but it also enables me to easily plug in authentication & authorization rules for data reading & mutation. And yes, it all “just works” straight out of the box.

A GraphQL API auto-generated on top of the data model means read-schema freedom, enabling multiple types of data consumers to ask for exactly what they want, no less, no more.
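To make “exactly what they want” concrete, here is a sketch of the kind of query a client would POST to Hasura’s /v1/graphql endpoint. The table and field names (users, first_name, last_name) are hypothetical; the point is that the response contains only the selected fields:

```python
import json

# A client that only needs names asks only for names.
query = """
query UserNames {
  users {
    first_name
    last_name
  }
}
"""

# The JSON payload to POST to http://<hasura-host>/v1/graphql
# (with an appropriate auth header for your setup).
payload = json.dumps({"query": query})
print(payload)
```

No date of birth, pets, or friends’ ids come back unless the query asks for them.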

Remote schemas, on the other hand, enable creating a network of data hubs (including 3rd party GraphQL APIs) “hidden behind” a single Hasura instance, acting as an API gateway for client apps.

Try out Hasura in seconds, assuming you have access to a PostgreSQL database. It comes with a user-friendly interface running in the browser, letting you easily customize, control, and query all the data you’d like to serve, possibly eliminating a great deal of boilerplate from your codebase.

And what have you learned in April? Let us know! :)

BTW, we are always looking for outstanding professionals to join our team!

Questions? Ask us anything about remote work, what cooperation with us looks like, what projects we have, or about anything else - on the dedicated Slack channel 💡
