Kafka for System Design Interviews

By Evan King, Co-founder of www.hellointerview.com

5 min read · Jul 1, 2024

Intro

There is a good chance you’ve heard of Kafka. It’s popular. In fact, according to their website, it’s used by 80% of the Fortune 100. If it’s good enough to help scale the largest companies in the world, it’s probably good enough for your next system design interview. It’s also one of the top 5 technologies we see used in design interviews.

Apache Kafka is an open-source distributed event streaming platform that can be used either as a message queue or as a stream processing system. Kafka excels in delivering high performance, scalability, and durability. It’s engineered to handle vast volumes of data in real time, ensuring that no message is ever lost and that each piece of data is processed as swiftly as possible.

In this deep dive, we’re going to take a top-down approach, starting with a zoomed-out view of Kafka and progressing into more and more detail. If you know the basics, feel free to skip ahead to the more advanced sections.

A Motivating Example

It’s the World Cup (my personal favorite competition). And we run a website that provides real-time statistics on the matches. Each time a goal is scored, a player is booked, or a substitution is made, we want to update our website with the latest information.

Events are placed on a queue when they occur. We call the server or process responsible for putting these events on the queue the producer. Downstream, we have a server that reads events off the queue and updates the website. We call this the consumer.

Now, imagine the World Cup expanded from just the top 48 teams to a hypothetical 1,000-team tournament, and all the games are now played at the same time. The number of events has increased significantly, and our single server hosting the queue is struggling to keep up. Similarly, our consumer feels like it has its mouth under a firehose and is crashing under the load.

We need to scale the system by adding more servers to distribute our queue. But how do we ensure that the events are still processed in order?

If we were to randomly distribute the events across the servers, we would have a mess on our hands. Goals would be scored before the match even started, and players would be booked for fouls they haven’t committed yet.

A logical solution is to distribute the items in the queue based on the game they are associated with. This way, all events for a single game are processed in order because they exist on the same queue. This is one of the fundamental ideas behind Kafka: messages sent and received through Kafka require a user-specified distribution strategy.
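To make this concrete, here’s a minimal producer sketch in Java. The broker address, the topic name (soccer-events), the class name, and the event payloads are all made up for illustration. Kafka’s default distribution strategy hashes the message key, so keying every event by its game ID keeps each game’s events together and in order:

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import java.util.Properties;

    public class MatchEventProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // The key is the game ID. The default partitioner hashes the key, so
                // every event for game-42 goes to the same queue (partition), in order.
                producer.send(new ProducerRecord<>("soccer-events", "game-42", "GOAL: 2-1, 67th minute"));
                producer.send(new ProducerRecord<>("soccer-events", "game-42", "YELLOW CARD: #10, 71st minute"));
            }
        }
    }

Records with the same key always land on the same queue (Kafka calls these partitions, more on that below), which is exactly the per-game ordering we wanted.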

But what about our consumer? It’s still overwhelmed. It’s easy enough to add more consumers, but how do we make sure that each event is only processed once? We can group consumers together into what Kafka calls a consumer group. Within a consumer group, each event is guaranteed to be processed by only one consumer.
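Here’s what one member of that group might look like, again as a sketch with an assumed broker address, topic name, and group id. Every consumer started with the same group.id joins the same group, and Kafka divides the topic’s partitions among the members:

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    public class ScoreboardConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "soccer-site-updaters"); // same id = same group
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("soccer-events"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("game=%s event=%s%n", record.key(), record.value());
                    }
                }
            }
        }
    }

Start three copies of this program and Kafka splits the partitions among them; if one crashes, its share is automatically rebalanced to the survivors.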

Lastly, we’ve decided that we want to expand our hypothetical World Cup to more sports, like basketball. But we don’t want our soccer website to cover basketball events, and we don’t want our basketball website to cover soccer events. So we introduce the concept of topics. Each event is associated with a topic, and consumers can subscribe to specific topics. Therefore, our consumers that update the soccer website only subscribe to the soccer topic, and our consumers that update the basketball website only subscribe to the basketball topic.
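Topics are usually created ahead of time. Here’s a sketch using Kafka’s AdminClient, with illustrative topic names, partition counts, and replication factor:

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;
    import java.util.Arrays;
    import java.util.Properties;

    public class CreateSportTopics {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");

            try (AdminClient admin = AdminClient.create(props)) {
                // One topic per sport; each website's consumers subscribe only to theirs.
                admin.createTopics(Arrays.asList(
                        new NewTopic("soccer-events", 3, (short) 1),     // 3 partitions, replication factor 1
                        new NewTopic("basketball-events", 3, (short) 1)
                )).all().get(); // block until the cluster confirms
            }
        }
    }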

Basic Terminology and Architecture

The example is great, but let’s define Kafka a bit more concretely by formalizing some of the key terms and concepts introduced above.

A Kafka cluster is made up of multiple brokers. These are just individual servers (they can be physical or virtual). Each broker is responsible for storing data and serving clients. The more brokers you have, the more data you can store and the more clients you can serve.
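If you’re curious what your cluster looks like, the AdminClient can list its brokers. A quick sketch, assuming a broker reachable at localhost:9092:

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.common.Node;
    import java.util.Properties;

    public class ListBrokers {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");

            try (AdminClient admin = AdminClient.create(props)) {
                // Each Node returned here is one broker in the cluster.
                for (Node node : admin.describeCluster().nodes().get()) {
                    System.out.printf("broker %d at %s:%d%n", node.id(), node.host(), node.port());
                }
            }
        }
    }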

Each broker has a number of partitions. Each partition is an ordered, immutable sequence of messages that is continually appended to — think of it like a log file. Partitions are the way Kafka scales, as they allow messages to be consumed in parallel.

A topic is just a logical grouping of partitions. Topics are the way you publish and subscribe to data in Kafka. When you publish a message, you publish it to a topic, and when you consume a message, you consume it from a topic. Topics are always multi-producer; that is, a topic can have zero, one, or many producers that write data to it.

So what is the difference between a topic and a partition?

A topic is a logical grouping of messages. A partition is a physical grouping of messages. A topic can have multiple partitions, and each partition can be on a different broker. Topics are just a way to organize your data, while partitions are a way to scale your data.
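One way to see the topic/partition relationship in action is to send a few records with the same key and print where they land. Same key means same partition, and the offset within that partition only ever increases (broker, topic, and class name assumed as before):

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.clients.producer.RecordMetadata;
    import java.util.Properties;

    public class WhereDidItLand {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                for (int i = 0; i < 3; i++) {
                    RecordMetadata md = producer
                            .send(new ProducerRecord<>("soccer-events", "game-42", "event-" + i))
                            .get(); // block until the broker confirms the write
                    // Same key -> same partition; the offset within it only grows.
                    System.out.printf("partition=%d offset=%d%n", md.partition(), md.offset());
                }
            }
        }
    }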

Last up, we have our producers and consumers. Producers are the ones who write data to topics, and consumers are the ones who read data from topics. While Kafka exposes a simple API for both producers and consumers, the creation and processing of messages is on you, the developer. Kafka doesn’t care what the data is; it just stores and serves it.

Importantly, you can use Kafka as either a message queue or a stream. Frankly, the distinction here is minor. The only meaningful difference is in how consumers interact with the data. In a message queue, consumers read messages from the queue and then acknowledge that they have processed them. In a stream, consumers read messages and process them without acknowledging each one, which allows for more complex processing of the data.
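In Kafka’s API, that “acknowledgment” takes the form of an offset commit. Here’s a sketch of queue-style consumption, where we disable auto-commit and commit only after a batch has actually been processed (broker, topic, group id, and class name assumed as before):

    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    public class QueueStyleConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "soccer-site-updaters");
            props.put("enable.auto.commit", "false"); // we acknowledge explicitly
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("soccer-events"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
                    records.forEach(r -> System.out.println(r.value())); // "process" the batch
                    // Committing offsets is the queue-style acknowledgment: crash before
                    // this line and the uncommitted batch is redelivered on restart.
                    consumer.commitSync();
                }
            }
        }
    }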

Head over to Hello Interview to read the rest of the breakdown!

(don’t worry, it’s totally free)
