Apache Kafka is an open-source stream-processing platform that aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. This paper gives a brief overview of Apache Kafka.
Apache Kafka is a distributed data store optimized for ingesting and processing streaming data in real time. Streaming data is data that is continuously generated by many data sources, often thousands, which typically send their records simultaneously. A streaming platform needs to handle this constant influx of data and process it sequentially and incrementally. In simple terms, Kafka provides a platform for building distributed applications that work on such data.
Apache Kafka provides three key capabilities that are essential for a streaming platform:
- Publish and subscribe to streams of records, similar to a message queue.
- Store streams of data in a fault-tolerant way.
- Process streams of records as they occur.
Kafka runs as a cluster on one or more servers, known as a Kafka cluster. Kafka stores messages as streams of records in categories called topics. Each record consists of a key, a value, and a timestamp.
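The record and topic model described above can be pictured with a small in-memory sketch. This is only an illustration: `Record`, `topics`, and `publish` are hypothetical names, not part of any Kafka client library, and a plain dictionary stands in for the cluster.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Record:
    """One record in a topic: a key, a value, and a timestamp."""
    key: Optional[bytes]
    value: bytes
    # Defaults to the current wall-clock time in milliseconds.
    timestamp_ms: int = field(default_factory=lambda: int(time.time() * 1000))


# Hypothetical in-memory stand-in for a cluster: topic name -> records.
topics: Dict[str, List[Record]] = {}


def publish(topic: str, record: Record) -> None:
    """Append a record to the named topic, creating the topic if needed."""
    topics.setdefault(topic, []).append(record)
```

Calling `publish("orders", Record(key=b"order-1", value=b"created"))` would add one timestamped record to an `orders` topic in this toy model.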
- Reliability – In Kafka, records are distributed, replicated, and fault tolerant.
- High throughput – Kafka can handle large volumes of data and supports a throughput of thousands of messages per second.
- Low latency – Kafka can handle these messages with latencies in the range of milliseconds, as demanded by most new use cases.
- Fault tolerance – Kafka tolerates the failure of nodes/machines within the cluster.
- Durability – Messages are persisted on disk and replicated across brokers, which guards against message loss.
- Scalability – Kafka can be scaled on the fly without incurring downtime.
3. Apache Kafka Architecture
Kafka has four core APIs, as shown below.
Figure-1 : Apache Kafka Architecture
- Producer API – Allows an application to publish a stream of records to one or more Kafka topics.
- Consumer API – Allows an application to subscribe to one or more topics and process the stream of records produced to them.
- Streams API – Allows an application to act as a stream processor, consuming an input stream of records from one or more topics and producing an output stream of records to one or more topics.
- Connector API – Allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector can capture data from an existing database and write it to a Kafka topic, and vice versa.
3.1. Kafka Components
Let us look at the core components through which Kafka provides a reliable messaging service.
Figure-2 : Kafka Topic
1. Kafka Topic
A topic is a stream of messages belonging to a particular category; data is stored in topics. Topics are divided into partitions, and Kafka maintains at least one partition for each topic. The order of records within a partition is immutable. Topics are multi-subscriber by default: a topic can have zero or many consumers subscribed and waiting for data to be written to it.
2. Partition
A partition is an ordered, immutable sequence of records; records are appended to the partition in the order in which they are written. Each record in a partition is assigned a sequential ID number called the offset, which uniquely identifies the record within that partition.
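A minimal sketch of this idea, assuming a Python list as the storage: the offset of a record is simply its position in an append-only log. The `Partition` class and its methods are hypothetical names for illustration, not Kafka's implementation.

```python
class Partition:
    """Append-only log; a record's offset is its position in the log."""

    def __init__(self):
        self._log = []

    def append(self, record):
        """Append a record and return the offset assigned to it."""
        self._log.append(record)
        return len(self._log) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records records starting at the given offset."""
        if offset < 0:
            raise ValueError("offset must be non-negative")
        return self._log[offset:offset + max_records]
```

Because records are only ever appended, an offset handed out once keeps pointing at the same record in this model, which is what lets a reader resume from a remembered position.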
3. Producer
Kafka producers are the originators of messages; a producer can write messages to one or more Kafka topics. Producers send messages to brokers. On receiving a message, the broker appends it to a partition of the topic. The producer decides which partition a record goes to: it can name a specific partition explicitly, or let the client pick one, typically by hashing the record key.
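One way to picture how a record ends up in a partition is the sketch below. It is an assumption-laden illustration: `choose_partition` is a hypothetical helper, CRC32 is only a stand-in for whatever hash function a real client uses, and the round-robin fallback for keyless records is a simplification.

```python
import zlib
from itertools import count

# Shared counter used to rotate keyless records across partitions.
_round_robin = count()


def choose_partition(num_partitions, key=None, partition=None):
    """Pick a partition index for a record.

    An explicit partition wins; otherwise a key is hashed so that equal keys
    always map to the same partition; keyless records are spread round-robin.
    """
    if partition is not None:
        if not 0 <= partition < num_partitions:
            raise ValueError("partition out of range")
        return partition
    if key is not None:
        return zlib.crc32(key) % num_partitions
    return next(_round_robin) % num_partitions
```

Hashing the key means all records for the same key land in the same partition, so their relative order is preserved there.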
4. Consumer
Kafka consumers subscribe to one or more topics in order to read messages from them. A consumer reads data from the brokers for the topics to which it has subscribed.
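The subscribe-then-read cycle can be sketched as follows. This is a toy model, not the Kafka consumer API: the `Consumer` class is hypothetical, a dictionary of lists stands in for the brokers, and each topic is treated as having a single partition.

```python
class Consumer:
    """Toy consumer that remembers the next offset to read per topic."""

    def __init__(self, logs_by_topic):
        self._logs = logs_by_topic   # topic name -> list of records
        self._positions = {}         # topic name -> next offset to read

    def subscribe(self, topics):
        """Start tracking the given topics from offset 0."""
        for topic in topics:
            self._positions.setdefault(topic, 0)

    def poll(self, max_records=100):
        """Return new (topic, offset, record) tuples and advance positions."""
        out = []
        for topic, pos in self._positions.items():
            batch = self._logs.get(topic, [])[pos:pos + max_records]
            out.extend((topic, pos + i, rec) for i, rec in enumerate(batch))
            self._positions[topic] = pos + len(batch)
        return out
```

A second call to `poll()` returns only records written since the previous call, and records in topics the consumer has not subscribed to are never returned.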
5. Broker
Brokers are the servers responsible for storing the published data. Producers and consumers connect to a list of brokers, through which data is published and consumed. Each broker may hold one or more partitions per topic. For example, if a topic has X partitions and the cluster has X brokers, each broker will typically be the leader (master) for one partition and a follower for some of the others, and the partitions are replicated across brokers.
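The X-partitions-on-X-brokers example can be made concrete with a simple round-robin placement. This is a sketch under stated assumptions: `assign_replicas` is a hypothetical function, and real clusters may place replicas differently.

```python
def assign_replicas(num_partitions, brokers, replication_factor):
    """Place each partition's replicas on consecutive brokers.

    The first replica of a partition is treated as its leader and the
    remaining replicas as followers.
    """
    if not 1 <= replication_factor <= len(brokers):
        raise ValueError("replication factor must be between 1 and the broker count")
    assignment = {}
    for p in range(num_partitions):
        replicas = [brokers[(p + i) % len(brokers)]
                    for i in range(replication_factor)]
        assignment[p] = {"leader": replicas[0], "followers": replicas[1:]}
    return assignment
```

With three partitions, three brokers, and a replication factor of 2, each broker leads exactly one partition and follows exactly one other in this scheme.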
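6. ZooKeeper
Kafka uses ZooKeeper to manage and coordinate the Kafka brokers. The ZooKeeper service is mainly used to form and maintain the Kafka cluster: it tracks which brokers are alive, stores cluster metadata, and supports the election of partition leaders. Older Kafka versions also stored each consumer's last read offset in ZooKeeper so that the consumer could resume from where it left off; newer versions keep these offsets in an internal Kafka topic. The replication of data itself is performed by the brokers, not by ZooKeeper. Each topic may have one or more partitions, and for each partition one broker is chosen as the leader (master) through this coordination, while other brokers hold replicas of the same partition and act as followers. ZooKeeper is mandatory in the ZooKeeper-based deployments described in this paper; recent Kafka releases can instead run in KRaft mode, which removes the ZooKeeper dependency.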
7. Leader
The leader is the node (one of the brokers) in the Kafka cluster that is responsible for all reads and writes for a particular partition of a topic. A broker can be the leader for some partitions and a follower for others. Every partition has one broker acting as the leader and the other replica brokers acting as followers.
8. Follower
A follower is a node in the Kafka cluster that replicates the leader's data for a given partition of a topic. Followers exist to keep the data replicated: if the current leader fails, one of the followers automatically becomes the new leader. A follower pulls messages from the leader and appends them to its own data store.
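The leader and follower roles can be sketched in a few lines. This is a simplified illustration with hypothetical function names: it treats every live replica as eligible for leadership, whereas a real cluster is more selective about which followers are caught up.

```python
def elect_leader(replicas, live_brokers, current_leader):
    """Keep the current leader if it is alive; otherwise promote the first
    live replica. Returns None when no replica is alive."""
    if current_leader in live_brokers:
        return current_leader
    for broker in replicas:
        if broker in live_brokers:
            return broker
    return None


def follower_fetch(leader_log, follower_log):
    """Pull the records the follower is missing from the leader's log.

    Returns the number of records copied.
    """
    missing = leader_log[len(follower_log):]
    follower_log.extend(missing)
    return len(missing)
```

The follower drives replication here by pulling from the leader, matching the pull-based description above.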
9. Consumer Group
Each consumer belongs to a consumer group, identified by a group ID. A consumer group is the set of consumers that share the same group ID. Within a group, each partition is read by exactly one consumer instance at a time. When there is more than one consumer group, one instance from each group can read from the same partition. If the number of consumers in a group exceeds the number of partitions, some of the consumers will be inactive.
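The assignment rule, including the idle consumers that appear when a group is larger than the partition count, can be sketched with a round-robin spread. `assign_partitions` is a hypothetical helper; real clients offer several assignment strategies.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions over the consumers of one group, round-robin.

    Each partition goes to exactly one consumer; consumers left with an
    empty list are idle.
    """
    if not consumers:
        return {}
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment
```

For three partitions and four consumers, the fourth consumer receives an empty list, which corresponds to the inactive consumer described above.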
4. Kafka Cluster
A Kafka cluster typically consists of multiple broker nodes, which makes reliable replication of data possible. Brokers do not maintain the cluster state on their own; they rely on ZooKeeper for it. ZooKeeper is used for leader election for broker and topic-partition pairs. ZooKeeper also propagates topology changes to the brokers, so that each node in the cluster knows when a broker has joined or died and when a topic has been added or removed. A cluster can run on a single broker, but replication requires more than one. The diagram below shows a Kafka cluster.
Figure-3 : Kafka Cluster
5. Typical Use-cases
5.1. Kafka as a Messaging System
Traditionally, messaging systems follow one of two models:
- Queuing – A group of consumers reads from a server, and each record goes to exactly one of them.
- Publish-Subscribe – Each published record is broadcast to all consumers.
Kafka combines both approaches to provide more flexibility in how data is consumed, generalizing the two concepts with consumer groups. As with a queue, a consumer group lets you divide processing over a collection of processes (the consumers in the group). As with publish-subscribe, Kafka lets you broadcast messages to multiple consumer groups.
The advantage of this model is that every topic has both properties: it can scale processing and it is multi-subscriber.
Kafka provides both ordering and load balancing of records over a pool of consumer processes. It does this by assigning the partitions of a topic to the consumers in a consumer group so that each partition is consumed by exactly one consumer within the group. This ensures that the consumer is the only reader of that partition in its group and consumes the data in order. Ordering is guaranteed only within a partition, not across the partitions of a topic.
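How consumer groups give both queue and publish-subscribe behavior can be seen in a small simulation. The `deliver` function, the group names, and the modulo-based partition ownership are assumptions made for illustration only.

```python
def deliver(records, groups):
    """Simulate delivery of (partition, value) records to consumer groups.

    Every group sees every record (publish-subscribe), but inside a group
    each record goes only to the consumer that owns its partition (queue).
    """
    received = {group: {member: [] for member in members}
                for group, members in groups.items()}
    for partition, value in records:
        for group, members in groups.items():
            # Hypothetical ownership rule: partition index modulo group size.
            owner = members[partition % len(members)]
            received[group][owner].append(value)
    return received
```

With a two-member group and a one-member group reading the same topic, the first group splits the records between its members while the second group's single consumer receives all of them, each in per-partition order.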
5.2. Kafka as a Storage System
Kafka can be used as a storage system. Data produced to Kafka topics is written to disk and replicated for fault tolerance. Producers can wait for an acknowledgement, so that a write is not considered complete until it is fully replicated and guaranteed to persist even if the server written to fails. Because Kafka takes storage seriously and lets clients control their read position, you can think of it as a kind of special-purpose distributed file system dedicated to high-performance, low-latency commit log storage, replication, and propagation.
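The "not complete until fully replicated" guarantee can be sketched as follows. This toy model is loosely inspired by the behavior Kafka exposes through the producer `acks` setting and the `min.insync.replicas` setting, but the `ReplicatedLog` class and its methods are hypothetical, and replication is done synchronously here rather than by followers pulling.

```python
class ReplicatedLog:
    """Toy partition whose writes are acknowledged only once every
    in-sync replica has stored the record."""

    def __init__(self, replica_ids, min_in_sync=2):
        self.leader = replica_ids[0]            # first replica acts as leader
        self.logs = {r: [] for r in replica_ids}
        self.in_sync = set(replica_ids)
        self.min_in_sync = min_in_sync

    def mark_out_of_sync(self, replica_id):
        """Drop a lagging or failed follower from the in-sync set."""
        if replica_id == self.leader:
            raise ValueError("this sketch does not model leader failure")
        self.in_sync.discard(replica_id)

    def produce(self, record):
        """Store the record on all in-sync replicas and return its offset.

        Raises RuntimeError, without writing anything, if too few replicas
        are in sync to give the durability guarantee.
        """
        if len(self.in_sync) < self.min_in_sync:
            raise RuntimeError("not enough in-sync replicas")
        for replica in self.in_sync:
            self.logs[replica].append(record)
        return len(self.logs[self.leader]) - 1
```

The design choice shown is the trade-off between availability and durability: a stricter `min_in_sync` rejects writes sooner when replicas fail, but every acknowledged record then exists on at least that many replicas.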
5.3. Kafka for Stream Processing
Stream processing means taking continuous streams of data from input topics, performing some processing on this input, and producing continuous streams of data to output topics. For instance, a retail application might take in input streams of sales and shipments and output a stream of reorders and price adjustments computed from this data.
Simple processing can be done directly with the producer and consumer APIs. For more complex processing, Kafka provides a fully integrated Streams API, which allows building applications that do non-trivial processing or join streams together.
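The retail example can be sketched as a generator that consumes an input stream of events and emits an output stream of reorders. This does not use the Streams API; the event format, the `reorders` function, the starting stock level, and the reorder rule are all assumptions chosen for illustration.

```python
def reorders(events, initial_stock=10, reorder_point=5):
    """Turn a stream of ('sale' | 'shipment', sku, quantity) events into a
    stream of (sku, quantity_to_order) reorders.

    Stock levels are the running state; a reorder is emitted whenever a sale
    drops a product below the reorder point.
    """
    stock = {}
    for kind, sku, qty in events:
        level = stock.get(sku, initial_stock)
        if kind == "sale":
            level -= qty
        elif kind == "shipment":
            level += qty
        else:
            raise ValueError("unknown event kind: %r" % (kind,))
        stock[sku] = level
        if kind == "sale" and level < reorder_point:
            # Hypothetical rule: order enough to reach twice the reorder point.
            yield (sku, 2 * reorder_point - level)
```

Because the function is a generator, it processes each event as it arrives and keeps only the running stock levels as state, which mirrors the incremental nature of stream processing.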
6. Why Kafka
Kafka is a publish-subscribe messaging system. It provides high throughput, reliability, and replication, which has made it a replacement for conventional message brokers such as RabbitMQ in many applications. Kafka relies on the filesystem for storage and caching, writing sequentially to disk and using the operating system's page cache, which makes it fast. Replication guards against data loss and makes Kafka fault-tolerant.
7. Need for Kafka
Kafka is a distributed platform for handling real-time data feeds. It provides low-latency message delivery and guarantees fault tolerance in the event of machine failures. It can handle a large number of diverse consumers. Kafka is very fast: published benchmarks have reported around 2 million writes per second. Kafka persists all data to disk, which in practice means that writes first go to the page cache of the OS and are flushed to disk from there.
8. Kafka vs Traditional Queueing System (RabbitMQ)
Several systems offer functionality similar to Kafka, such as ActiveMQ, RabbitMQ, and Apache Flume. Let us compare Kafka and RabbitMQ.
Kafka uses a partitioned log model, which combines the message queue and publish-subscribe approaches.
RabbitMQ uses a message queue model.
Kafka supports message retention controlled by policy; for example, messages may be retained for one day. The user can configure this retention window.
RabbitMQ does not retain messages in this way. It is acknowledgement-based: as soon as a message is consumed and acknowledged, it is deleted from the queue.
Kafka combines the queuing and publish-subscribe models. It allows multiple consumers to subscribe to the same topic, and each message is delivered to every subscribed consumer group.
RabbitMQ is based on queuing, so multiple consumers of the same queue do not all receive the same message, because messages are removed as they are consumed.
In Kafka, topics are replicated automatically, but the user can configure a topic not to be replicated.
In RabbitMQ, messages are not replicated automatically, but the user can configure them to be replicated.
Kafka is built on a reliable, distributed log, and stream-processing semantics are built into Kafka Streams.
In RabbitMQ, consumers are simply FIFO-based, reading and processing messages sequentially.
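The retention difference in the comparison above can be illustrated with two toy classes. Both `RetainedLog` and `AckQueue` are hypothetical names; the first keeps records until they age out of a retention window, the second removes a message as soon as it is consumed. Real Kafka expires data in larger units than single records, so this is a simplification.

```python
from collections import deque


class RetainedLog:
    """Log whose records stay readable until they exceed the retention window."""

    def __init__(self, retention_ms):
        self.retention_ms = retention_ms
        self._records = []   # list of (timestamp_ms, value)

    def append(self, timestamp_ms, value):
        self._records.append((timestamp_ms, value))

    def expire(self, now_ms):
        """Drop records older than the retention window."""
        self._records = [(t, v) for t, v in self._records
                         if now_ms - t < self.retention_ms]

    def read_all(self):
        """Reading does not remove anything, so it can be repeated."""
        return [v for _, v in self._records]


class AckQueue:
    """Queue whose messages disappear once they are consumed."""

    def __init__(self):
        self._queue = deque()

    def publish(self, value):
        self._queue.append(value)

    def consume(self):
        """Remove and return the next message, or None if the queue is empty."""
        return self._queue.popleft() if self._queue else None
```

Reading the retained log twice yields the same records, whereas a message taken from the queue is gone for every later consumer.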
Kafka is becoming the backbone of many applications that deal with large volumes of data. Using Apache Kafka as a message bus gives a high level of parallelism and decoupling between data producers and data consumers, making the architecture more flexible, scalable, and adaptable to changing needs.