Apache Kafka has been the go-to publish-subscribe (pub-sub) messaging system for a while. It offers functionality for a wide range of enterprise use cases, along with a large ecosystem of tools and a dedicated community. But lately, upstart Apache Pulsar has been gaining ground.
Pulsar takes the best parts of Kafka and expands on them to solve problems that were outside Kafka's original design scope. More recently, however, Kafka has been adding these capabilities too. In this session, we cover the differences between the two and discuss how to choose between them based on your use case and the current state of each.
Kafka is more popular than Pulsar. It has already reached the Late Majority, and even the Laggards, stage of the technology adoption life cycle. Many companies use Kafka, and it is widely considered a vital component of big data architectures.
Pulsar, on the other hand, is much newer, so its adoption is not as broad as Kafka's, although many companies do use it and adoption is growing. Yahoo open-sourced Pulsar specifically to address some of the problems with Kafka.
- Kafka Architecture vs Pulsar Architecture
- Architectural Components & Consuming patterns
- Tiered storage & Geo-replication
- Message Size
- Kafka vs Pulsar Performance
- Community & Open-source
In Kafka, a producer writes messages to a Kafka broker. The broker stores the messages on local or network storage, depending on the configuration; in most cases, it is local storage.
- One of Kafka's advantages is its high availability: messages are replicated to other brokers, so data is not lost if one of the brokers fails. The network connections can be scaled to add consumers.
- Within a consumer group, each partition is read by at most one consumer, so if not enough partitions were created in advance, additional consumer instances have nothing to consume. Scaling up consumers then means adding partitions first, which is not straightforward. There are workarounds, but they are tricks rather than out-of-the-box solutions.
- If we add a new broker, we need to move replicas to it manually, and consumers are interrupted while they switch to reading from the new broker, which is not easy to do.
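The partition limit on consumers can be made concrete with a small simulation. This is only a sketch, assuming a simple round-robin assignment inside one consumer group; the function names and consumer ids are made up for illustration and are not part of any Kafka client API.

```python
def assign_partitions(num_partitions, consumers):
    """Give each partition to exactly one consumer in the group, round-robin."""
    if not consumers:
        raise ValueError("the group needs at least one consumer")
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


def idle_consumers(assignment):
    """Consumers that were given no partition and therefore receive no messages."""
    return [consumer for consumer, parts in assignment.items() if not parts]


# Hypothetical group of five consumers reading a topic with three partitions.
plan = assign_partitions(3, ["c1", "c2", "c3", "c4", "c5"])
print(plan)
print(idle_consumers(plan))
```

With three partitions and five consumers, two consumers end up with nothing to read, which is why the partition count has to be planned ahead of consumer scaling.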
Important note: Kafka uses ZooKeeper to coordinate between brokers. The plan is for the brokers themselves to take over ZooKeeper's role this year, but until then both need to be deployed. Since these are two different technologies, it is vital to know how to operate both.
Pulsar's architecture starts out similar to Kafka's: a producer writes messages to a Pulsar broker.
- The broker in Pulsar is stateless, meaning it does not store the messages itself. Instead, it hands them to a separate component, BookKeeper, whose storage nodes are called bookies. The advantage is that a Pulsar broker can connect to, and send data to, different bookies.
- BookKeeper is relatively easy to scale: adding more bookies, each with its own dedicated storage, is simple. Furthermore, because brokers are stateless in Pulsar, new brokers can be added and the broker layer scaled up and down based on load. This significant capability differentiates Pulsar from Kafka.
- We can easily add new consumers, or scale them down, without running into the problems we face in Kafka.
- The downside of Pulsar's architecture is that there are two network hops to maintain: one from producers to brokers and one from brokers to bookies. Like Kafka, Pulsar also has a ZooKeeper component, but Pulsar relies on it more heavily, so it is here to stay.
Kafka has only the broker, since ZooKeeper is on its way out. Fewer components mean fewer network hops, so Kafka's latency is better: a message does not have to cross the network twice. The footprint is also smaller in Kafka, because every component must be made highly available and Kafka has fewer of them. In Pulsar, managing brokers is different from managing BookKeeper, as they are independent projects, although there are some similarities; both run on the JVM.
On the other hand, consumption patterns are more flexible in Pulsar. Scaling up and down does not impact consumers, which makes it easier to scale brokers and add more storage. This flexibility can be highly advantageous when running Pulsar on Kubernetes.
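One way to picture this flexibility is a shared-style subscription, where the broker hands each message to the next available consumer instead of tying consumers to partitions. The following is a simplified sketch under that assumption; `dispatch_shared` and the consumer ids are hypothetical names, not the Pulsar client API, and real dispatch also takes consumer availability into account.

```python
from itertools import cycle


def dispatch_shared(messages, consumers):
    """Hand each message to the next consumer in turn, independent of partitions."""
    if not consumers:
        raise ValueError("at least one consumer is required")
    received = {consumer: [] for consumer in consumers}
    next_consumer = cycle(consumers)
    for message in messages:
        received[next(next_consumer)].append(message)
    return received


# Six messages on one topic, spread over four hypothetical consumers.
messages = ["m0", "m1", "m2", "m3", "m4", "m5"]
print(dispatch_shared(messages, ["c1", "c2", "c3", "c4"]))
```

Every consumer gets work no matter how many there are, so adding or removing one only changes the rotation. The trade-off in this pattern is that message order is no longer preserved across consumers.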
It is possible to run Kafka on Kubernetes; there are some operators out there that are quite mature. However, there are not many advantages to deploying Kafka on Kubernetes. Because Kafka brokers are stateful, scaling the cluster up and down is not something Kubernetes can do transparently, and neither is restarting a faulty broker, deleting it, and creating it again. Both are hugely problematic in Kafka.
Tiered storage is available in Pulsar and is more mature than in Kafka. It supports S3 and several other object stores. Moreover, tiered storage is part of the open-source Pulsar product.
Tiered storage in Kafka is relatively new. It is currently available only in commercial Kafka offerings and is therefore not open source yet.
Geo-replication is better in Pulsar, which supports multiple topologies, such as active-standby, full mesh, and edge aggregation. In Kafka, geo-replication is more complicated: it is hard to maintain and quite costly to operate. Some replicators are available in commercial offerings, similar to MirrorMaker 2, but they also come with limitations in maintenance and cost.
Pulsar is multi-tenant. It has namespaces, similar to Kubernetes namespaces, which group topics logically, separate them from the producer and consumer point of view, and can each be managed independently. Kafka, by contrast, is single-tenant and contains only topics.
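The tenant and namespace hierarchy shows up directly in Pulsar topic names, which take the form `persistent://tenant/namespace/topic`. As a minimal sketch, the parser below splits such a name into its parts; the `TopicName` type, the `parse_topic` helper, and the example tenant and namespace are illustrative assumptions, not part of the Pulsar client library.

```python
from typing import NamedTuple


class TopicName(NamedTuple):
    domain: str      # "persistent" or "non-persistent"
    tenant: str
    namespace: str
    topic: str


def parse_topic(name: str) -> TopicName:
    """Split a fully qualified topic name into domain, tenant, namespace and topic."""
    domain, separator, rest = name.partition("://")
    if not separator or domain not in ("persistent", "non-persistent"):
        raise ValueError(f"unexpected topic domain in {name!r}")
    parts = rest.split("/", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected tenant/namespace/topic in {name!r}")
    return TopicName(domain, *parts)


# Hypothetical tenant "acme" with a "billing" namespace.
print(parse_topic("persistent://acme/billing/invoices"))
```

Two teams can therefore both own a topic called `invoices` without colliding, as long as it lives under different tenants or namespaces.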
Message size is one of the main benefits of Pulsar over Kafka. By default, a message can be up to five megabytes in Pulsar. For messages larger than five megabytes, Pulsar has a feature that lets the producer automatically split the message into smaller chunks. The consumer automatically puts the chunks back together to recover the original message.
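The split-and-reassemble idea can be sketched in a few lines. This is not Pulsar's actual implementation: the function names are hypothetical, and the sketch leaves out the bookkeeping a real system needs to match chunks to their original message and keep them in order.

```python
from typing import List

MB = 1024 * 1024


def split_into_chunks(payload: bytes, max_chunk_size: int) -> List[bytes]:
    """Producer side: cut the payload into pieces no larger than max_chunk_size."""
    if max_chunk_size <= 0:
        raise ValueError("max_chunk_size must be positive")
    return [
        payload[start:start + max_chunk_size]
        for start in range(0, len(payload), max_chunk_size)
    ]


def reassemble(chunks: List[bytes]) -> bytes:
    """Consumer side: join the chunks, in order, back into the original payload."""
    return b"".join(chunks)


# A 12 MB message against a 5 MB limit becomes chunks of 5, 5 and 2 MB.
payload = bytes(12 * MB)
chunks = split_into_chunks(payload, 5 * MB)
print([len(chunk) // MB for chunk in chunks])
print(reassemble(chunks) == payload)
```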
In comparison, Kafka works best when messages are small, around one kilobyte or less. If a message is over one megabyte, the producer will, by default, refuse to send it. There are configuration settings that allow larger messages in Kafka, but that is not an ideal setup.
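Part of the reason larger messages are awkward in Kafka is that the size limit is enforced in more than one place, so the settings have to be raised together. The sketch below models that with plain numbers. The setting names are real Kafka configuration keys, but the values are illustrative assumptions of roughly one megabyte, not the exact defaults of any particular Kafka version, and the helper function is hypothetical.

```python
MIB = 1024 * 1024

# Illustrative limits only: roughly 1 MiB each, not exact defaults for any version.
default_like_limits = {
    "max.request.size (producer)": 1 * MIB,
    "message.max.bytes (broker)": 1 * MIB,
    "max.message.bytes (topic)": 1 * MIB,
}


def settings_blocking(message_bytes, limits):
    """Return the settings whose limit is smaller than the message size."""
    return [name for name, limit in limits.items() if message_bytes > limit]


# A 500 KB message passes, a 3 MiB message is stopped at every level.
print(settings_blocking(500 * 1024, default_like_limits))
print(settings_blocking(3 * MIB, default_like_limits))

# Raising only the producer limit is not enough; the broker and topic still refuse.
producer_only = dict(default_like_limits, **{"max.request.size (producer)": 10 * MIB})
print(settings_blocking(3 * MIB, producer_only))
```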
Benchmarks published by the Kafka community and by the Pulsar community each show their own system outperforming the other. Compared with a database, both Kafka and Pulsar are equally impressive. Also, if there are no specific latency requirements, performance is not the deciding factor.
To truly determine the best technology for a specific use case, it is best to run benchmarks based on that use case. For most use cases, however, it is best to choose not on performance benchmarks but on all the other factors discussed above.
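A use-case benchmark does not need to be elaborate to be useful. As a rough sketch, the harness below times a send function over a set of representative payloads and reports latency percentiles. The `send` argument is a placeholder you would replace with a real Kafka or Pulsar producer call; everything here is an assumption for illustration, not a full benchmarking methodology.

```python
import math
import time


def measure_latencies_ms(send, payloads):
    """Time each call to send(payload) and return the latencies in milliseconds."""
    latencies = []
    for payload in payloads:
        start = time.perf_counter()
        send(payload)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value covering pct percent of samples."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def fake_send(payload):
    """Placeholder for a real producer call; does no network I/O."""
    return len(payload)


# Use payloads shaped like your own traffic, here 1 KB messages.
samples = measure_latencies_ms(fake_send, [b"x" * 1024] * 1000)
print(f"p50={percentile(samples, 50):.4f} ms  p99={percentile(samples, 99):.4f} ms")
```

Running the same harness against both systems, with your own message sizes and rates, says more about your use case than a published benchmark does.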
Kafka's community is larger and more mature than Pulsar's. However, the Pulsar community, being relatively new, is exciting, and most importantly, the level of support from Pulsar's core maintainers themselves is great. Overall, the support from both communities is excellent.
Pulsar is an Apache project, released under the Apache License. Since Pulsar is new and its adoption is still low, absolutely everything is open source, as far as we know.
Kafka itself, the Broker, and many of its components are part of the Apache Software Foundation. However, some essential components of Kafka, including Schema Registry, are not part of the Apache Software Foundation.
Both Kafka and Pulsar are equally good; they are simply at different points in terms of community and adoption. As Pulsar's adoption grows, we can expect it to go through changes similar to Kafka's.
Join us for future Ask Me Anything sessions! More details on our website.