Kafka – the data conqueror

Big Data involves enormous volumes of data. However we approach the topic, two main challenges always come to the fore. The first, and the most obvious, is collecting that data. The second is analyzing it. Meeting both requires a messaging system, and this is where Kafka comes in. This article gives some preliminary information about the platform; the next publication will look at Kafka from a practical angle.

Kafka – a first look

Designed for high-throughput distributed systems, Apache Kafka is a very good replacement for traditional message brokers. Compared to similar systems, the platform has several additional advantages.

First, better throughput. Second, built-in partitioning. Third, replication. Fourth, fault tolerance. Together, these make the system a strong fit for large-scale messaging.

Messaging system

A messaging system is responsible for passing data from one application to another, so the applications can focus on the data itself rather than on how to share it. Distributed messaging is based on the concept of reliable message queues. Messages are queued asynchronously between the client applications and the messaging system. Two messaging patterns are available: point-to-point and publisher-subscriber (pub-sub). Kafka, like most messaging systems, follows the pub-sub pattern.

The publisher-subscriber system

In a publisher-subscriber system, messages are stored in a topic. Unlike in point-to-point systems, where each message is delivered to a single consumer, consumers can subscribe to one or more topics and receive all the messages in them. Message producers are called publishers and message consumers are called subscribers.
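The difference between the two patterns can be sketched with a small in-memory toy. This is only an illustration, not how Kafka works internally: the class names `PointToPointQueue` and `PubSubBroker`, the topic name `orders`, and the idea of pushing messages into per-subscriber inboxes are all assumptions made for the example.

```python
from collections import defaultdict, deque


class PointToPointQueue:
    """Toy queue: each message is handed to exactly one consumer."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def receive(self):
        # Taking the message removes it, so no other consumer will see it.
        return self._messages.popleft() if self._messages else None


class PubSubBroker:
    """Toy broker: every subscriber of a topic gets every message."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> list of inboxes

    def subscribe(self, topic):
        inbox = []
        self._subscribers[topic].append(inbox)
        return inbox

    def publish(self, topic, message):
        # Hypothetical push delivery, chosen only to keep the sketch short.
        for inbox in self._subscribers[topic]:
            inbox.append(message)


# Point-to-point: the second receiver finds nothing left.
queue = PointToPointQueue()
queue.send("order-1")
first = queue.receive()
second = queue.receive()

# Pub-sub: both subscribers see the same message.
broker = PubSubBroker()
billing = broker.subscribe("orders")
shipping = broker.subscribe("orders")
broker.publish("orders", "order-1")
```

Here `first` holds the message and `second` is empty, while both the `billing` and `shipping` inboxes contain a copy of `order-1`.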

What is Kafka?

Apache Kafka is currently used by Netflix, Pinterest, and Airbnb, among others.

Apache Kafka is a distributed publisher-subscriber messaging system and a robust queue that can handle large volumes of data and pass messages from one endpoint to another. Kafka is suitable for both offline and online message consumption. Messages are persisted on disk and replicated within the cluster to prevent data loss. Kafka was originally built on top of the ZooKeeper synchronization service, although newer releases can run without it in KRaft mode. It integrates well with Apache Storm and Spark for real-time analysis of streaming data.

Benefits

Here are some of the benefits of using the platform:

  • Reliability – Kafka is distributed, partitioned, replicated, and fault-tolerant.
  • Scalability – the messaging system can be easily scaled without downtime.
  • Persistence – Kafka uses a distributed commit log, which means messages are written to disk as quickly as possible, making them durable.
  • Performance – Kafka has high throughput for both publishing and subscribing to messages. It maintains stable performance even when many terabytes of messages are stored.

Kafka is fast and is designed to avoid downtime and data loss.
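To make the terms "partitioned", "replicated", and "log" more concrete, here is a minimal in-memory sketch. It is a simplification under several assumptions: the class `PartitionedLog` is hypothetical, CRC32 is used only as a deterministic stand-in for a key hash, and every replica is updated synchronously inside `append`, which is not how a real cluster copies data between brokers.

```python
import zlib


class PartitionedLog:
    """Toy append-only log split into partitions and copied to replicas."""

    def __init__(self, num_partitions, replication_factor):
        self.num_partitions = num_partitions
        # replicas[r][p] is replica r's copy of partition p.
        self.replicas = [
            [[] for _ in range(num_partitions)]
            for _ in range(replication_factor)
        ]

    def append(self, key, value):
        # The same key always maps to the same partition (assumed hash: CRC32).
        partition = zlib.crc32(key.encode("utf-8")) % self.num_partitions
        for replica in self.replicas:
            replica[partition].append((key, value))
        # The offset is the record's position within its partition.
        offset = len(self.replicas[0][partition]) - 1
        return partition, offset

    def read(self, partition, offset, replica=0):
        # Reading does not remove anything; a consumer just tracks its offset.
        return self.replicas[replica][partition][offset:]


log = PartitionedLog(num_partitions=3, replication_factor=2)
p1, o1 = log.append("user-42", "login")
p2, o2 = log.append("user-42", "logout")

from_start = log.read(p1, 0)
from_second = log.read(p1, 1)
from_other_replica = log.read(p1, 0, replica=1)
```

Both records for `user-42` land in the same partition with consecutive offsets, a consumer can re-read from any offset, and the second replica holds the same data, which is what lets a system like this survive the loss of one copy.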

Usage examples

Kafka is used for many purposes today. For example:

  • Metrics – Kafka is often used for operational monitoring data, for example aggregating statistics from distributed applications to produce a centralized feed of operational data.
  • Log aggregation – Kafka is used across an organization to collect logs from multiple services and make them available to multiple consumers in a standard format.
  • Stream processing – popular frameworks such as Storm and Spark Streaming can read data from a topic, process it, and write the processed data to a new topic, where it becomes available to users and applications. Kafka's strong durability is also useful in stream processing.
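The read, process, write pattern behind these use cases can be sketched without a running cluster. In the example below, topics are plain Python lists, and the topic names `raw-logs` and `error-counts`, the log line format, and the `process` function are all hypothetical; a real pipeline would use a Kafka client and a stream processing framework instead.

```python
from collections import Counter

# Two toy "topics": an input with raw log lines and an empty output.
topics = {
    "raw-logs": [
        "checkout ERROR payment timeout",
        "search INFO query ok",
        "checkout ERROR card declined",
    ],
    "error-counts": [],
}


def process(records):
    """Count ERROR lines per service (assumed format: service level text)."""
    counts = Counter()
    for line in records:
        service, level, _ = line.split(" ", 2)
        if level == "ERROR":
            counts[service] += 1
    return counts


# Read from one topic, aggregate, and publish the result to another topic.
for service, count in sorted(process(topics["raw-logs"]).items()):
    topics["error-counts"].append({"service": service, "errors": count})
```

After the loop, `error-counts` holds one aggregated record per service that logged errors, ready for other consumers to read.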

Kafka is a unified platform for handling real-time data feeds. Above all, it supports low-latency message delivery and guarantees fault tolerance in the presence of machine failures. It can serve a large number of diverse consumers. Kafka is fast, with reported throughput of around 2 million writes per second. It persists all data to disk, which in practice means that writes first go to the operating system's page cache (RAM) before being flushed to the physical device.
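The distinction between "written to the page cache" and "written to the device" can be shown with ordinary file operations. This is a generic operating system illustration, not Kafka code: the function `append_record`, the file name `segment.log`, and the `force_to_disk` flag are assumptions for the example, and Kafka's own flush policy is configurable and not reproduced here.

```python
import os
import tempfile


def append_record(path, record, force_to_disk=False):
    """Append one record to a log file, optionally forcing it to the device."""
    with open(path, "ab") as f:
        f.write(record + b"\n")
        # flush() hands the bytes from Python's buffer to the OS,
        # where they normally sit in the page cache first.
        f.flush()
        if force_to_disk:
            # fsync() asks the OS to write the cached pages to the device.
            os.fsync(f.fileno())


log_path = os.path.join(tempfile.mkdtemp(), "segment.log")
append_record(log_path, b"event-1")                      # fast path
append_record(log_path, b"event-2", force_to_disk=True)  # slower, stronger

with open(log_path, "rb") as f:
    lines = f.read().splitlines()
```

Both records are readable immediately, because reads are served from the same cache. The trade-off is that skipping the forced write is faster but relies on something else, such as replication to other machines, to protect data that has not yet reached the device.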