Kafka in the Big Data ecosystem

Ovidiu Petridean
Senior Software Developer @ SDL Research

This article presents an overview of the concepts that make up the Kafka system and the role they play in the Big Data ecosystem.

The Big Data world is becoming more and more popular, and interest in the technologies of this ecosystem is growing. Analytics is often described as one of the most intricate challenges associated with Big Data. We will not focus on why this challenge arises or how it can be approached because, before analytics can be performed, data has to be integrated and made available to enterprise users. That is where Apache Kafka comes in.

Apache Kafka

Kafka, originally developed at LinkedIn, is an open-source system for managing real-time streams of data from websites, applications and sensors.

Kafka is a high-quality open-source project, which has drawn the attention of many contributors. Its adoption has grown significantly. Some of the best-known users are Uber, Twitter, Netflix, Yahoo and Cisco.

Apache Kafka is a pub-sub system built as a distributed commit log.

The basic messaging terminology includes:

Topics = feeds of messages organised in categories
Producers = processes that publish messages to a Kafka topic
Consumers = processes that subscribe to topics and process the feed of published messages
Kafka is run as a cluster consisting of one or more servers, each of which is called a broker.
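
The terminology above can be illustrated with a toy, in-memory model. The sketch below is not the Kafka API: the class names Broker, Producer and Consumer, their methods and the topic name "page-views" are all hypothetical, and a real broker is distributed and persists its log to disk. It only shows how a topic behaves like an append-only log that producers write to and consumers read from by offset.

```python
from collections import defaultdict


class Broker:
    """Toy in-memory stand-in for a single broker (hypothetical, not the Kafka API)."""

    def __init__(self):
        # Each topic is an append-only list of messages: a tiny "commit log".
        self.topics = defaultdict(list)

    def append(self, topic, message):
        """Append a message to the topic's log and return its offset."""
        log = self.topics[topic]
        log.append(message)
        return len(log) - 1

    def read(self, topic, offset):
        """Return all messages in the topic from the given offset onwards."""
        return self.topics[topic][offset:]


class Producer:
    """A process that publishes messages to a topic."""

    def __init__(self, broker):
        self.broker = broker

    def publish(self, topic, message):
        return self.broker.append(topic, message)


class Consumer:
    """A process that subscribes to one topic and reads it sequentially."""

    def __init__(self, broker, topic):
        self.broker = broker
        self.topic = topic
        self.offset = 0  # position of the next message to read

    def poll(self):
        """Return the messages published since the last poll."""
        messages = self.broker.read(self.topic, self.offset)
        self.offset += len(messages)
        return messages


broker = Broker()
producer = Producer(broker)
consumer = Consumer(broker, "page-views")

producer.publish("page-views", "user 1 opened /home")
producer.publish("page-views", "user 2 opened /cart")
first_poll = consumer.poll()   # both messages, in publication order
second_poll = consumer.poll()  # nothing new has been published since
```

Note that reading does not remove anything from the log; the consumer only advances its own offset.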

An overview of the architecture can be presented like this:

Collecting the data

The consensus is that Apache Kafka and similar systems are part of the evolution and the diversification of the Big Data ecosystem. A vice-president and principal analyst with Forrester Research reports that, up until 2013 or so, "big data was all about massive quantities of data stuffed into Hadoop; now, if you're not doing that, you're already behind the power curve".

This statement stresses the obvious: collecting data is crucial for analytics. The large number of systems that generate data makes the collection process a challenge. Kafka allows you to move large amounts of data and makes it available as a real-time stream, ready for user consumption.

The Internet of Things is one of the major contributors when it comes to generating large amounts of data. This amount is clearly growing, given that more and more devices are connected to the Internet. As the number of sensors and devices we wear increases, we will need to move, catalog and analyze larger amounts of data.

Up to 2014, it was all about Hadoop. Then it was about Spark. Now, it is about Hadoop, Spark and Kafka. These are the three equal peers in the data-ingestion pipeline of modern analytical architecture.

Feeding specialized systems

There is a large variety of data types, ranging from transaction records to user tracking data, operational metrics, service logs, etc. Often, the same data set needs to be fed to multiple specialized systems. For example, while application logs are useful for offline log analysis, it is equally important to be able to search individual log entries. It therefore becomes infeasible to build a separate pipeline that collects each type of data and feeds it directly into each relevant specialized system. While Hadoop typically holds a copy of all types of data, it is impractical to feed all other systems off Hadoop, since many of them require more real-time data than Hadoop can provide.

Another example is user-generated data. Kafka could be used both for inter-service communication and for delivering the data to consumers, while Hadoop could be used for storage and further analysis. In this way, a single data stream is processed by two consumers in two different ways.

Kafka can also store high-volume data on commodity hardware. Kafka is designed as a multi-subscription system. The same published data set can be consumed multiple times. It persists data to disks and can deliver messages to both real-time and batch consumers at the same time, without performance degradation.
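
A minimal sketch of the multi-subscription idea, assuming a single in-memory log and one offset per consumer: because consuming only advances the reader's own offset, a real-time consumer and a batch consumer can read the same published data independently. The function names publish and poll and the consumer names "realtime-dashboard" and "hadoop-loader" are hypothetical and do not come from Kafka itself.

```python
log = []      # the shared, append-only topic log
offsets = {}  # consumer name -> offset of the next message to read


def publish(message):
    """Append a message to the shared log."""
    log.append(message)


def poll(consumer):
    """Return the messages this consumer has not seen yet and advance its offset."""
    start = offsets.get(consumer, 0)
    messages = log[start:]
    offsets[consumer] = len(log)
    return messages


realtime_seen = []
for event in ["login", "click", "logout"]:
    publish(event)
    # The real-time consumer polls right after every publish.
    realtime_seen.extend(poll("realtime-dashboard"))

# The batch consumer polls once, later, and still receives every message.
batch_seen = poll("hadoop-loader")
```

Both consumers end up with the same three events in the same order, and neither read affects what the other one sees.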

Often, Kafka acts as a central nervous system that collects high-volume data in real time.

This stream data platform is presented in the next image.

One of the great advantages of this platform is that we can always add a new specialized system to consume data published to Kafka. This is significant for the development prospects of the Big Data ecosystem. We will probably see more platforms using a pub-sub system like Kafka, which will play an important role as more companies require real-time, high-volume data processing. One consequence is that we may have to rethink the data curation process. Currently, much of the data curation, such as schematizing the data and evolving the schemas, is deferred until after the data is loaded into Hadoop. This is not ideal for stream data platforms, because the same data curation process would have to be repeated in other specialized systems as well. It is better to solve data curation issues early, when the data is ingested into Kafka.
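
One way to picture curation at ingestion time is to check every record against a schema before it is published, so that downstream systems do not each have to repeat the check. The sketch below is only an illustration under stated assumptions: the schema PAGE_VIEW_SCHEMA, its fields and the helper functions validate and ingest are hypothetical, and a plain list stands in for the topic. Real deployments would more likely rely on a serialization format with explicit schemas, such as Avro.

```python
# Hypothetical schema for a page-view record: field name -> expected type.
PAGE_VIEW_SCHEMA = {"user_id": int, "url": str, "timestamp": float}


def validate(record, schema):
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for field, expected_type in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"wrong type for {field}")
    for field in record:
        if field not in schema:
            problems.append(f"unexpected field: {field}")
    return problems


def ingest(record, schema, topic_log, rejected):
    """Publish the record only if it conforms; otherwise set it aside."""
    problems = validate(record, schema)
    if problems:
        rejected.append((record, problems))
        return False
    topic_log.append(record)
    return True


topic_log, rejected = [], []
ok = ingest({"user_id": 1, "url": "/home", "timestamp": 1.0},
            PAGE_VIEW_SCHEMA, topic_log, rejected)
bad = ingest({"user_id": "1", "url": "/home"},
             PAGE_VIEW_SCHEMA, topic_log, rejected)
```

Only the first record reaches the topic; the second is kept aside together with the reasons it was rejected, so every consumer of the topic can assume well-formed data.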

Kafka Ecosystem

The Kafka ecosystem includes many related systems and integrations: Stream Processing, Hadoop Integration, Search and Query, AWS Integration, Logging and Metrics.

Stream Processing

Storm - A stream-processing framework
Samza - A YARN-based stream processing framework
Storm Spout - Consumes messages from Kafka and emits them as Storm tuples
Kafka-Storm - Kafka 0.8, Storm 0.9, Avro integration
Spark Streaming - Kafka receiver supports Kafka 0.8 and above

Hadoop Integration

Camus - LinkedIn's Kafka=>HDFS pipeline. This one is used for all data at LinkedIn, and it works great.
Kafka Hadoop Loader - A different take on the Hadoop loading functionality, different from what is included in the main distribution
Flume - Contains Kafka Source (consumer) and Sink (producer)
Kangaroo - A tool that consumes data from Kafka using various formats and compression codecs

Search and Query

Elasticsearch - The Kafka Standalone Consumer project reads messages from Kafka, then processes and indexes them in Elasticsearch.
Presto - The Presto Kafka connector allows you to query Kafka in SQL using Presto.
Hive - Hive SerDe allows you to query Kafka (Avro only for now) using Hive SQL.

Logging

Syslog producer - A syslog producer that supports both raw data and protobuf with metadata for deep analytics usage
Logstash Integration - Integration of Kafka with Logstash

Conclusions

It has become obvious that the development of Big Data technology is impressive and that the adoption rate for this technology is growing.

Given the Big Data context, which, of course, is not limited to Hadoop, Kafka plays an important role. Its growing popularity is mostly the result of being a high-quality project supported by some of the most important players in the industry.

Due to the large number of sensors that we now have, including the ones in smartphones, smartwatches, fitness devices and smart homes, it is clear that the amount of generated data is increasing.

This is where Kafka and the other Big Data systems play a crucial role, because they complement each other. Hadoop satisfies the need to store data in HDFS and to perform analysis, while Kafka provides high-speed transport and distribution of data to multiple destinations.