Real-Time Stream Processing in the Big Data Realm

Adela Mureșan
Senior Big Data Developer @ VE Interactive

PROGRAMMING

We live in a world that is governed by data, a volume of data that grows exponentially each year. No wonder that we hear more and more about "Big Data technologies". If, at the very beginning, the Big Data ecosystem was simply a collection of tools meant to cope with the data volume, now it also has to address the speed problem.

As everything happens so fast around us and data driven decisions are taken on the spot, the real-time processing of events takes the lead stage in this whole Big Data realm.

The paper takes a look at some options to setup such a system, capable of ingesting and processing a stream of data in real time.

Overview of Real-Time Processing Systems

First, we must make clear what real-time processing means. Real-time processing is the term used for a system that takes some data and does something meaningful with it, in a fixed and short amount of time. When we speak of real time, sometimes we also refer to near real-time processing. What does this mean? It means that there is a small delay between the time when the action happens and the moment it is handled. This small delay is often in terms of milliseconds or seconds which, in most cases, can be considered to be instant. This high velocity is what defines the real-time processing systems.

For the whole real-time stream processing system to function, there must be three defined stages: data creation, data ingestion and data processing.

What data is collected? Everything. We are surrounded by connected devices that send data to a system that takes it and computes some intelligent actions or simply some reports out of it, to understand better the whole setup. Since 2013, more data has been created than in the entire history of the human race. From user clicks, user actions on a site, the car, GPS tracks to animal behavior, city sensors and not only, the data is collected. Different companies have realized the value of data and decided to gather it.

What does real-time processing involve? After the data is collected through some ingestion systems, all of this data is later used for analytics. Many times, the processing has to happen fast so that a system can react to changing conditions. This is required for in-site optimizations, display, trading, fraud detection, system monitoring and many more.

Reacting fast to the stream of incoming data is not feasible with the traditional databases. Luckily, there are many frameworks, optimized for real-time or near real-time processing, based on the capabilities offered by Hadoop and the Hadoop ecosystem.

Data Ingestion

The user action and sensor data is collected and often sent to a server directly or to a messaging queue system. Data can be streamed in real time or ingested in batches. With the stream data, each event is captured as it is delivered by the source client. In the batch mode, on the other hand, the data is stored and sent as small batches containing all the events for that particular time window.

Figure 3 – Kafka system

There are two options to gather the data: either the client can have a piece of code that sends the events (embedded within itself), or the client is passively logging events, and the processing system identifies the new data and, using some data ingestion tools, collects the new information.

Several ingestion tools ease the data ingestion and we can have a brief look at some of them:

Apache NiFi
Apache Flume
Amazon Kinesis
Apache Kafka

Apache NiFi

Apache NiFi is one project that is getting more and more attention in the Hadoop world. It is a versatile tool that can help us perform data ingestion and manipulation from various data sources easily. It is an excellent source for ingesting real-time streaming data, and for filtering it as needed. It can also work as a messenger and send the data to the desired location.

Figure 2 – An example of a NiFi flow with different processors

NiFi can be installed in multiple modes, from very simple devices with cloud capabilities up to a cluster of machines, in case of more complex or resource intensive NiFi tasks. After installation, it will offer a web-based interface where the user can define a sequence of steps.

Each step is served by something called NiFi processors. There are more than 180 processors that are easily configurable to fetch, transform or send data. Basically, NiFi was designed to automate the data flow, whether it fetches data using HTTP, or it reads files exposed through FTP, or it tweets real-time Twitter feeds. After the fact, it parses, it transforms and it saves the data.

Apache Flume

Flume is another Apache-powered project that is designed to efficiently collect and move large data sets. Based on the architecture of data flow streaming, it is a good fit for real-time data collection.

Apache Kafka

In the top of the tools used when building real-time steaming systems we have Apache Kafka. Kafka was designed to be a distributed streaming platform that acts as a message broker handling a queue of messages. It provides high-throughput, and low-latency management for real-time data feeds. It enables us store streams in a fault tolerant way. It also enables us to process the data as it occurs.

Figure 3 – Kafka system

Kafka offers many producers and consumers the possibility to subscribe to streams of records. The possibility of subscribing to real-time events makes Kafka a good fit for very low latency pipelines.

Processing Tools

In the past years, real-time processing has taken the lead in the Big Data realm. A number of powerful and open-source platforms have been created to facilitate this. Three of the most notable ones are: Apache Storm, Apache Spark with Spark Streaming and Apache Flink.

Apache Storm

Apache Storm has become an Apache top level project in September 2014. Storm is designed to be as real time as possible. Its' architecture is based on connecting "spouts" (for input streams) and "bolts" (for processing and output). Together, the spouts and the bolts form what it is called a topology.

Once a topology is started, the spouts bring data into the system and immediately hand it off to the bolts. This architecture makes Storm well suited for stream processing. The are many spouts available for different data sources from Apache Kafka, Twitter streaming API and many more, or we can even define our own custom spout.

Spark Streaming

Apache Spark was created by Matei Zaharia, born in Romania. During his Ph.D., he was a UC Berkeley student and started working on the Spark project. Later, he donated the codebase to the Apache Software Foundation, which maintained Spark ever since. Since its initial release in 2014, as a top level Apache project, Spark has constantly evolved, with more than 1000 contributors, to its open-source project. One of the best benefits of Spark is the different modules that it offers:

Spark Core - used for batch processing
Spark Streaming - for streaming data
Spark SQL - for SQL like queries
SparkML - for machine learning
GraphX - for graph processing

Spark Streaming, similar to Storm is designed for massive scalability and is offering the means to process in real time large volume of data.

The main difference between Spark and Storm is that Storm reacts immediately to a change of data, while Spark's model leverages the notion of micro-batch processing. What this mean is that it handles the events in near real time. Instead of processing the streaming data one record a time, Spark accepts data in parallel and buffers it in the Spark's cluster nodes memory. The processing happens in micro-batches, in a time window as defined in the Spark job configuration.

What to choose between Spark and Storm?

If we were to start a project that we know that it will handle only real time streaming of events and if latency is key, then Storm would be a better fit.

On the other hand, if we already have a Hadoop or Mesos cluster and the project is more complex, involving also machine learning, SQL queries or graph processing, then Spark would be more suitable. Also for Spark, the team could already come with the experience of batch processing using Spark Core. Spark Streaming is very similar to Spark core, and it uses the same concepts of RDDs (Resilient Distributed Datasets) from Spark. The only difference between a streaming and a batch processing job is the setup, while all the other concepts remain the same.

Apache Flink

Apache Flink is another Big Data processing framework, which also started as a research project. In June 2015, Flink had its first release as an Apache top-level project. During the last months, it started to capture the attention of many Big Data developers.

While Spark processes data in micro-batches, Flink is designed to process streams of data in real time. Flink's execution engine is similar to Storm for the stream processing. However, it also offers the possibility of batch processing, just as Spark, a capability that Storm lacks.

The downside of Flink is that, since it is more recent, it is still under consolidation and it does not have the same adoption rate as Spark or Storm yet. However, it is a tool with the potential to evolve.

Conclusion

The volume of data is rapidly growing and the demand for real-time stream processing is increasing, as well. There are many powerful, paid or open-source tools, frameworks and platforms that are addressing this need. They are all constantly improving, facilitating the development of a stream processing system. We can predict the wide adoption rate of these technologies by many companies that enter the Big Data realm.