Get Real-Time IoT Data Analytics Using Apache Kafka and Apache Spark

The Need for IoT Data Analytics

In today’s data-driven landscape, the ability to extract meaningful insights from relevant data has become essential for survival. And we don’t mean just for data analytics companies or technology firms. 

A vast majority of us interact with some form of IoT device or application, sharing, communicating, and functioning as a cluster of interconnected lives. In fact, without access to the ‘right information at the right time,’ we’d have entire countries, large business enterprises, and even our own individual lives spiraling out of control.

The universe of IoT data is huge, and frankly, its sheer volume can be daunting. But it is also bursting with potential to transform our lives for the better if leveraged correctly. In this article, we’ll try to demystify the world of IoT analytics and show you how you can tap into real-time IoT data insights using Apache Kafka and Apache Spark.

Tap into the MQTT Telemetry Flow

One of the most essential concepts in IoT analytics is the configuration of data reception points for different devices. Based on the nature of the endpoints, there are multiple data transportation mechanisms that can be used, and MQTT (Message Queuing Telemetry Transport) is one of the most popular protocols in use today.

MQTT is a lightweight, efficient protocol that enables devices to publish their telemetry data. To move forward with our case, we’ll use MQTT as the source of our IoT data.

So, step one is to create MQTT connectors. These are essentially MQTT clients that subscribe to the specific topics where the IoT devices publish their data. By creating MQTT connectors, we establish real-time access to the telemetry data flowing through the MQTT broker. The connectors usually do not perform analytics themselves; their job is to cleanse the data and push it to downstream systems such as Kafka, where it is processed and analyzed.
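To make this concrete, here is a minimal sketch of such a connector in Scala, using the Eclipse Paho MQTT client and a Kafka producer. The broker URLs, topic names, and the trivial cleansing step are assumptions made purely for illustration.

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
    import org.eclipse.paho.client.mqttv3.{IMqttDeliveryToken, MqttCallback, MqttClient, MqttMessage}

    object MqttToKafkaConnector {
      def main(args: Array[String]): Unit = {
        // Kafka producer configuration (broker address is an assumption)
        val props = new Properties()
        props.put("bootstrap.servers", "localhost:9092")
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        val producer = new KafkaProducer[String, String](props)

        // MQTT client that subscribes to the device telemetry topic (broker URL and topic are assumptions)
        val mqttClient = new MqttClient("tcp://localhost:1883", MqttClient.generateClientId())
        mqttClient.setCallback(new MqttCallback {
          override def connectionLost(cause: Throwable): Unit = ()
          override def deliveryComplete(token: IMqttDeliveryToken): Unit = ()
          override def messageArrived(topic: String, message: MqttMessage): Unit = {
            // Minimal cleansing: trim the payload, then push it to Kafka for downstream processing
            val payload = new String(message.getPayload).trim
            producer.send(new ProducerRecord[String, String]("iot-telemetry", payload))
          }
        })
        mqttClient.connect()
        mqttClient.subscribe("devices/+/telemetry")
      }
    }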

How to Use Apache Kafka to Manage the Data Flow

The next step is to use a distributed messaging infrastructure like Apache Kafka to queue up the cleansed, transformed data for real-time analytics. The reasons for preferring Kafka are two-fold. One, it is powerful enough to handle the high volume of IoT data and ensure a fault-tolerant data flow. Two, Kafka topics can be partitioned, enabling distributed and parallel processing of incoming data and enhancing the performance of our IoT analytics pipeline.

Unlike traditional messaging systems, Kafka supports multiple subscribers and automatically rebalances the data flow to subscribers during failures. In our case, Kafka acts as an intermediary between the MQTT connectors and the IoT analytics platform. Not only does it guarantee seamless data ingestion, but its inherent distributed design also makes it very easy to scale.
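As a small illustration, a partitioned topic for the telemetry stream can be created programmatically through Kafka's AdminClient; the topic name, partition count, and broker address below are assumptions.

    import java.util.{Collections, Properties}
    import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}

    object CreateTelemetryTopic {
      def main(args: Array[String]): Unit = {
        val props = new Properties()
        // Broker address is an assumption; point this at your own Kafka cluster
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
        val admin = AdminClient.create(props)

        // Four partitions let up to four consumers in the same group read the stream in parallel
        val topic = new NewTopic("iot-telemetry", 4, 1.toShort)
        admin.createTopics(Collections.singletonList(topic)).all().get()
        admin.close()
      }
    }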

Plug into Apache Spark for Real-Time Data Magic

Now that we’ve established a smooth flow of data from devices to MQTT to Kafka, it’s time to add our main analytical engine to the flow. Our first choice for live stream data processing is Apache Spark. Not only does it connect beautifully with Kafka, but it is also known for its continuous data processing capabilities.

In this example, we will integrate with Spark Streaming, which provides an API for processing real-time streaming data and is just one of the many libraries Spark offers for parallel data analysis. This integration enables us to perform stream analytics on the data coming from Kafka topics, or to push it down to a database or file system for further processing.

Here, once the connection is established, Spark Streaming receives the cleansed and transformed data and makes it available in DStream format. These DStreams are then pushed to file systems and databases so that we can perform complex analytics and derive valuable insights.

The Code Snippet

It is now time to look into the code. 

We have used Spark 3.0.1, and the code samples below are written in Scala.

We have used Maven for dependency management and builds. Before getting into the code, below are the dependencies used in the pom.xml file.
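A representative set of dependency entries for Spark 3.0.1 with Scala 2.12 is sketched below; the artifact versions are assumptions and should be aligned with your own environment.

    <dependencies>
      <!-- Spark Core -->
      <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.12</artifactId>
        <version>3.0.1</version>
      </dependency>
      <!-- Spark Streaming -->
      <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_2.12</artifactId>
        <version>3.0.1</version>
      </dependency>
      <!-- Spark SQL -->
      <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.12</artifactId>
        <version>3.0.1</version>
      </dependency>
      <!-- Spark Streaming integration for Kafka -->
      <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-kafka-0-10_2.12</artifactId>
        <version>3.0.1</version>
      </dependency>
      <!-- Kafka client library -->
      <dependency>
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka-clients</artifactId>
        <version>2.4.1</version>
      </dependency>
      <!-- Hadoop client, used when writing to HDFS -->
      <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.7.4</version>
      </dependency>
    </dependencies>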

This pom.xml covers the dependencies for Spark Core, Spark Streaming, Spark SQL, Kafka, and Hadoop.

Additional dependencies can be included based on project requirements.

Let’s start a sample program with the main method.
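A minimal sketch of this program is shown below; the object name and application name are placeholders, and the placeholder comment marks where the Kafka-specific code from the next section goes.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object IotStreamProcessor {
      def main(args: Array[String]): Unit = {
        // The Kafka topic name is passed as a program argument, so the job can attach to any topic
        val kafkaTopic = args(0)

        // Batch interval: every 10 seconds the data received so far is processed as one DStream batch
        val streamInterval = Seconds(10)

        // Spark configuration; local[*] runs the job on the local machine using all available cores
        val sparkConf = new SparkConf()
          .setAppName("IotStreamProcessor")
          .setMaster("local[*]")

        // Everything that runs in the program is based on this Spark context
        val sc = new SparkContext(sparkConf)

        // The StreamingContext initiates Spark Streaming with the 10-second batch interval
        val ssc = new StreamingContext(sc, streamInterval)

        // ... Kafka connection and per-batch processing go here (see the next snippet) ...

        ssc.start()
        ssc.awaitTermination()
      }
    }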

Here is an explanation of what the above code does.

  • The main method accepts the Kafka topic name as a parameter so that the program can connect to the topic dynamically when it starts.
  • The Spark Streaming batch interval variable is initialized to 10 seconds (val streamInterval = Seconds(10)). Spark Streaming works as a continuous series of DStream batches, so every 10 seconds Spark processes the data received up to that point as one DStream.
  • Next, it creates the Spark config object; the call .setMaster("local[*]") runs the job on the local machine. A SparkContext is then created from this config object, and everything that runs in the Spark program is based on this context.
  • Finally, a StreamingContext object is created to initiate Spark Streaming, with the streaming interval set to 10 seconds.

The next step is to establish the Kafka connection and transform the JSON data in the DStream into a DataFrame.
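Continuing the sketch, the following would slot in at the placeholder comment in the main method above; the broker address and consumer group ID are assumptions, and the imports belong at the top of the file.

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

    // Kafka connection settings; the bootstrap server address and group ID are assumptions
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "iot-analytics-group",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    // Subscribe to the topic passed in on the command line; messages arrive as a DStream
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Array(kafkaTopic), kafkaParams)
    )

    // For each 10-second batch, parse the JSON payloads into a DataFrame
    stream.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        val spark = SparkSession.builder().config(rdd.sparkContext.getConf).getOrCreate()
        import spark.implicits._
        val streamDF = spark.read.json(rdd.map(_.value()).toDS())
        streamDF.show()
      }
    }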

  • A kafkaParams object is defined with the necessary configuration to connect to the Kafka server. Here we specify the Kafka bootstrap server address, the consumer group ID, and the key and value deserializer classes.
  • The KafkaUtils.createDirectStream method is called to subscribe to our Kafka topic. This method collects the messages received on the topic and returns them as a DStream object.
  • The foreachRDD method is used to read each batch of the DStream and perform the necessary transformations on the data.

Here, the data in each RDD is in JSON format, and it is transformed into a DataFrame that matches the structure of the JSON. streamDF.show() will display the current RDD data as a DataFrame.

Once the input JSON is transformed into a DataFrame, it can be used for further analysis or saved into a file or database. There are multiple options for processing the data further, and the choice depends on the particular analytical requirement.

For our example, we are sharing the below code sample to show how to save a DataFrame into HDFS storage in Parquet format. Alternatively, the file can be uploaded to cloud storage such as AWS S3, Azure Data Lake Storage, and the like.
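A sketch of that write, assuming a hypothetical HDFS path, would be a single statement inside the foreachRDD block:

    // Append the current batch to a Parquet dataset on HDFS (the path is an assumption)
    streamDF.write
      .mode("append")
      .parquet("hdfs://namenode:9000/iot/telemetry")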

Below is the sample code to save the DataFrame into a PostgreSQL table.
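A sketch of that JDBC write, with assumed connection details and table name (this also requires the PostgreSQL JDBC driver on the classpath):

    // Write the current batch to a PostgreSQL table over JDBC (URL, credentials, and table are assumptions)
    streamDF.write
      .format("jdbc")
      .option("url", "jdbc:postgresql://localhost:5432/iotdb")
      .option("dbtable", "telemetry")
      .option("user", "postgres")
      .option("password", "postgres")
      .option("driver", "org.postgresql.Driver")
      .mode("append")
      .save()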