In today’s data-driven landscape, the ability to extract meaningful insights from relevant data has become essential for survival. And we don’t mean just for IoT data analytics companies or technology firms.
A vast majority of us interact with some form of IoT device or application every day, sharing data, communicating, and functioning as a cluster of interconnected lives. In fact, without access to the right information at the right time, entire countries, large business enterprises, and even our own individual lives could spiral out of control.
The universe of IoT data is huge and, quite frankly, overwhelming in sheer volume. But it also holds enormous potential to transform our lives for the better if leveraged correctly. In this article, we’ll demystify the world of IoT analytics and show you how to tap into real-time IoT data insights using Apache Kafka and Apache Spark.
One of the most essential concepts in IoT analytics is the configuration of data reception points for the various devices. Depending on the nature of the endpoints, there are multiple data transport mechanisms to choose from, and MQTT (Message Queuing Telemetry Transport) is one of the most popular protocols in use today.
MQTT is a lightweight, efficient protocol that enables devices to publish their telemetry data. For the rest of this article, we’ll use MQTT as the source of our IoT data.
So, step one is to create MQTT connectors. These are essentially MQTT clients that subscribe to the specific topics where the IoT devices publish their data. By creating MQTT connectors, we establish real-time access to the telemetry data flowing through the MQTT broker. The connectors usually do not perform analytics themselves; their job is to cleanse the data and push it to downstream systems such as Kafka for processing and analysis.
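The connector code itself isn’t shown in this article, but a minimal sketch of such a connector in Scala might look like the following, assuming an Eclipse Paho MQTT client and a local Kafka broker. The broker addresses, the MQTT topic pattern devices/+/telemetry, and the Kafka topic iot-telemetry are placeholders, not part of the original example.

```scala
import java.util.Properties

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer
import org.eclipse.paho.client.mqttv3.{IMqttDeliveryToken, MqttCallback, MqttClient, MqttMessage}

object MqttToKafkaConnector {

  def main(args: Array[String]): Unit = {
    // Kafka producer that forwards the cleansed telemetry to a Kafka topic
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")                 // assumed broker address
    props.put("key.serializer", classOf[StringSerializer].getName)
    props.put("value.serializer", classOf[StringSerializer].getName)
    val producer = new KafkaProducer[String, String](props)

    // MQTT client that subscribes to the topics where devices publish telemetry
    val mqttClient = new MqttClient("tcp://localhost:1883", MqttClient.generateClientId())
    mqttClient.setCallback(new MqttCallback {
      override def connectionLost(cause: Throwable): Unit =
        println(s"MQTT connection lost: ${cause.getMessage}")

      override def messageArrived(topic: String, message: MqttMessage): Unit = {
        // Minimal "cleansing": trim the payload before handing it to Kafka
        val payload = new String(message.getPayload, "UTF-8").trim
        if (payload.nonEmpty)
          producer.send(new ProducerRecord[String, String]("iot-telemetry", topic, payload))
      }

      override def deliveryComplete(token: IMqttDeliveryToken): Unit = ()
    })

    mqttClient.connect()
    mqttClient.subscribe("devices/+/telemetry")   // assumed MQTT topic pattern
  }
}
```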
The next step is to use a distributed messaging infrastructure like Apache Kafka to line up the clean, transformed data for real-time analytics. The reasons for preferring Kafka are two-fold. One, it is powerful enough to handle the high volume of IoT data and ensure a fault-tolerant data flow. Two, Kafka topics, which can be partitioned, enable distributed and parallel processing of incoming data, enhancing the performance of our IoT analytics pipeline.
Unlike traditional messaging systems, Kafka supports multiple subscribers and automatically rebalances the data flow across subscribers when one fails. In our case, Kafka acts as an intermediary between the MQTT connectors and the IoT analytics platform. Not only does it guarantee seamless data ingestion, but its inherent distributed design also makes it very easy to scale.
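To illustrate the partitioning point, here is a short sketch of how such a topic could be created with Kafka’s AdminClient. The topic name, partition count, and replication factor are assumptions suitable for a local sandbox rather than values from the original setup.

```scala
import java.util.{Collections, Properties}

import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}

object CreateTelemetryTopic {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")  // assumed broker

    val admin = AdminClient.create(props)
    // 6 partitions allow up to 6 consumers in the same group to read in parallel;
    // a replication factor of 1 is fine locally, but not for production.
    val topic = new NewTopic("iot-telemetry", 6, 1.toShort)
    admin.createTopics(Collections.singletonList(topic)).all().get()
    admin.close()
  }
}
```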
Now that we’ve established a smooth flow of data from devices to MQTT to Kafka, it’s time to add our main analytical engine to the flow. Our first choice for live stream data processing is Apache Spark. Not only does it connect beautifully with Kafka, but it is also known for its continuous data processing capabilities.
In this example, we will integrate with Spark Streaming, which provides an API for processing real-time streaming data. It is one of several libraries Spark offers for parallel data analysis. This integration lets us perform stream analytics on the data coming from Kafka topics, or push it down to a database or file system for further processing.
Here, once the connection is established, Spark Streaming receives the cleansed and transformed data and makes it available in DStream format. These DStreams are then pushed to file systems and databases so that we can perform complex analytics and derive valuable insights.
It is now time to look into the code.
We have used Spark 3.0.1. See the code sample below, written in Scala.
We have used Maven for dependency management and builds. Before diving into the code, below are the dependencies used in the pom.xml file.
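The original pom.xml isn’t reproduced here, but a minimal dependency section consistent with Spark 3.0.1 on Scala 2.12 might look like this; the Hadoop version is an assumption and should match your cluster.

```xml
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.12</artifactId>
        <version>3.0.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_2.12</artifactId>
        <version>3.0.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.12</artifactId>
        <version>3.0.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-kafka-0-10_2.12</artifactId>
        <version>3.0.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>3.2.0</version>
    </dependency>
</dependencies>
```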
This pom.xml covers the dependencies for Spark Core, Spark Streaming, Spark SQL, Kafka, and Hadoop. Additional dependencies can be included based on the project requirements.
Let’s start a sample program with the main method.
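The original listing isn’t shown here, so what follows is a minimal sketch of what such a main method could look like, assuming a local master, a 10-second batch interval, and the object name IotStreamingApp (all placeholders rather than the original code).

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

object IotStreamingApp {

  def main(args: Array[String]): Unit = {
    // Spark configuration; the app name and master URL are placeholders
    val conf = new SparkConf()
      .setAppName("IotKafkaSparkStreaming")
      .setMaster("local[*]")            // use your cluster's master URL in production

    // SparkSession for DataFrame operations, StreamingContext for DStreams
    val spark = SparkSession.builder().config(conf).getOrCreate()
    val ssc = new StreamingContext(spark.sparkContext, Seconds(10)) // 10-second micro-batches

    // The Kafka direct stream and the JSON-to-DataFrame transformation
    // are wired in here (shown in the next code sample).

    ssc.start()
    ssc.awaitTermination()
  }
}
```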
Here is an explanation of what the above code does.
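In the sketch above, the main method builds a SparkConf with an application name and master URL, creates a SparkSession (needed later for DataFrame operations), and wraps its SparkContext in a StreamingContext with a 10-second batch interval, so incoming telemetry is processed in 10-second micro-batches. The Kafka stream itself is attached before ssc.start(), which launches the streaming job, and awaitTermination() keeps it running until it is stopped.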
The next step is to establish a Kafka connection and transform the JSON data in the DStream into a DataFrame.
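The original snippet isn’t shown, so the sketch below, which would sit inside the main method above before ssc.start(), illustrates the idea using Spark’s Kafka direct stream. The consumer group, topic name, and broker address are assumptions.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

// Kafka consumer parameters; broker address and group id are placeholders
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "iot-analytics-group",
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

// Subscribe to the Kafka topic fed by the MQTT connector
val topics = Array("iot-telemetry")
val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](topics, kafkaParams))

// Each record's value is a JSON string; pull the values out of the stream
val jsonStream = stream.map(record => record.value)

// For every micro-batch, turn the JSON strings into a DataFrame
jsonStream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    import spark.implicits._
    // spark.read.json infers the schema from the JSON structure
    val StreamDF = spark.read.json(rdd.toDS())
    StreamDF.show()
  }
}
```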
Here, each RDD contains records in JSON format, which are transformed into a DataFrame that matches the structure of the JSON. StreamDF.show() displays the current RDD’s data as a DataFrame.
Once the input JSON is transformed into a DataFrame, it can be used for further analysis or saved to a file or database. There are multiple options for processing the data further; which one to use depends on the particular analytical requirement.
For our example, the code sample below shows how to save a DataFrame to HDFS in Parquet format. The files could also be written to cloud storage such as AWS S3 or Azure Data Lake Storage.
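A minimal version of such a save, assuming the StreamDF DataFrame from the earlier sketch (i.e., placed inside the same foreachRDD block) and a placeholder HDFS path, could look like this:

```scala
// Append each micro-batch to a Parquet dataset on HDFS;
// the namenode address and path are placeholders
StreamDF.write
  .mode("append")
  .parquet("hdfs://namenode:8020/data/iot/telemetry")
```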
Below is a sample snippet for saving the DataFrame into a PostgreSQL table.
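Again, this is a sketch rather than the original listing: a JDBC write along the following lines would do it. The connection URL, credentials, and table name are placeholders, and the PostgreSQL JDBC driver must be on the classpath.

```scala
import java.util.Properties

// JDBC connection details are placeholders; adjust them for your environment
val connectionProps = new Properties()
connectionProps.put("user", "postgres")
connectionProps.put("password", "postgres")
connectionProps.put("driver", "org.postgresql.Driver")

// Append the micro-batch to the iot_telemetry table
StreamDF.write
  .mode("append")
  .jdbc("jdbc:postgresql://localhost:5432/iotdb", "iot_telemetry", connectionProps)
```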
Expeed Software is one of the top software companies in Ohio, specializing in application development, data analytics, digital transformation services, and user experience solutions. As an organization, we have worked with some of the largest companies in the world, helping them build custom software products, automate their processes, accelerate their digital transformation, and become more data-driven businesses. As a software development company, our goal is to deliver products and solutions that improve efficiency, lower costs, and offer scalability. If you’re looking for the best software development in Columbus, Ohio, get in touch with us today.