HPE Streaming Real-time Analytic Platform (HS-RAP) ...

Integrating an end-to-end Big-data Real-time Streaming Intelligent Analytic solution, which spans Data Ingestion, Intelligent Analytic processing and Real-time Visualization, is a challenge for IT professionals. It requires intensive knowledge from multiple domains.

HS-RAP is an integrated technology platform for the rapid development of analytic PoCs or solutions. Its technology domain comprises Big Data, Real-time Intelligent Streaming Analytics and Real-time Visualization, focusing on 3 types of problem model: Prediction, Classification and Outlier Detection. HS-RAP adopts Random Forest Regression, Linear Regression and Logistic Regression for the prediction component. It uses Support Vector Machine and Random Forest Classifier for the classification component. HS-RAP also contains an unsupervised Outlier Detector component, developed by leveraging the multivariate Linear Regression module provided in Apache Spark MLlib.

The platform consists of 3 layers: the Data Ingestion Layer, the Data Processing Layer and the Data Visualization Layer. The Data Ingestion Layer contains components that provide real-time streaming data to the Data Processing Layer. The Data Processing Layer consumes ingested data from multiple Kafka topics in real time and processes the streaming data using complex Machine Learning algorithms. The Data Visualization Layer contains auto-refreshing graph components for viewing the analyzed results in real time. The diagram below illustrates the overall architecture of the platform.

The table below indicates the specification and function of the servers used in developing the platform.
The diagram below illustrates the system architecture of the platform and the interactions between its components.

Data Processing Layer : Prediction Component

The Prediction Component in HS-RAP adopts Random Forest Regression, Linear Regression and Logistic Regression as its algorithms. These were developed on top of MLlib, the machine learning library supplied with Spark.

The Prediction Component learns from and adapts to the historical data (from 0 sec to t sec), and is then able to predict the value at (t+1) sec for decision making and trend analysis. The video on the left-hand side shows the visualization. The green dots represent the actual values of the model, and the yellow dots represent the predicted values.

In the initial state, the yellow and green dots are not aligned, which means the predicted values differ from the actual values of the model. After a certain period of time, however, the yellow dots overlap the green dots. This means the component has learned the model and has started to predict the correct values.
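The learn-then-predict behaviour described above can be illustrated with a minimal sketch. It is not the HS-RAP implementation and does not use Spark MLlib: it assumes a plain online linear model trained by stochastic gradient descent on the two previous values of a synthetic sine signal. The class name, the lag count and the learning rate are all hypothetical choices made for illustration.

```python
import math


class OnlineLinearPredictor:
    """One-step-ahead linear predictor updated by stochastic gradient descent."""

    def __init__(self, n_lags=2, learning_rate=0.05):
        self.learning_rate = learning_rate
        self.weights = [0.0] * n_lags
        self.bias = 0.0

    def predict(self, lags):
        return self.bias + sum(w * x for w, x in zip(self.weights, lags))

    def update(self, lags, actual):
        # Predict first, then learn from the revealed actual value.
        error = self.predict(lags) - actual
        self.weights = [
            w - self.learning_rate * error * x
            for w, x in zip(self.weights, lags)
        ]
        self.bias -= self.learning_rate * error
        return error ** 2


def run_stream(values, n_lags=2, learning_rate=0.05):
    """Feed values one at a time; return the model and per-point squared errors."""
    model = OnlineLinearPredictor(n_lags, learning_rate)
    squared_errors = []
    for t in range(n_lags, len(values)):
        lags = values[t - n_lags:t]  # history available at time t
        squared_errors.append(model.update(lags, values[t]))
    return model, squared_errors


# Synthetic stand-in for the streamed signal (an assumption, not HS-RAP data).
signal = [math.sin(t) for t in range(2000)]
model, squared_errors = run_stream(signal)
early_mse = sum(squared_errors[:50]) / 50
late_mse = sum(squared_errors[-50:]) / 50
print(f"early MSE = {early_mse:.4f}, late MSE = {late_mse:.2e}")
```

In this sketch the early error is large because the weights start at zero, and the late error shrinks as the model adapts, which mirrors the yellow dots gradually overlapping the green dots.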

The graph below shows the Mean Square Error (MSE), normalized to the range 0 to 1. The MSE falls to almost 0 in the later stage of the simulation, which indicates that the Prediction Component is able to learn and adapt to the historical model.
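As a hedged illustration of how such a curve could be produced, the sketch below computes the MSE over a sliding window and rescales it to the range 0 to 1. The source does not state which normalization was used; min-max scaling and the window size are assumptions, and both function names are hypothetical.

```python
def windowed_mse(actual, predicted, window):
    """Mean squared error over each consecutive window of points."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return [
        sum(squared_errors[i:i + window]) / window
        for i in range(len(squared_errors) - window + 1)
    ]


def min_max_normalize(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [0.0 for _ in values]  # flat series: nothing to scale
    return [(v - lowest) / (highest - lowest) for v in values]


# Toy numbers for illustration only.
mse_series = windowed_mse([1.0, 2.0, 3.0], [1.0, 1.0, 1.0], window=2)
normalized = min_max_normalize([2.0, 4.0, 6.0])
```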

The diagram below shows the results of Prediction models implemented using different algorithms. The result for Random Forest Regression is still promising, as only a few high MSE values were obtained. The result for Linear Regression, however, is less promising, which may be due to an incorrect implementation at code level. The same Linear Regression algorithm, implemented as Streaming Regression, shows a perfect result.

Data Processing Layer : Classification Component

The Classification Component in HS-RAP adopts Support Vector Machine (SVM) and Random Forest Classification as its algorithms. Implementations of these algorithms are provided in Spark MLlib and are optimized to use the parallel computational power of the Spark cluster.

One use case applies the Classification Component to identify the gender of a voice streamed to the platform. The Classifier is trained using 3168 recorded voice samples collected from male and female speakers. A fixed set of features is extracted from each sample and supplied to the Classifier as the training set.

During the experiment, the voice features of an unknown speaker are extracted and streamed to the Classifier to determine the gender. With the Random Forest Classification algorithm, the results show 100% accuracy in identifying male voices and 98% accuracy in identifying female voices. With the SVM Classifier, the scores are 100% for male voices and 99% for female voices. With the Logistic Regression algorithm, however, the accuracy is only 97% for male voices and 98% for female voices.
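Per-gender accuracy figures like these are typically obtained by counting, for each true class, how many samples the classifier labelled correctly. The sketch below shows that calculation under that assumption; the function name and the toy labels are hypothetical and do not reproduce the experiment's data.

```python
from collections import Counter


def per_class_accuracy(actual, predicted):
    """Fraction of samples of each true class that were labelled correctly."""
    totals = Counter(actual)
    correct = Counter(a for a, p in zip(actual, predicted) if a == p)
    return {label: correct[label] / totals[label] for label in totals}


# Toy labels for illustration only.
actual = ["male", "male", "female", "female", "female", "female"]
predicted = ["male", "male", "female", "female", "female", "male"]
accuracy = per_class_accuracy(actual, predicted)
```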

Data Processing Layer : Outlier Detector (OD) Component

The Outlier Detector component is developed using Linear Regression and Random Forest Regression Machine Learning (ML) models in combination with Mean Square Error (MSE) to detect outlier data in the streaming pipeline. The component leverages the ML model to adapt to the dataset during the training phase. Once the model has been trained and its MSE lies in the range 0.01 to 0.05, the component can be switched to act as a detector.
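One way to picture the switch from training to detecting is a small controller that watches the recent MSE and changes mode once it settles inside the 0.01 to 0.05 band. This is only a sketch: the source does not say whether the switch is automatic or configured by hand, so the automatic trigger, the window size and the class name are assumptions.

```python
from collections import deque

ADAPTION = "adaption"
DETECTION = "detection"


class ModeController:
    """Switches to detection once the windowed MSE is inside [lower, upper]."""

    def __init__(self, lower=0.01, upper=0.05, window=50):
        self.lower = lower
        self.upper = upper
        self.recent = deque(maxlen=window)  # most recent squared errors
        self.mode = ADAPTION

    def observe(self, squared_error):
        if self.mode == ADAPTION:
            self.recent.append(squared_error)
            if len(self.recent) == self.recent.maxlen:
                mse = sum(self.recent) / len(self.recent)
                if self.lower <= mse <= self.upper:
                    self.mode = DETECTION
        return self.mode


# Toy error stream: large errors at first, then errors inside the band.
controller = ModeController(window=5)
errors = [0.5, 0.4, 0.3, 0.03, 0.03, 0.03, 0.03, 0.03]
modes = [controller.observe(e) for e in errors]
```

The controller stays in adaption mode until the whole window consists of errors inside the band, so a single good prediction is not enough to trigger the switch.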

Our test case uses human-generated sound wave data containing 3600 records, each with 4 frequency features and 1 target output named Y. A second dataset, called "Sound-Wave-Outlier", was created by randomly increasing or decreasing the target values of Y by 20% ~ 25% for the records numbered 100 ~ 300, 600 ~ 800, 1100 ~ 1500, 2100 ~ 2500, 2800 ~ 3000 and 3400 ~ 3600. Initially, the OD component is supplied with the normal sound wave data. After a couple of cycles, once the OD component has adapted to the model of the data, the "Sound-Wave-Outlier" data replaces the original data, and the OD component is reconfigured from Adaption mode to Detection mode.
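The construction of the "Sound-Wave-Outlier" dataset can be sketched as follows. The record ranges and the 20% to 25% perturbation come from the description above; the 1-based record numbering, the inclusive range ends, the uniform draw and the function name are assumptions made for illustration.

```python
import random

# Record-number ranges taken from the text; treated here as 1-based and inclusive.
OUTLIER_RANGES = [
    (100, 300), (600, 800), (1100, 1500),
    (2100, 2500), (2800, 3000), (3400, 3600),
]


def inject_outliers(targets, ranges=OUTLIER_RANGES, low=0.20, high=0.25, seed=0):
    """Return a copy of targets with Y raised or lowered by 20% to 25% in the ranges."""
    rng = random.Random(seed)
    perturbed = list(targets)
    for start, end in ranges:
        for record_no in range(start, end + 1):
            index = record_no - 1           # convert record number to list index
            factor = rng.uniform(low, high)  # size of the shift
            sign = rng.choice((-1, 1))       # increase or decrease
            perturbed[index] = targets[index] * (1 + sign * factor)
    return perturbed


# Constant placeholder targets; the real sound wave values are not in the source.
normal_y = [10.0] * 3600
outlier_y = inject_outliers(normal_y)
changed = sum(1 for a, b in zip(normal_y, outlier_y) if a != b)
```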

The figure below shows the predicted values, displayed in yellow, that are expected to be received based on the features supplied to the model. It also shows the actual values received from the outlier data streaming into the OD component, displayed in green.

Real-time MSE values are calculated for every single point received, and these values are used to detect outliers. If the MSE value exceeds the pre-configured threshold, the component raises an alert. If the MSE value does not exceed the threshold, no indicator is marked.
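The threshold rule can be sketched as a per-point comparison between the squared error and the configured limit. The function name, the threshold value and the toy numbers below are hypothetical; the source does not give the threshold used in the demo.

```python
def detect_outliers(actual, predicted, threshold):
    """Return (index, squared_error, is_outlier) for every received point."""
    results = []
    for index, (a, p) in enumerate(zip(actual, predicted)):
        squared_error = (a - p) ** 2
        results.append((index, squared_error, squared_error > threshold))
    return results


# Toy values for illustration only.
results = detect_outliers(
    actual=[1.0, 1.0, 2.0],
    predicted=[1.0, 1.1, 1.0],
    threshold=0.05,
)
alerts = [index for index, _, is_outlier in results if is_outlier]
```

Only the third point is flagged here, because its squared error of 1.0 exceeds the threshold while the small error on the second point does not.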

The figure below shows the MSE values plotted for the demo, with the outlier-detected indicator shown in a separate graph on the right-hand side.

Case Study : Human Activities Recognition (HAR)

Implementing the Human Activities Recognition (HAR) use case on the HS-RAP platform took just 3 days. First, we pre-collected the human activity features (e.g. X, Y, Z acceleration, velocity, skin temperature and heart rate) from 5 selected persons, named Andy, Ben, Chris, Daniel and Ether. The subjects wore an IoT device that transmits a series of readings via a Kafka queue, and the readings are stored in a data file. We labelled each set of collected data with one of the 5 person names. Each set of data contains feature values belonging to 6 specific activities performed by the subject. The activities are WALKING, SITTING, WALKING UPSTAIRS, STANDING, WALKING DOWNSTAIRS and LAYING. These activities are represented by the values 1 to 6 in the dataset. The collected dataset was divided into 70% for training and 30% for testing.
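A 70/30 split of this kind can be sketched as below. The source does not say whether the records were shuffled before splitting; the seeded shuffle, the function name and the placeholder records are assumptions.

```python
import random


def train_test_split(records, train_fraction=0.7, seed=42):
    """Shuffle a copy of the records and cut it into training and testing parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    split_index = round(len(shuffled) * train_fraction)
    return shuffled[:split_index], shuffled[split_index:]


# Placeholder records; each would be one labelled feature vector in practice.
records = list(range(100))
train_set, test_set = train_test_split(records)
```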

The figure below summarizes the representation of the training and testing datasets. In summary, the training dataset consists of motion features captured from each subject. A total of 562 features are captured, across a total of 1609 recordsets. Each recordset is labelled with an activity label and stored in the data file of the respective subject.

The implementation uses the Classification component to learn the training dataset. The training dataset belonging to a subject is supplied to the Classification component to train a model, and the model is saved in HDFS under the subject ID. In other words, a specific Classification model is used for each subject. For example, Andy has his own classification model, trained with his personal dataset, and Ben has his own too. Thus, at the end of the training phase, each subject has their own HAR model, supported by the ML Classification component.

During the experiment, the data belonging to all subjects is streamed into the Kafka input pipeline to simulate hand-held IoT data being captured from different users anywhere and at any time. Each recordset carries the subject ID but no activity label. The figure below shows the results of recognizing the activities performed by each subject, compared with the actual activities captured in the testing dataset.
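The idea of one model per subject, selected by the subject ID carried in each recordset, can be sketched as follows. A tiny nearest-centroid classifier stands in for the Spark MLlib classifiers, and an in-memory dictionary stands in for the models saved in HDFS; both substitutions, along with all names and the toy two-feature data, are assumptions for illustration.

```python
from collections import defaultdict


class NearestCentroidClassifier:
    """Stand-in classifier: predicts the label whose feature centroid is closest."""

    def fit(self, features, labels):
        sums = {}
        counts = defaultdict(int)
        for row, label in zip(features, labels):
            if label not in sums:
                sums[label] = [0.0] * len(row)
            sums[label] = [s + x for s, x in zip(sums[label], row)]
            counts[label] += 1
        self.centroids = {
            label: [s / counts[label] for s in total]
            for label, total in sums.items()
        }
        return self

    def predict(self, row):
        def squared_distance(centroid):
            return sum((x - c) ** 2 for x, c in zip(row, centroid))
        return min(self.centroids, key=lambda lbl: squared_distance(self.centroids[lbl]))


def train_per_subject(training_sets):
    """Train one model per subject ID (stand-in for saving models by ID)."""
    return {
        subject_id: NearestCentroidClassifier().fit(features, labels)
        for subject_id, (features, labels) in training_sets.items()
    }


def recognize(models, record):
    """Route an unlabelled record to its subject's model and return the activity."""
    return models[record["subject_id"]].predict(record["features"])


# Toy data: labels 1 (WALKING) and 6 (LAYING) with two made-up features.
training_sets = {
    "andy": ([[0.0, 0.1], [0.1, 0.0], [1.0, 0.9], [0.9, 1.0]], [1, 1, 6, 6]),
    "ben": ([[1.0, 1.0], [0.0, 0.0]], [1, 6]),
}
models = train_per_subject(training_sets)
andy_activity = recognize(models, {"subject_id": "andy", "features": [0.05, 0.05]})
ben_activity = recognize(models, {"subject_id": "ben", "features": [0.05, 0.05]})
```

In this toy data the same feature vector maps to different activities for the two subjects, which is the motivation for keeping a separate model per person.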

The figure below zooms in on the result produced for a specific subject, Andy. The upper graph plots the known activities captured in the testing dataset. It plots 1 if the activity label in the recordset is 1, which indicates that the features of the recordset correspond to the subject WALKING. An activity label of 6 means the subject is LAYING. The lower graph plots the label returned by Andy's HAR component. Based on the same set of features in the recordsets of the testing data, the HAR component returns the respective label class. The result shows that the Actual graph pattern is similar to the Prediction graph pattern. This indicates that Andy's HAR component performs well in recognizing Andy's activities.