Big Data Stream Mining with Online Learning – Support Vector Machines

Immense data streams from various origins help businesses make important data-driven decisions, increase profits, and harness new opportunities. Hence, more enterprises need to apply online machine learning to react directly to events in Big Data streams.

We discuss how Online Learning Support Vector Machines (SVMs) differ from ordinary offline SVMs in data stream mining and how they can be used for Big Data analysis.

Information and data are moving at an unprecedented pace
The Big Data industry is growing faster than ever in this technological era where everything and anything requires precise information. Source: Stephan Henning

Introduction

It has become more important than ever to understand how Big Data can be valuable if used properly. The amount of data created each year keeps growing, and organisations are looking for more efficient ways to understand it. Hence the question is:

“How can businesses leverage data to make valuable decisions?”

  • Expense Reduction: Big data tools and cloud analytics bring significant cost benefits when it comes to storing massive amounts of data. They help to guide and distribute data efficiently.

The above are just some of the advantages that organisations can gain from the power of Big Data. With Machine Learning, however, the impact of data becomes far greater.

Utilising Big Data properly will spur growth in an organisation and in turn bring more profits. Source: SpaceX

Large-scale computations over existing datasets, such as those needed to train a Neural Network, are often impractical with a traditional single-machine approach. The workaround is distributed computing with Big Data technologies like Apache Mahout, Spark, or R-Hadoop, whose output is fed to Machine Learning algorithms. This is where Machine Learning meets Big Data.

Traditionally, a model such as a Neural Network is trained on datasets sampled from a pool of data and then deployed into production. This is called offline learning. In Big Data, however, data is rarely constant; its patterns and trends change continuously.

Online Learning in Support Vector Machines

Offline Learning SVM

SVM is a supervised Machine Learning algorithm that is used in many classification and regression problems. It remains one of the most widely used robust prediction methods and can be applied to many use cases involving classification.

SVM works by finding an optimal separating boundary, called a ‘hyperplane’, between the classes in a classification problem. For linearly separable data, the optimal hyperplane is the one that leaves the largest margin to the nearest training points of each class, and training the SVM algorithm means finding that hyperplane.
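To make the idea of an "optimal" hyperplane concrete, here is a minimal sketch in plain Python. It does not train an SVM; it only compares two hand-picked hyperplanes on hypothetical toy data and shows that both separate the classes while one leaves a wider margin. The data points, the names `wide` and `narrow`, and the helper functions are assumptions made for illustration.

```python
import math

def decision_value(w, b, x):
    """Signed score w.x + b; its sign gives the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(w, b, x):
    """Return +1 for class A and -1 for class B."""
    return 1 if decision_value(w, b, x) >= 0 else -1

def margin(w, b, points):
    """Distance from the hyperplane w.x + b = 0 to the closest point."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(abs(decision_value(w, b, x)) / norm for x in points)

# Hypothetical toy data, not taken from the article.
class_a = [(2.0, 2.0), (3.0, 3.0)]   # label +1
class_b = [(0.0, 0.0), (-1.0, 0.0)]  # label -1
points = class_a + class_b

# Two hand-picked hyperplanes (w, b); both separate the classes.
wide = ((1.0, 1.0), -2.0)    # the line x1 + x2 = 2
narrow = ((1.0, 1.0), -3.5)  # the line x1 + x2 = 3.5

for name, (w, b) in (("wide", wide), ("narrow", narrow)):
    labels = [predict(w, b, x) for x in points]
    print(name, labels, round(margin(w, b, points), 3))
```

Both lines classify every point correctly, but an SVM would prefer the one with the larger margin, since it is less likely to misclassify new points that fall close to the boundary.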

Offline SVM has two types of classifiers:

  • Linear Classifier: A straight-line function can be drawn to separate all the items in class A from those in class B.
A straight red line (hyperplane) can be optimised to differentiate items in Class A and Class B
  • Non-Linear Classifier: The original feature space is mapped to a higher-dimensional space where the training set is separable, using a special kernel function.
The red boundary is produced by the RBF kernel function, which is influenced by certain parameters. Source: Chris Albon
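The mapping to a higher-dimensional space can be illustrated with a small sketch. It assumes XOR-style toy data, which no straight line separates in two dimensions, and a hypothetical explicit feature map that adds the product of the two inputs as a third feature. The hyperplane in the mapped space is hand-picked rather than trained. The `rbf_kernel` function shows the standard RBF formula, where `gamma` is one of the parameters that shapes the boundary.

```python
import math

def feature_map(x):
    """Explicit map from 2-D to 3-D: add the product x1 * x2 as a feature."""
    x1, x2 = x
    return (x1, x2, x1 * x2)

def linear_predict(w, b, z):
    """Linear classifier applied in the mapped space."""
    return 1 if sum(wi * zi for wi, zi in zip(w, z)) + b >= 0 else -1

def rbf_kernel(u, v, gamma=1.0):
    """RBF kernel exp(-gamma * ||u - v||^2); gamma controls its width."""
    sq_dist = sum((ui - vi) ** 2 for ui, vi in zip(u, v))
    return math.exp(-gamma * sq_dist)

# XOR-style toy data: no straight line in 2-D separates the two classes.
xor_data = [((0, 0), -1), ((1, 1), -1), ((0, 1), 1), ((1, 0), 1)]

# Hand-picked (not trained) hyperplane in the 3-D mapped space.
w, b = (1.0, 1.0, -2.0), -0.5
predictions = [linear_predict(w, b, feature_map(x)) for x, _ in xor_data]
print(predictions == [y for _, y in xor_data])

# A larger gamma makes the kernel more local: similarity decays faster.
for gamma in (0.1, 1.0, 10.0):
    print(gamma, rbf_kernel((0, 0), (1, 1), gamma))
```

A kernel such as RBF achieves the same effect without ever computing the mapped features explicitly: it only needs the similarity between pairs of points.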

In offline SVM, the algorithm is trained on data that is not continuously changing. But what if the data arrives as a real-time stream whose patterns keep changing? This is where online learning SVM is important.

Online Learning SVM

In online learning SVM, the assumption is that the algorithm is not trained once on a single sample of data, but is trained continuously as real-time observations arrive.

One of the most popular online learning SVM algorithms is LaSVM. LaSVM is an online kernel classifier published by Bordes et al. in 2005 that applies the workings of Support Vector Machines to data arriving as a stream.

The algorithm addresses the same Quadratic Programming problem as a traditional SVM, but solves it approximately and online: examples are processed one at a time, typically in a single sequential pass over the data.

The dynamic hyperplane is retrained and adjusts itself to the incoming data
LaSVM classifies the continuous Big Data stream robustly, with a dynamic hyperplane.

When real-time data is fed into LaSVM continuously, the algorithm predicts the label of each new sample using the model as trained up to that point in time.

It then updates its hyperplane, if necessary, based on the newly inserted samples. This characteristic makes LaSVM suitable for dealing with big streaming data.
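The predict-then-update loop can be sketched with a much simpler online kernel classifier. The example below is a kernel perceptron, not LaSVM: it only stores a sample when the current model fails to put it on the correct side of the boundary, and it never revisits earlier samples. The class name `OnlineKernelPerceptron`, the helper `run_stream`, and the XOR-style stream are assumptions made for illustration.

```python
import math

def rbf_kernel(u, v, gamma=1.0):
    """RBF kernel exp(-gamma * ||u - v||^2)."""
    sq_dist = sum((ui - vi) ** 2 for ui, vi in zip(u, v))
    return math.exp(-gamma * sq_dist)

class OnlineKernelPerceptron:
    """Online kernel classifier; a simpler stand-in, NOT LaSVM itself."""

    def __init__(self, gamma=1.0):
        self.gamma = gamma
        self.support = []  # stored (sample, label) pairs

    def score(self, x):
        return sum(y * rbf_kernel(s, x, self.gamma) for s, y in self.support)

    def predict(self, x):
        return 1 if self.score(x) > 0 else -1

    def update(self, x, y):
        """Store the sample unless it is already on the correct side."""
        if y * self.score(x) <= 0:
            self.support.append((x, y))
            return True
        return False

def run_stream(model, stream):
    """Predict each sample first, then learn from its true label."""
    mistakes = 0
    for x, y in stream:
        if model.predict(x) != y:
            mistakes += 1
        model.update(x, y)
    return mistakes

# Hypothetical stand-in for a live data stream: XOR points repeated.
xor_points = [((0, 0), -1), ((1, 1), -1), ((0, 1), 1), ((1, 0), 1)]
stream = xor_points * 5

model = OnlineKernelPerceptron(gamma=1.0)
mistakes = run_stream(model, stream)
print(mistakes, len(model.support))
```

The model is never retrained from scratch; each sample is seen once, and the boundary changes only when a sample requires it. A kernel perceptron never discards stored samples, whereas practical online SVMs such as LaSVM also re-examine and prune them to keep the model compact.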

LaSVM can be used in a real-time setup where the model is given a continuous stream of fresh examples, which the online iterations process as they arrive. Online learning SVM also has further advantages for current Machine Learning applications.

How is it different?

Although both Offline and Online SVM can be used in many applications, the choice depends on the type of data being supplied. Data that stays constant suits Offline learning, whereas continuously changing data is best suited to Online SVM. Below is a detailed comparison table between Online and Offline SVM.

Data Types

Features distinguishing both Online and Offline Learning SVM
Online SVM can handle stationary as well as non-stationary data, whose patterns change continuously.
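The difference on non-stationary data can be shown with a small experiment. The sketch below uses a linear classifier trained by stochastic gradient descent on the regularised hinge loss (the loss a linear SVM minimises); this is an assumption chosen for brevity and is not LaSVM. The toy points, the learning rate `lr`, the regularisation strength `lam`, and the abrupt label flip that simulates a change in pattern are all hypothetical.

```python
def score(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(model, x):
    w, b = model
    return 1 if score(w, b, x) >= 0 else -1

def sgd_step(model, x, y, lr=0.1, lam=0.01):
    """One regularised hinge-loss SGD step; returns the new (w, b)."""
    w, b = model
    violated = y * score(w, b, x) < 1
    w = [(1 - lr * lam) * wi for wi in w]  # weight decay (regularisation)
    if violated:  # sample is inside the margin or misclassified
        w = [wi + lr * y * xi for wi, xi in zip(w, x)]
        b += lr * y
    return (w, b)

def accuracy(model, data):
    return sum(predict(model, x) == y for x, y in data) / len(data)

points = [(2.0, 1.0), (-1.0, -2.0), (1.0, 2.0), (-2.0, -1.0)]
before = [(x, 1 if x[0] + x[1] > 0 else -1) for x in points]
after = [(x, -y) for x, y in before]  # the pattern changes: labels flip

# Phase 1: both models learn the original pattern.
model = ([0.0, 0.0], 0.0)
for x, y in before * 25:
    model = sgd_step(model, x, y)
offline_model = model  # frozen: trained once, then deployed

# Phase 2: only the online model keeps learning from the stream.
online_model = model
for x, y in after * 25:
    online_model = sgd_step(online_model, x, y)

print(accuracy(offline_model, after), accuracy(online_model, after))
```

On this toy stream the frozen model keeps applying the old pattern and misclassifies the points after the change, while the online model adjusts its hyperplane and recovers. Real drift is usually more gradual than a full label flip, so the gap in practice depends on the data.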

Model Features

A comparison of features distinguishing both Online and Offline Learning SVM
Online SVM is less complex in handling data of different patterns due to its ability to re-train its algorithm when a new sample of data is given.

Final Thoughts?

The choice between Offline and Online SVM ultimately depends on the application. The above table summarises the different features both algorithms exhibit, and from there you can decide which algorithm is the most suitable one.

In general, offline learning models are much more straightforward to deploy and manage, but less adaptable to changes in the data.

Online learning models are more complex to operate: because new data is continually being pushed, the incoming stream has to be preprocessed continuously, which takes more time and cost.

Support Vector Machines are a huge area of study, with numerous books and papers on the topic. Listed below are some resources for diving deeper into the algorithm itself.

Offline Support Vector Machine (SVM)

Online Support Vector Machine (SVM)
