What is Big Data Stream Mining with Adaptive Random Forests

by Prem

Enterprises have the need to apply online machine learning to react directly to events in Big Data streams by using Big Data Stream mining. In this article, we discuss how Random Forests differs from Adaptive Random Forests and how they can be used for Big Data analysis.

Table of Contents

The growth of data is tremendous over the past decade — The data industry is growing as ever in this technological era where everything and anything requires precise information. Source: Stephan Henning

Neural Networks are insufficient for Big Data Stream mining

Neuronal Networks and Deep learning are often considered as the non plus ultra in machine learning. However, Neural Networks are often not the perfect Big Data Stream mining.

A Neural Network is trained with input data of any sample size for any task which includes classification, recognition, Natural Language Processing (NLP) and more. These computations using Neural Network mostly requires large amounts of processed data of similar pattern and type in which the model will be trained on.

But what if the pattern of data is changing continuously and in real-time?

Extensive and large calculations done over existing datasets in a Neural Network cannot be implemented over a traditional approach and when the patterns in the data are not stable, a Neural Network will likely fail. The workaround is therefore to implement Big Data Stream mining with distributed computing. Thereby, one uses using Big Data technologies like Spark Streaming, Apache Flink or similar to feed data continuously to Machine Learning algorithms for its applications.

What is Random Forest?

To understand Adaptive Random Forests, we have to first understand decision trees. Decision trees in most simple terms are basically a decision tool that uses root and branch-like models to identify possible consequences by using control statements.

How Random Forests works with multiple decision trees — Multiple forests will provide individual observations which is called Ensemble learning which combines outputs from multiple learners

Random Forests are basically the combination of multiple individual decision trees to act as an ensemble. Ensemble learning can be defined as a paradigm whereby multiple learners are trained to solve the same problem.

Ensemble learning includes multiple learning models to prediction — A single batch of data is fed into multiple models of different algorithm and the best performing ones are generalised and used for the final prediction

Ensemble learning actually has been used in several applications such as optical character recognition, medical purpose, etc. In fact, ensemble learning can be used wherever machine learning techniques can be used.

Combining multiple trees (learner) may be a better choice if the learners are performing well. In fact, Rando m Forests is currently one of the most used Machine Learning algorithms in the batch setting.

What is Adaptive Random Forest?

Big Data stream, or Online Learning classification, is similar to batch or Offline classification in the sense that both are concerned with predicting a class value of an unlabelled instance. The difference between Online learning or Adaptive vs Offline learning, however, is how the learning and predictions take place.

How does an Adaptive Random Forests work?

Similarly to Adaptive Random Forests, there are also different online algorithms which can be used for Big Data such as Online Learning Support Vector Machines (SVM) which can be adapted for real-time data streaming.

In general, for Random Fo rests, the key elements are bagging and random feature selection. Bagging stands for Bootstrap Aggregation which is a very powerful ensemble method. It combines predictions from multiple algorithms to have a more accurate prediction as stated above. In Offline or traditional Random F orests, non-streaming bagging is done.

Decision trees will output a certain prediction — In traditional Offline bagging, each sample size from the original dataset is an exact copy, M of size N where it will be fed into different Random Forests which then outputs a binomial distribution

The algorithm builds a set of M base samples with size N, training each models or learners individually taken from random samples with replacement from the original training dataset. The predictions are then produced which follows a binomial distribution. When the value of N is large, the distribution now becomes a Poisson distribution.

In Online bagging, instead of replacement, each of the M samples is given a certain random weight, W which follows a Poisson distribution

When the weights, W of a certain sample M is greater or more significant than the other, the others are then resampled accordingly. By this way, the diversity of the weights and the input space of the classifiers inside the ensemble are being modified actively to find the accurate prediction with the most number of occurence.

A random weight is applied to a sample size in online bagging — A random weight factor, W is assigned to each M sample from the dataset. The weight is actively monitored and whichever M is not performing well, it is then replaced with a new sample from the dataset. In this case, M1 performs well (larger W1) over M2, M3 and M4 hence, those will be replaced with new samples.

In Online bagging, concept drifts are ought to occur. Adaptive Windowing (ADWIN) is a popular method to detect a change of pattern in data. In relation to Online Random Forest, we can then train the classifier with the random weights, and when a change or drift is detected, the worst classifier of the ensemble of classifiers is removed and a new classifier is added to the ensemble.

Adaptive Random Forest vs Random Forest

Bagging Method

Traditional Random Forests – Multiple classifiers are trained and the most frequently seen observation from the forests are taken as the prediction

Adaptive Random Forests – Random weights, W is assigned to each classifier and depending on the performance, it is either replaced by a new sample set or retained to produce the prediction following Poison distribution (number of independent events)

Concept Drift

Traditional Random Forests – Does not take drift into account which means it will perform poorly with streaming data

Adaptive Random Forest – Takes into account pattern drifts and removes the classifiers which are not performing well to optimise ensemble model

In the End

In a general sense, Adaptive Random Forests are commonly used in Big Data applications where you would have real-time big streaming data for classification or prediction.

Traditional Random Forests, however, is more defined for non-streaming data with simpler prediction applications. The use of the Adaptive Random Forest algorithm ultimately depends on the types and characteristics of the data that we have.

Are you looking for ways to get the best out of your data?

If yes, then let us help you use your data.

Sum-up FAQ

Why are Neural Networks not suitable for Big Data Stream mining?

Extensive and large calculations done over existing datasets in a Neural Network cannot be implemented over a traditional approach and when the patterns in the data are not stable.

What are the main features of Adaptive Random Forests vs Traditional Random Forests?

In Adaptive Random Forests, random weights, are assigned to each classifier to indicate prediction performance. Adaptive Random Forests can also handle concept drifts when the data pattern is not stable.

What are ensembles in Big Data Stream mining??

Ensemble learning can be defined as a paradigm whereby multiple learners are trained to solve the same problem.

What is Bagging in Big Data Stream mining?

Bagging decreases the variance in the prediction by generating additional data, remove data and optimising it.

Everything You Need To Know About Machine Learning

Which is better – Random Forest vs Support Vector Machine vs Neural Network

A Brief Overview of Support Vector Machines (SVM)

A Simple Introduction to Online Machine Learning

Big Data Stream Mining with Online Learning – Support Vector Machines

https://www.iunera.com/kraken/big-data-science-intelligence/machine-learning-forecasting-ai/what-do-you-need-to-know-about-neural-networks-in-general/?swcfpc=1

https://www.iunera.com/kraken/open-big-data-science/big-data-curation-in-machine-learning/?swcfpc=1

A Simple Introduction to the k-Nearest Neighbour (kNN) Algorithm

Let us know your challenges or support us by sharing the article

Check iunera.com to learn more about what we do!

Categories:

Big Data Lessons

Tags:

algorithm bigdata BigDataConcept machineLearning onlinelearning onlinerandomforest

What is Big Data Stream Mining with Adaptive Random Forests

Neural Networks are insufficient for Big Data Stream mining

What is Random Forest?

What is Adaptive Random Forest?

Adaptive Random Forest vs Random Forest

Bagging Method

Concept Drift

In the End

Sum-up FAQ

Why are Neural Networks not suitable for Big Data Stream mining?

What are the main features of Adaptive Random Forests vs Traditional Random Forests?

What are ensembles in Big Data Stream mining??

What is Bagging in Big Data Stream mining?

Let us know your challenges or support us by sharing the article

Search

Recent Posts

Latest Changes

What is Big Data Stream Mining with Adaptive Random Forests

Neural Networks are insufficient for Big Data Stream mining

What is Random Forest?

What is Adaptive Random Forest?

Adaptive Random Forest vs Random Forest

Bagging Method

Concept Drift

In the End

Sum-up FAQ

Why are Neural Networks not suitable for Big Data Stream mining?

What are the main features of Adaptive Random Forests vs Traditional Random Forests?

What are ensembles in Big Data Stream mining??

What is Bagging in Big Data Stream mining?

Related Posts

Let us know your challenges or support us by sharing the article

Search

Recent Posts

Latest Changes

Categories