Random forest is an ensemble method


Imagine you are facing the most important question of your life. Answering it correctly will have a positive impact on your life; answering it incorrectly will have a negative one. But you are in luck: you may either ask an expert of your choice for advice, or you may ask an anonymous group of, say, 1,000 people. Which advice would you take? The opinion of the individual expert or the aggregated answer of a whole group of people?
Or how about a group of experts?

Ensemble learning

When a machine learning algorithm is applied to a specific problem, adequate accuracy (the rate of predictions that can be classified as correct) can often be achieved, but the reliability of a single algorithm is frequently not sufficient. Algorithms can be run with different parameters that behave differently in certain data situations. Some algorithms are prone to underfitting, others to overfitting.

If machine learning is to be developed for productive use with the best possible reliability, ensemble learning is useful. In ensemble learning, an ensemble (a collective of predictors) is formed in order to produce an ensemble average (collective mean). If, for example, some classifiers go astray on certain data inputs, other classifiers counteract this. Ensemble learning is therefore used in the hope that a group of algorithms will, on average, produce a better result than a single algorithm could.

In the following I will mostly speak of classifiers, but ensemble learning is also used for regression.

Voting classifiers (or voting regressors)

A common form - and usually the first example of an ensemble learner - is the voting classifier. The principle behind voting classifiers is an extremely easy-to-understand idea of ensemble learning and therefore probably one of the best-known forms of collective mean formation. First of all: yes, there are also voting regressors, but this is a concept that cannot do without more comprehensive aggregation at the top level, so stacking (see below) would be more useful for the purposes of more accurate regression.

A common question in data science is which classifier is better for a given purpose: decision trees, support vector machines, k-nearest neighbors or logistic regression?

Why not just use them all? In fact, this is not infrequently done in practice. The aim of this form of ensemble learning is easy to see: the different weaknesses of the individual algorithms should - so the hope goes - cancel each other out. All algorithms (this can also mean several instances of the same algorithm with different parameters, e.g. several kNN classifiers with different k-values and dimension weightings) are trained on the same problem.

At prediction time, all classifiers are either treated equally or weighted differently (although large differences in weighting are unusual and probably not useful either). According to an ensemble rule, the results of all classifiers are aggregated: in the case of classification by a majority decision, in the case of regression usually by averaging or (in the case of stacking) by another regressor.
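As a minimal sketch of how such a voting ensemble might look with Scikit-Learn (which the book recommendations below also use); the data set, the choice of classifiers and the parameter values are only illustrative assumptions:

```python
# Voting-classifier sketch: three different algorithms, one majority decision.
# Data set and hyperparameters are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

voting_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("tree", DecisionTreeClassifier(max_depth=5)),
    ],
    voting="hard",       # aggregation by majority decision
    weights=[1, 1, 1],   # equal weighting; small differences are possible
)
voting_clf.fit(X_train, y_train)
print("ensemble accuracy:", voting_clf.score(X_test, y_test))
```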

Apart from the fact that we will probably obtain better results with the ensemble classifier or regressor, we now have an additional piece of information: a measure of how confident the prediction is. In the best case all classifiers have computed the same prediction, in the worst case we have a tie. In this way we can evaluate predictions in terms of their informative value. Similarly, in the case of regression, the variance of the individual results can be used to assess how meaningful the aggregated result is.

A consideration: is a chain only as strong as its weakest link?

It is often said that ensemble learning produces better results than the weakest classifier in the group, but worse results than the best one. So is ensemble learning just an act of indecision about which classifier would actually be better?

Yes and no. Ensemble learning is indeed used in practice to compensate for individual weaknesses and to dampen outlier behavior on unfamiliar data. However, it is also the case that, over many classifications, an ensemble learner can deliver even better predictions than the best classifier in the group.

This is due to the law of large numbers, which can be illustrated with an example: with a (fair) coin toss, the probability of obtaining heads or tails is exactly 50.00%. If I toss the coin ten times, I might get heads three times and tails seven times. If I toss it 100 times, I might get heads 61 times and tails 39 times. Even getting tails only 20 times out of 100 tosses would not be entirely unlikely. If I were to toss the coin 10,000 times, however, I would come very close to 50%, and with 10 million tosses the distribution will almost certainly level off at an even split of 50.0x% for heads and tails.

Now imagine (somewhat exaggerated, but analogous to the coin tosses) an ensemble learner with a group of 10,000 classifiers. And assume that every single classifier is extremely weak, making a correct prediction with an accuracy of only 51% (i.e. hardly better than guessing). If the classifiers' errors are independent, then on average a majority of the 10,000 classifiers (namely about 51%) will be correct, and the majority decision will yield the correct prediction in the vast majority of cases.
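The effect can be illustrated with a small simulation: a sketch that assumes 10,000 independent "classifiers", each correct with a probability of 51% (real classifiers trained on the same data are of course not fully independent):

```python
# Sketch: majority vote of 10,000 independent weak "classifiers",
# each correct with probability 0.51 (independence is assumed here).
import numpy as np

rng = np.random.default_rng(42)
n_classifiers, n_predictions, p_correct = 10_000, 1_000, 0.51

# Each row is one prediction task; each column one classifier (True = correct).
votes = rng.random((n_predictions, n_classifiers)) < p_correct

# The majority decision is correct whenever more than half of the votes are.
majority_correct = votes.sum(axis=1) > n_classifiers / 2
print("accuracy of the majority vote:", majority_correct.mean())  # roughly 0.98
```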

What is true in this context: predictions via ensemble learning are inevitably slower. A lot of time can of course be saved by parallelizing the classification, but even then the ensemble is at least as slow as its slowest classifier.

Bagging

One argument against using completely different algorithms is that such an ensemble learner is difficult to understand and assess (a general problem in machine learning, by the way). Even a single algorithm (e.g. a support vector machine) can turn out quite differently in its predictions after each training run, based solely on the data selected (for training and testing).

Bagging (short for bootstrap aggregating) is an ensemble learning principle in which essentially the same algorithm is trained (and of course tested) in parallel on different divisions of the data. The data can be divided completely (the entire data set is distributed and used) or via random samples (in which case there are data points that are used multiple times, but also some that are not used at all). The particular aim is to avoid under- and overfitting in the end result. If there are many density clusters and outliers in the data, not every classifier will be able to adapt to all of them. Each instance of the classifier receives largely different data with its own outliers and density clusters; there may well be overlaps in the data division.
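In Scikit-Learn, bagging is available as a generic wrapper. The following sketch (example data and parameter values are assumptions) trains many decision trees on bootstrap samples:

```python
# Bagging sketch: the same algorithm (a decision tree) is trained
# on different bootstrap samples of the training data.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

bag_clf = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # "base_estimator" in older versions
    n_estimators=100,    # 100 trees trained independently
    max_samples=0.8,     # each tree sees 80% of the data ...
    bootstrap=True,      # ... drawn with replacement (bagging)
    n_jobs=-1,           # parallelize over all CPU cores
    random_state=0,
)
bag_clf.fit(X, y)
```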

Pasting

Pasting works almost exactly like bagging, with the small but subtle difference that the data divisions must not overlap. If a data point is randomly assigned to one classifier, it is no longer used for another classifier. Accordingly, no classifier shares training data with any other, and the classifiers are trained completely independently of one another, which can sometimes be explicitly desired. Of course, pasting assumes that enough data is available. This requirement is also an answer to many problems: how can large amounts of data be processed quickly? By dividing them across parallel nodes without overlap.
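In Scikit-Learn, pasting is commonly obtained from the same wrapper simply by sampling without replacement; note, though, that a strictly disjoint partition across all classifiers, as described above, would have to be constructed manually. A sketch with assumed data and parameters:

```python
# Pasting sketch: same wrapper as bagging, but sampling without replacement.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

paste_clf = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    max_samples=0.1,     # small subsets with little overlap
    bootstrap=False,     # without replacement = pasting
    n_jobs=-1,
    random_state=0,
)
paste_clf.fit(X, y)
```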

Random Forest

Random forests should not actually appear at this point in the text, because they are an example of the parallel ensemble, i.e. of the voting classifier with decision trees. Nevertheless, I would like to address random forests here, because they are an extremely common application of bagging or (more rarely) pasting for decision tree methods. The data set is divided up at random and a decision tree is built from each division; in addition, each tree typically considers only a random subset of the features at each split. The majority decision over the classifications of all trees is the ensemble result of the random forest.

Random forest is a classification or regression method that is already so common that it has long been implemented in (almost) all machine learning libraries and - thanks to these implementations - is no more complicated to use than a single decision tree.
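A minimal sketch with Scikit-Learn, showing that a random forest is practically a drop-in replacement for a single tree (data set and parameter values are placeholders):

```python
# Random-forest sketch: bagged decision trees with random feature subsets.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

forest = RandomForestClassifier(
    n_estimators=500,     # 500 trees trained on bootstrap samples
    max_features="sqrt",  # random feature subset at each split
    n_jobs=-1,
    random_state=0,
)
print("cross-validated accuracy:", cross_val_score(forest, X, y, cv=5).mean())
```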

Stacking

Stacking is an extension of the voting classifier or voting regressor by a higher level (the blending level), which learns the best aggregation of the individual results. At the top of the stack sits (at least) one further classifier or regressor.

Stacking is particularly useful when the results of the individual algorithms can turn out very differently, which is almost always the case with regression - since regressors predict continuous values instead of a few classes. Stacking setups can even have multiple layers, which makes them considerably more difficult to train.
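A sketch of a two-level stacking regressor in Scikit-Learn; the choice of base regressors and of the regressor at the blending level is only an illustrative assumption:

```python
# Stacking sketch: base regressors on the first level, a further regressor
# at the blending level learns how to aggregate their results.
from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.svm import SVR

X, y = make_regression(n_samples=1000, n_features=20, noise=10, random_state=0)

stack = StackingRegressor(
    estimators=[
        ("forest", RandomForestRegressor(n_estimators=100, random_state=0)),
        ("svr", SVR()),
    ],
    final_estimator=Ridge(),  # the blending-level regressor
)
stack.fit(X, y)
```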

Boosting (Sequential Ensemble Learning)

Bagging, pasting and stacking are parallel forms of ensemble learning (which does not mean that the algorithms presented as parallel are not processed sequentially in practice). Boosting, on the other hand, is necessarily carried out sequentially: weak classifiers or regressors are reinforced through iteration during their training. Boosting can therefore be seen as an alternative to deep learning. While in deep learning a strong algorithm is designed and trained in the form of a multi-layered artificial neural network to solve a complex problem (e.g. text recognition [OCR]), such challenges can also be tackled with weaker classifiers using boosting.

Boosting relates solely to training and arose out of necessity: how do we get better predictions from what is actually a weak learning algorithm that tends towards underfitting? Boosting is an answer to this challenge in classification or regression, in which an algorithm is trained iteratively, i.e. in several runs, by adapting weights.

One of the best-known boosting methods is AdaBoost. The first step is normal training. The subsequent testing reveals the classification or regression errors. The incorrectly predicted data points are then weighted higher for the next run. This iteration is repeated a few times until the error rate no longer improves.
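A minimal AdaBoost sketch in Scikit-Learn; the use of decision stumps as weak learners and all parameter values are assumptions for illustration:

```python
# AdaBoost sketch: decision stumps are trained sequentially; misclassified
# samples receive higher weights in the next iteration.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # weak learner (stump)
    n_estimators=200,
    learning_rate=0.5,
    random_state=0,
)
ada.fit(X, y)
```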

With AdaBoost, incorrectly predicted data records are weighted higher in the next run. In an alternative boosting method, gradient boosting (based on the gradient method), each new predictor is instead fitted to the residual errors of its predecessors, i.e. the model is adjusted in the direction of the negative gradient of the prediction error.

What the random forest is among voting classifiers - several decision trees working in parallel - the gradient boosted trees are in boosting: each individual tree can only describe part of the data accurately, but the sequential chaining of the trees masters even challenging classifications.

To stay with the example of decision trees: both random forests and gradient boosted trees basically work with shallow trees (weak classifiers). Through iterative reinforcement, gradient boosted trees can generally achieve a higher prediction accuracy than random forests, provided the feature and parameter selection is sensible from the start. Random forests, on the other hand, are more robust with respect to feature and parameter selection; their trees do not reinforce one another, and the end result is only as good as the majority of the trees.
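For comparison, a gradient-boosted-trees sketch with Scikit-Learn, again with shallow trees as weak learners; all parameter values are assumptions:

```python
# Gradient-boosted trees sketch: each shallow tree is fitted to the
# residual errors (negative gradient of the loss) of the previous trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

gbt = GradientBoostingClassifier(
    n_estimators=300,
    max_depth=3,         # shallow trees = weak learners
    learning_rate=0.1,   # shrinks each tree's contribution
    random_state=0,
)
gbt.fit(X, y)
```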

Book recommendations

Would you like to learn more about machine learning and ensemble learning? The following two book recommendations not only offer explanations, but also demonstrate ensemble learning with example code in Python and Scikit-Learn.

Benjamin Aunkofer

Benjamin Aunkofer is lead data scientist at DATANOMIQ and university lecturer for data science and data strategy. In addition, he works as Interim Head of Business Intelligence and gives seminars / workshops on BI, data science and machine learning for companies.
