Ensemble learning is the use of multiple predictive models to arrive at a single prediction, based on a collective decision by all the models in the ensemble. It's a common and popular technique in predictive modelling, especially when individual models fail to reach the required level of performance in terms of, say, accuracy.
Ensemble learning is often introduced towards the end of Data Science 101-type content, with the emphasis on implementation rather than on the underlying reason for its success. It's also something I get asked about often.
In this post I will walk through a simple statistical treatment to illustrate why ensemble learning works, along with one important catch that most data scientists neglect.
Binary classifiers as biased coin flips
Consider a binary classifier c1 on a yes/no classification problem. Being a reasonably constructed classifier, c1 has an accuracy of 60%. This means that the probability of c1 giving the correct prediction is 0.60, like a flip of a biased coin.
Now consider putting three such classifiers together in a democratic fashion: the ensemble gives the correct prediction if and only if two or more of the three (it doesn't matter which two) give the correct prediction.
If you do your combinations calculations correctly, you arrive at an overall ensemble accuracy of 0.648:
Pr({correct, correct, wrong}) = (0.6²)(0.4) = 0.144
Pr({correct, correct, correct}) = 0.6³ = 0.216
Accuracy of ensemble = 0.216 + (0.144)(3) = 0.648 > 0.6
(The factor of 3 appears because there are 3 different ways of getting 2 correct and 1 wrong.)
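To make this concrete, here is a minimal Python sketch (the function name majority_vote_accuracy is mine, not from any library) that computes the exact majority-vote accuracy for n independent classifiers, each correct with probability p:

```python
from math import comb

def majority_vote_accuracy(p, n):
    """Probability that a majority of n independent classifiers,
    each correct with probability p, votes for the right answer."""
    k_min = n // 2 + 1  # smallest winning number of correct votes (n odd)
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(k_min, n + 1))

print(majority_vote_accuracy(0.6, 3))    # 0.648
print(majority_vote_accuracy(0.6, 11))   # ≈ 0.753
print(majority_vote_accuracy(0.6, 101))  # ≈ 0.98
```

Note how the accuracy keeps climbing as n grows, which leads to the general claim below.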
Generally, as the number of models in the ensemble increases, so does the accuracy of the ensemble. However, this result holds only if the individual classifiers are independent of one another - something most data scientists fail to understand or appreciate. Consider this next piece of math.
(In reality, true independence is hard to either attain or assess, so we settle with low or zero correlation.)
Binary classifiers as Bernoulli trials
Every time we ask our binary classifier c1 for a prediction, we are essentially conducting a Bernoulli trial with:
E(c1) = p
Var(c1) = p(1-p)
Putting together our ensemble of 3 independent classifiers again:
ens. = ⅓(c1 + c2 + c3)
E(ens.) = p (unbiased)
Var(ens.) = ⅓p(1-p) < p(1-p) = Var(c1)
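A quick NumPy simulation (a sketch; the seed and trial count are arbitrary choices of mine) confirms both the single-classifier variance and the threefold variance reduction:

```python
import numpy as np

rng = np.random.default_rng(42)
p, n_trials = 0.6, 1_000_000

# Three independent classifiers: 1 = correct prediction, 0 = wrong
c1, c2, c3 = (rng.random((3, n_trials)) < p).astype(float)
ens = (c1 + c2 + c3) / 3

print(c1.var())    # ≈ p(1-p) = 0.24
print(ens.mean())  # ≈ p = 0.6 (unbiased)
print(ens.var())   # ≈ p(1-p)/3 = 0.08
```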
With this, it's clear why the independence or negligible correlation condition is necessary - otherwise:
Var(ens.) = ⅓p(1-p) + (2/9)[Cov(c1,c2) + Cov(c1,c3) + Cov(c2,c3)]
(each pairwise covariance appears twice when the variance of the sum is expanded, hence the factor of 2/9)
With the additional pairwise covariance terms, it is no longer guaranteed that
Var(ens.) < Var(c1)
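To see the covariance terms bite, here is a sketch where the three classifiers usually copy a single shared coin flip instead of flipping independently (the 0.8 copying probability is an arbitrary choice of mine; each model on its own is still 60% accurate):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n_trials = 0.6, 1_000_000

# With probability 0.8 all three classifiers copy one shared coin flip;
# otherwise each flips its own coin. Marginals stay Bernoulli(p).
shared = (rng.random(n_trials) < p).astype(float)
own = (rng.random((3, n_trials)) < p).astype(float)
copy = rng.random(n_trials) < 0.8
preds = np.where(copy, shared, own)  # broadcasts to shape (3, n_trials)

ens = preds.mean(axis=0)
print(preds[0].var())  # ≈ 0.24, same per-model variance as before
print(ens.var())       # ≈ 0.18, nowhere near the independent case's 0.08
```

Each model is still a 60% classifier on its own, but the ensemble's variance barely drops.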
Without going through the math again, these results carry over to regression problems without loss of generality.
What does this mean and what can I do with it
Clearly, we need our ensemble to be reliable and not wobble all over the place with high prediction variance. It's intuitive why the negligible correlation condition makes sense - correlated models would more often than not support each other and make the same yes/no predictions simultaneously, even when the given test case could jolly well be in the grey zone.
In addition, it should be clear by now that there's not much use in assembling strong learners into an ensemble - they are likely to be accurate per se, and thereby correlated with each other on the test cases, so the ensemble gains little to no variance reduction over a single model. On the other hand, putting together a bunch of weak learners makes sense because they are likely to be less correlated amongst each other.
Finally, the next time someone presents an ensemble learning approach, ask if they ever considered the correlations amongst the underlying models. Odds are they would give you a blank look, unsure why that's necessary :P
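Checking is cheap. One way (a sketch; the preds array here is dummy data standing in for each model's held-out predictions) is to inspect the pairwise correlation matrix of the models' prediction vectors:

```python
import numpy as np

rng = np.random.default_rng(1)

# Dummy stand-in: rows = models, columns = 0/1 predictions on a
# held-out test set. Replace with your ensemble's actual predictions.
preds = (rng.random((3, 1000)) < 0.6).astype(float)

# Large off-diagonal entries mean the models mostly agree and there is
# little variance reduction to be had from ensembling. (With this
# independent dummy data, the off-diagonals will be close to 0.)
print(np.corrcoef(preds).round(2))
```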
(If you are interested in learning more about ensemble learning and how it works in algorithms like random forests, feel free to take a look at this repo on my GitHub.)