Why ensemble modelling works so well - and one often neglected principle

Category: [Machine learning & Statistics]

2018/12/25

4min read

Putting models together in an ensemble learning fashion is a popular technique amongst data scientists

Ensemble learning is the simultaneous use of multiple predictive models to arrive at a single prediction, based on a collective decision made by all models in the ensemble. It's a common and popular technique in predictive modelling, especially when individual models fail to reach the required performance level in terms of, say, accuracy.

Ensemble learning is often introduced towards the end of Data Science 101-type content, usually with an emphasis on implementation rather than on the underlying reason for its success. Why it works is also a question I get asked often.

In this post I will conduct a simple statistical treatment to illustrate why ensemble learning works, and one important catch that most data scientists neglect.

Binary classifiers as biased coin flips

Consider a binary classifier c1 on a yes/no classification problem. Being a reasonably constructed classifier, c1 has an accuracy of 60%. This means that the probability of c1 giving the correct prediction is 0.60, like a flip of a biased coin.

Now consider putting three such classifiers together in a democratic fashion. The set of classifiers (the ensemble) gives the correct prediction if and only if at least two of the three (it doesn't matter which two) give the correct prediction.

If you work through the combinatorics, you arrive at an overall accuracy of 0.648 for the ensemble:

Pr({correct, correct, wrong})   = (0.6²)(0.4) = 0.144
Pr({correct, correct, correct}) = 0.6³ = 0.216

Accuracy of ensemble = 0.216 + (0.144)(3) = 0.648 > 0.6
(3 times because there are 3 different ways of getting 2 correct, 1 wrong.)
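
As a quick sanity check, here's a short Python sketch (my own, not part of the original post) that reproduces the 0.648 figure and generalises it to any odd-sized ensemble of independent classifiers via the binomial distribution:

from math import comb

def majority_vote_accuracy(p, n):
    """P(a majority of n independent classifiers, each correct with prob. p, is correct)."""
    k_min = n // 2 + 1  # smallest possible majority, assuming n is odd
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

print(majority_vote_accuracy(0.6, 3))   # ≈ 0.648
print(majority_vote_accuracy(0.6, 11))  # ≈ 0.753

As expected, stacking more independent 60%-accurate classifiers keeps pushing the majority vote's accuracy up.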

Generally, as the number of models in the ensemble increases, so does the accuracy of the ensemble. However, this result holds only if the individual classifiers are independent of each other - something many data scientists fail to understand or appreciate. Consider this next piece of math.

(In reality, true independence is hard to either attain or assess, so we settle with low or zero correlation.)

Binary classifiers as Bernoulli trials

Every time we ask our binary classifier c1 for a prediction, we are essentially conducting a Bernoulli trial with:

E(c1) = p
Var(c1) = p(1-p)

Putting together our ensemble of 3 independent classifiers again:

ens. = ⅓(c1 + c2 + c3)
E(ens.) = p (unbiased)
Var(ens.) = (1/9)[Var(c1) + Var(c2) + Var(c3)] = ⅓p(1-p) < p(1-p) = Var(c1)
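
Here's a minimal numeric check of this result, assuming independence and p = 0.6 as before (the simulation is mine, not from the post):

import numpy as np

rng = np.random.default_rng(0)
p, n_trials = 0.6, 1_000_000

# Three independent "classifiers": 1 = correct prediction, 0 = wrong.
c1, c2, c3 = (rng.random((3, n_trials)) < p).astype(float)
ens = (c1 + c2 + c3) / 3

print(c1.var())   # ≈ p(1-p)     = 0.24
print(ens.var())  # ≈ p(1-p) / 3 = 0.08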

With this, it's clear why the independence or negligible correlation condition is necessary - otherwise:

Var(ens.) = ⅓p(1-p) + (2/9)[Cov(c1,c2) + Cov(c1,c3) + Cov(c2,c3)]
(all pairwise covariances, each appearing twice in the expansion)

With the additional pairwise covariance terms, the variance reduction shrinks as the correlations grow, and it is no longer guaranteed that

Var(ens.) < Var(c1)
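
To make the effect concrete, here's a small simulation (again my own illustration, not from the post) in which each classifier copies a shared draw with probability rho and otherwise decides independently - a crude way to inject positive pairwise correlation:

import numpy as np

rng = np.random.default_rng(0)
p, rho, n_trials = 0.6, 0.8, 1_000_000

shared = (rng.random(n_trials) < p).astype(float)     # common Bernoulli(p) draw
own = (rng.random((3, n_trials)) < p).astype(float)   # independent draws
copies = rng.random((3, n_trials)) < rho              # who copies the shared draw
clf = np.where(copies, shared, own)                   # three positively correlated classifiers

ens = clf.mean(axis=0)
print(clf[0].var())  # ≈ 0.24, each classifier on its own looks unchanged
print(ens.var())     # ≈ 0.18 here, well above the 0.08 of the independent ensemble

The individual classifiers are just as good in isolation; only the correlation between them has changed, and most of the variance reduction is gone.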

Without going through the math again, these results carry over to regression problems with no loss of generality.

What does this mean and what can I do with it

Clearly, we need our ensemble to be reliable and not wobble all over the place with high prediction variance. It's intuitive why the negligible correlation condition makes sense - correlated models would more often than not support each other and make the same yes/no predictions, even when the given test case could jolly well be in the grey zone.

In addition, it should be clear by now that there's not much to gain from assembling strong learners in an ensemble - being accurate on their own, their predictions are likely to be highly correlated with each other on the test cases, so the ensemble buys you little extra variance reduction for the added complexity. On the other hand, putting together a bunch of weak learners makes sense because they are likely to be less correlated amongst themselves.

Finally, the next time someone presents an ensemble learning approach, ask if they ever considered the correlations amongst the underlying models. Odds are they would give you a blank look and wonder why that's necessary :P
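
If you want to do better than a blank look, the check itself is cheap. Here's a rough sketch (mine, not from the post or its repo; the dataset and models are just placeholders) of how you might eyeball the pairwise correlations of your models' held-out predictions before committing to an ensemble:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

models = [
    LogisticRegression(max_iter=1000),
    RandomForestClassifier(n_estimators=200, random_state=0),
    GradientBoostingClassifier(random_state=0),
]
preds = np.vstack([m.fit(X_tr, y_tr).predict(X_val) for m in models])

# Pairwise correlation of the yes/no predictions on the validation set:
# off-diagonal values close to 1 mean the models mostly agree, so the
# ensemble buys you little variance reduction.
print(np.corrcoef(preds))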

(If you are interested in learning more about ensemble learning and how it works in algorithms like random forests, feel free to take a look at this repo on my Github.)