Most often, ensemble methods are a combination of weaker models such as decision trees with low depth, or trees that split on only a subset of the features. The main reasons why ensemble methods work better than individual models are:
- Each individual model works on some aspect (some features) of the dataset, so an ensemble is a mixture of many models, each focussed on a part of the data. You might argue: why not let one individual model work on all the features? Fair question, but when all the features are involved, each feature tends to compete with the others for importance, and the single model ends up spending its capacity arbitrating that fight between features rather than capturing the signal.
- Individual weaker models have lower model complexity, which leads to lower variance error. So when an ensemble combines them, it gives better results because of the first reason, while the variance error stays lower than that of a single model with comparable overall capability.
- The result of both points above is that the decision boundary of an ensemble is smoother, which means less overfitting, better generalisation, and hence good predictions on unseen data.
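The variance-reduction point above can be checked numerically: averaging many noisy predictors shrinks the variance of the combined prediction. This is a minimal sketch with a made-up noise model, assuming the weak models' errors are independent (which real ensembles only approximate), not an actual ensemble of trained trees:

```python
import random
import statistics

random.seed(0)

TRUE_VALUE = 1.0  # the quantity every weak model is trying to predict

def weak_prediction():
    # Hypothetical weak model: unbiased but very noisy (high variance).
    return TRUE_VALUE + random.gauss(0, 1.0)

# Spread of a single weak model's predictions
singles = [weak_prediction() for _ in range(10_000)]

# Spread of an ensemble that averages 100 weak models per prediction
ensembles = [statistics.fmean(weak_prediction() for _ in range(100))
             for _ in range(10_000)]

print(statistics.pvariance(singles))    # close to 1.0
print(statistics.pvariance(ensembles))  # close to 1.0 / 100
```

With independent errors, averaging `n` models divides the variance by `n`, which is why the ensemble's predictions cluster much more tightly around the true value.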
Analogy: to understand the reasoning, consider estimating the parameter of a biased coin.
- A coin has one parameter: the probability of coming up heads. Suppose we are told the coin is biased, with a 51% chance of landing heads.
- For the first few tosses, say on the order of tens, this bias may not show up; even out of 100 tosses, the number of heads can be well away from 51.
- If we toss the coin 1,000 times, however, we will usually be close to 510 heads and 490 tails, i.e. a majority of heads (51% against 49%). The probability of obtaining a majority of heads (# of heads > # of tails) after 1,000 tosses is around 75%, and it rises to around 97% with 10,000 tosses.
- The theory behind this phenomenon is the law of large numbers. Instead of one big run of 10,000 or 100,000 tosses, we can use a smaller ensemble (say 1,000 tosses) and still get a 75% probability of more heads than tails through a voting mechanism.
- In this analogy, each individual toss is like a weak model trained on a small-to-medium training set. An ensemble method aggregates (by majority vote) the predictions from all the weak models, which can raise the chance of a correct outcome (more heads than tails) from the 51% of one weak model to around 75% with 1,000 of them!
- Thus we have shown how an ensemble method can give better results than an individual model.
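The coin analogy is easy to verify with a quick simulation; the toss counts and the ~75% / ~97% figures come straight from the text above, while the helper names are made up for illustration:

```python
import random

random.seed(42)

def majority_heads(n_tosses, p_heads=0.51):
    # One run of n_tosses biased-coin flips; True if heads win the vote.
    heads = sum(random.random() < p_heads for _ in range(n_tosses))
    return heads > n_tosses - heads

def estimate(n_tosses, trials=1_000):
    # Fraction of runs in which heads form the majority.
    return sum(majority_heads(n_tosses) for _ in range(trials)) / trials

p_1000 = estimate(1_000)
p_10_000 = estimate(10_000)
print(p_1000)    # close to the ~75% figure above
print(p_10_000)  # close to the ~97% figure above
```

The estimates fluctuate a little from run to run (it is itself a Monte Carlo estimate), but they land near the quoted probabilities.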
A deeper explanation:
- Each individual model works on some aspect of the problem, and most likely each works on a fairly independent aspect. When all these individual predictions are combined, the ensemble is effectively working on more aspects of the problem than any single model, either because more features carry weight or because more training examples are covered.
- Each individual model on its own performs suboptimally, since it does not target the problem from all the different angles. However, an ensemble of weak individual models combined can target the problem from many more angles than any single weak model.
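The independence assumption above is what makes voting work: a majority vote only helps when the weak models make mostly independent mistakes. A toy simulation, using a made-up all-or-nothing correlation model, shows the contrast:

```python
import random

random.seed(1)

N_MODELS = 1_001   # odd, so a majority always exists
P_CORRECT = 0.51   # each weak model is right just over half the time

def ensemble_vote(correlation):
    # Hypothetical correlation model: with probability `correlation`,
    # every model copies one shared guess; otherwise each model
    # guesses independently.
    if random.random() < correlation:
        shared_correct = random.random() < P_CORRECT
        votes = N_MODELS if shared_correct else 0
    else:
        votes = sum(random.random() < P_CORRECT for _ in range(N_MODELS))
    return votes > N_MODELS // 2  # True when the majority vote is correct

def accuracy(correlation, trials=1_000):
    return sum(ensemble_vote(correlation) for _ in range(trials)) / trials

independent = accuracy(0.0)
correlated = accuracy(1.0)
print(independent)  # well above a single model's 0.51
print(correlated)   # stuck near 0.51, no better than one model
```

When the models all make the same mistakes (correlation 1.0), the vote adds nothing; with independent errors, the same 0.51-accurate models push the ensemble's accuracy well above any individual one.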