Why do you need a training set, a test set and a validation set?

Before any model is built for the problem at hand, the entire dataset exists as a single entity. One can learn from this dataset and use the resulting models to make predictions on unseen data. The ability to predict well on unseen data is called generalisation in Machine Learning terminology. If we train on the entire dataset, no data is left to measure generalisation, so a model that has overfitted, i.e. become highly adapted to the particular samples used in training, would go unnoticed. The standard solution to this problem is to divide the dataset into two non-overlapping sets called the training set and the test set.
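As a minimal sketch of such a split, the snippet below shuffles sample indices and cuts them into two disjoint lists. The function name `train_test_split_indices`, the 80/20 ratio and the seed are assumptions chosen for illustration, not something the text above prescribes.

```python
import random

def train_test_split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle the indices 0..n_samples-1 and split them into two disjoint lists."""
    indices = list(range(n_samples))
    # A seeded generator makes the split reproducible.
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_fraction))
    test_idx = indices[:n_test]    # held out, never used for training
    train_idx = indices[n_test:]   # used to fit the model
    return train_idx, test_idx

# Hypothetical dataset of 100 samples, split 80/20.
train_idx, test_idx = train_test_split_indices(100, test_fraction=0.2)
print(len(train_idx), len(test_idx))
```

Shuffling before cutting matters when the dataset is stored in some order (for example sorted by class), since a plain slice would then give two sets with different distributions.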

The test set acts as unseen data: the model is trained on the training set and evaluated on the test set. If performance is good on the training set but bad on the test set, the model is overfitting, and we try to resolve it with one of the usual remedies for overfitting. Hence the purpose of the test set is to check generalisation on unseen samples.
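The gap between training and test performance can be illustrated with a toy example. The sketch below assumes synthetic one-dimensional data with 30% label noise and a 1-nearest-neighbour classifier, which memorises its training set; the data generator and all names are hypothetical and chosen only for illustration.

```python
import random

def make_noisy_data(n, noise, rng):
    """Points x in [0, 1) labelled 1 if x > 0.5, with a fraction of labels flipped."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < noise:
            label = 1 - label  # inject label noise
        data.append((x, label))
    return data

def predict_1nn(train, x):
    """Return the label of the training point closest to x."""
    nearest = min(train, key=lambda point: abs(point[0] - x))
    return nearest[1]

def accuracy(train, samples):
    correct = sum(predict_1nn(train, x) == y for x, y in samples)
    return correct / len(samples)

rng = random.Random(0)
train_set = make_noisy_data(200, 0.3, rng)
test_set = make_noisy_data(100, 0.3, rng)

train_acc = accuracy(train_set, train_set)  # each point is its own nearest neighbour
test_acc = accuracy(train_set, test_set)    # unseen points expose the memorised noise
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

Here the training accuracy alone would suggest a perfect model; only the held-out test set reveals that the classifier has fitted the noise.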

Validation set (also known as the development or dev set): Most Machine Learning models have hyper-parameters that need to be tuned. For example, the right value of K in K-Means or K-NN has to be discovered by tuning. While training and testing, we realise that we need to change our model or its hyper-parameters. If we use the error on the test set to choose them, the model adapts itself to both the training set and the test set, so the test error is no longer an honest estimate of performance on unseen data. Hence we need another set, called the validation set, which is used to find the right model or its hyper-parameters. The purpose of the validation set is therefore to tune the hyper-parameters, while the test set is kept aside for the final evaluation.
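A minimal sketch of this workflow for K-NN follows, assuming synthetic one-dimensional data, a 300/100/100 split and a small list of candidate values for K. All names and numbers are hypothetical. K is chosen on the validation set only, and the test set is touched once at the end.

```python
import random

def make_noisy_data(n, noise, rng):
    """Points x in [0, 1) labelled 1 if x > 0.5, with a fraction of labels flipped."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < noise:
            label = 1 - label
        data.append((x, label))
    return data

def predict_knn(train, x, k):
    """Majority vote among the k training points closest to x (k odd avoids ties)."""
    neighbours = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes_for_one = sum(label for _, label in neighbours)
    return int(2 * votes_for_one > k)

def accuracy(train, samples, k):
    correct = sum(predict_knn(train, x, k) == y for x, y in samples)
    return correct / len(samples)

rng = random.Random(0)
data = make_noisy_data(500, 0.2, rng)
# The samples are generated independently, so slicing gives a random three-way split.
train_set, val_set, test_set = data[:300], data[300:400], data[400:]

candidate_ks = [1, 3, 5, 11, 21]
# Tune the hyper-parameter K using the validation set only.
val_scores = {k: accuracy(train_set, val_set, k) for k in candidate_ks}
best_k = max(candidate_ks, key=lambda k: val_scores[k])

# Report the final, untouched estimate on the test set.
test_acc = accuracy(train_set, test_set, best_k)
print(f"best K: {best_k}, validation accuracy: {val_scores[best_k]:.2f}, "
      f"test accuracy: {test_acc:.2f}")
```

Because the test set plays no part in choosing K, its accuracy remains an estimate of how the tuned model behaves on data it has never influenced.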

Follow-up question: How do you perform validation, and what are the standard ways of performing it?
