Why do you need a training set, a test set and a validation set?

Before any model is built for the problem at hand, the dataset exists as a single entity. One could learn from the entire dataset and use the resulting model to make predictions on new, unseen data. The ability to perform well on such unseen data is called generalization in Machine Learning terminology. However, if all of the available data is used for training, nothing is left to check how well the model generalizes, so an overfitted model, one highly adapted to the data it was trained on, would go unnoticed. The standard solution to this problem is to divide the dataset into two non-overlapping sets, called the training set and the test set.
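As a minimal sketch of such a split, the snippet below shuffles a toy dataset and cuts it into two non-overlapping parts. The function name `train_test_split`, the 80/20 ratio and the fixed seed are assumptions made for illustration; they are not prescribed by anything above.

```python
import random

def train_test_split(samples, test_fraction=0.2, seed=0):
    """Shuffle the samples and split them into two non-overlapping lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    shuffled = list(samples)           # copy, so the caller's data is untouched
    rng.shuffle(shuffled)
    n_test = max(1, int(round(len(shuffled) * test_fraction)))
    # First n_test shuffled items become the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]

# Toy dataset: 100 distinct sample ids stand in for real examples.
data = list(range(100))
train_set, test_set = train_test_split(data, test_fraction=0.2)
print(len(train_set), len(test_set))
```

Shuffling before cutting matters when the data is stored in some order (for example sorted by class or by date), since a plain slice would then give the two sets different distributions.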

The test set acts as unseen data: the model is trained on the training set and evaluated on the test set. If performance is good on the training set but bad on the test set, the model is overfitting, and we try to fix it using the remedies for overfitting as mentioned here. Hence the purpose of the test set is to estimate how well the model generalizes to unseen samples.
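The gap between training and test performance can be illustrated with a small, self-contained sketch. It assumes a synthetic one-dimensional dataset with 20% of the labels flipped and uses a 1-nearest-neighbour classifier, which memorizes its training data; the helper names (`make_noisy_data`, `predict_1nn`, `accuracy`) are hypothetical and chosen only for this example.

```python
import random

def make_noisy_data(n, rng, noise=0.2):
    """Points x in [0, 1) labelled 1 if x > 0.5, with some labels flipped."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < noise:       # label noise that cannot be learned
            label = 1 - label
        data.append((x, label))
    return data

def predict_1nn(train, x):
    """Return the label of the training point closest to x."""
    nearest = min(train, key=lambda p: abs(p[0] - x))
    return nearest[1]

def accuracy(train, eval_set):
    """Fraction of eval_set points whose label the 1-NN model predicts."""
    correct = sum(predict_1nn(train, x) == y for x, y in eval_set)
    return correct / len(eval_set)

rng = random.Random(42)
train_set = make_noisy_data(200, rng)
test_set = make_noisy_data(100, rng)

train_accuracy = accuracy(train_set, train_set)   # evaluated on data it has seen
test_accuracy = accuracy(train_set, test_set)     # evaluated on unseen data
print(f"train accuracy: {train_accuracy:.2f}, test accuracy: {test_accuracy:.2f}")
```

Every training point is its own nearest neighbour, so the training accuracy should come out perfect, noise included, while the test accuracy should be noticeably lower. That difference is the symptom of overfitting described above, and it is only visible because a test set was held back.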

Validation set: Most Machine Learning models have hyperparameters that need to be tuned. For example, the right value of K in K-Means or K-NN has to be found by tuning. While training and testing, we often find that we need to change the model or its hyperparameters. We could do this with the test set as well, by checking the error on the test set for each combination of model and hyperparameters. But doing so adapts the chosen model to both the training set and the test set, so the test error becomes an optimistic estimate and the model may not generalize well to new unseen data. Hence we need another set, called the validation set, which is used to find the right model and the right hyperparameters, while the test set is kept aside to evaluate the final choice. Therefore, the purpose of the validation set is to tune the hyperparameters of the Machine Learning model.
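A rough sketch of this workflow for K-NN follows. The synthetic data, the 60/20/20 split, the candidate values of K and all helper names are assumptions made for illustration. K is chosen using the validation set only, and the test set is touched once, at the very end.

```python
import random
from collections import Counter

def make_noisy_data(n, rng, noise=0.2):
    """Points x in [0, 1) labelled 1 if x > 0.5, with some labels flipped."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < noise:
            label = 1 - label
        data.append((x, label))
    return data

def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x."""
    neighbours = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

def accuracy(train, eval_set, k):
    """Fraction of eval_set points whose label K-NN predicts correctly."""
    correct = sum(knn_predict(train, x, k) == y for x, y in eval_set)
    return correct / len(eval_set)

rng = random.Random(0)
data = make_noisy_data(300, rng)   # generated in random order, so slicing is a fair split
train_set, val_set, test_set = data[:180], data[180:240], data[240:]

# Tune the hyperparameter K on the validation set only (odd values avoid tied votes).
candidate_ks = [1, 3, 5, 9, 15]
val_scores = {k: accuracy(train_set, val_set, k) for k in candidate_ks}
best_k = max(val_scores, key=val_scores.get)

# The test set is used once, to estimate how the chosen model generalizes.
test_accuracy = accuracy(train_set, test_set, best_k)
print(f"best K on validation: {best_k}, test accuracy: {test_accuracy:.2f}")
```

Had K been picked by comparing test accuracies instead, the reported test accuracy would describe the best of several tries rather than performance on truly unseen data.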

Follow-up question – How do you perform validation, and what are the standard ways of performing validation?
