What are the drawbacks of oversampling the minority class in an imbalanced class problem in machine learning?

An imbalanced dataset, also called the imbalanced class problem, poses several challenges. One of the many possible remedies is to oversample the minority class. However, oversampling can be dangerous unless the following is addressed:

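  1. Usually, we begin by splitting the entire dataset into a training set and a test set. The training set is then split further into a training set and a validation set. If this validation split is made after oversampling the original, larger training set, copies of the same minority samples land on both sides of the split. The model is then validated on data it has effectively already seen, so the validation score looks better than it should, and the result is overfitting and poor performance on the test set. The whole purpose of the validation set is to stand in for unseen data, like the test set, so that it can check the stability and generalisability of the machine learning model.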
Figure: Overfitting when oversampling is done BEFORE the train-validation split

Notice in the above figure how oversampling before the validation split leads to overfitting.
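The leakage described above can be made concrete with a small sketch. It is an illustration only and not code from the original post: the toy dataset, the helper names (`oversample`, `split`, `leaked_minority_fraction`) and the 80/20 split are assumptions, and plain random oversampling with replacement stands in for whichever oversampling method is actually used. It measures what fraction of minority rows in the validation set also have a copy in the training set.

```python
import random

rng = random.Random(0)

# Toy data (assumed for illustration): (row_id, label) pairs,
# 950 majority rows (label 0) and 50 minority rows (label 1).
data = [(i, 0) for i in range(950)] + [(i, 1) for i in range(950, 1000)]

def oversample(rows, rng):
    """Random oversampling: duplicate minority rows until classes are balanced."""
    minority = [r for r in rows if r[1] == 1]
    majority = [r for r in rows if r[1] == 0]
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return rows + extra

def split(rows, rng, val_fraction=0.2):
    """Shuffle a copy of the rows and cut off a validation portion."""
    rows = rows[:]
    rng.shuffle(rows)
    n_val = int(len(rows) * val_fraction)
    return rows[n_val:], rows[:n_val]  # (train, validation)

def leaked_minority_fraction(train, val):
    """Fraction of minority validation rows whose row_id also occurs in train."""
    train_ids = {row_id for row_id, _ in train}
    val_minority = [row_id for row_id, label in val if label == 1]
    if not val_minority:
        return 0.0
    return sum(row_id in train_ids for row_id in val_minority) / len(val_minority)

# Oversample first, then split: duplicates end up on both sides.
train_bad, val_bad = split(oversample(data, rng), rng)
leak_before = leaked_minority_fraction(train_bad, val_bad)

# Split first, then oversample only the training part.
train_raw, val_ok = split(data, rng)
train_ok = oversample(train_raw, rng)
leak_after = leaked_minority_fraction(train_ok, val_ok)

print(f"leaked minority rows, oversampling before split: {leak_before:.2f}")
print(f"leaked minority rows, oversampling after split:  {leak_after:.2f}")
```

When the split comes first, the leaked fraction is zero by construction, because the validation rows never enter the oversampling step. When oversampling comes first, most minority rows in the validation set are expected to have a twin in the training set.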


Figure: Overfitting is prevented when oversampling is done AFTER the train-validation split

For a well-regularised model, perform oversampling only on the training set, repeating it each time a new training set is drawn, and validate the model on the untouched, imbalanced validation set. The validation and test sets should resemble each other, i.e. they should follow the same probability distribution.
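One way to read "each time" is per cross-validation fold. The sketch below works under that assumption and is not the author's own code: the toy data, the 5-fold scheme and the helper names (`oversample_minority`, `minority_share`) are hypothetical, and random oversampling with replacement is used as a stand-in. Each fold's training part is rebalanced while its validation part keeps the original class ratio.

```python
import random

def oversample_minority(rows, rng):
    """Random oversampling: duplicate minority rows (label 1) until balanced."""
    minority = [r for r in rows if r[1] == 1]
    majority = [r for r in rows if r[1] == 0]
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return rows + extra

def minority_share(rows):
    """Fraction of rows carrying the minority label."""
    return sum(label for _, label in rows) / len(rows)

rng = random.Random(42)

# Toy data (assumed for illustration): 900 majority rows, 100 minority rows.
data = [(i, 0) for i in range(900)] + [(i, 1) for i in range(900, 1000)]
rng.shuffle(data)

k = 5
folds = [data[i::k] for i in range(k)]  # k disjoint folds of equal size

fold_stats = []
for i in range(k):
    val = folds[i]  # validation fold is left untouched and imbalanced
    train = [row for j in range(k) if j != i for row in folds[j]]
    train_balanced = oversample_minority(train, rng)  # redone for every fold
    # A model would be fitted on train_balanced and scored on val here.
    fold_stats.append((minority_share(train_balanced), minority_share(val)))

for train_share, val_share in fold_stats:
    print(f"train minority share: {train_share:.2f}, "
          f"validation minority share: {val_share:.2f}")
```

The training share comes out at one half in every fold, while the validation share stays close to the original 10%, which is the distribution the test set is expected to have as well.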
