An imbalanced dataset, also called the class imbalance problem, poses various challenges. One of many possible ways to address it is to oversample the minority class. However, oversampling without addressing the following issues can be dangerous:
- Usually, we begin by splitting the entire dataset into a training set and a test set. The training set is then split further into training and validation sets. If the validation split is made after oversampling the original, larger training set, copies of the same minority samples end up in both the training and validation sets. The validation score then looks better than it should, the model overfits, and performance on the test set is poor. The whole purpose of the validation set is to stand in for unseen data, like the test set, so that it can validate the stability and generalisability of the machine learning model.
Notice in the figure above how oversampling before the validation split leads to overfitting.
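To make the leakage concrete, here is a minimal sketch using only the Python standard library. It assumes simple random oversampling (duplicating minority rows) and a toy dataset of `(sample_id, label)` pairs; the function names `oversample` and `leaked_fraction` are made up for this illustration. It oversamples first, splits afterwards, and then measures how many minority rows in the validation set also have a copy in the training set.

```python
import random


def oversample(rows, rng):
    """Duplicate random minority rows (label 1) until both classes are equal in size.

    Assumes label 1 is the minority class and is present at least once.
    """
    minority = [r for r in rows if r[1] == 1]
    majority = [r for r in rows if r[1] == 0]
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return rows + extra


def leaked_fraction(n_major=90, n_minor=10, val_ratio=0.2, seed=0):
    """Fraction of minority validation rows whose sample id also appears in training."""
    rng = random.Random(seed)
    # Each row is (sample_id, label); label 1 is the minority class.
    rows = [(i, 0) for i in range(n_major)]
    rows += [(n_major + i, 1) for i in range(n_minor)]

    balanced = oversample(rows, rng)        # wrong order: oversample first ...
    rng.shuffle(balanced)
    n_val = int(len(balanced) * val_ratio)  # ... and split afterwards
    val, train = balanced[:n_val], balanced[n_val:]

    train_ids = {sample_id for sample_id, _ in train}
    val_minority = [sample_id for sample_id, label in val if label == 1]
    leaked = [sample_id for sample_id in val_minority if sample_id in train_ids]
    return len(leaked) / len(val_minority) if val_minority else 0.0


if __name__ == "__main__":
    print(f"minority validation rows also seen in training: {leaked_fraction():.0%}")
```

A high fraction here means the validation set is largely made of rows the model has already been trained on, so it no longer behaves like unseen data.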
For a well-regularised model, oversample only the training set each time a split is made, and validate the model on the untouched, imbalanced validation set. The validation and test sets should resemble each other, i.e. they should follow the same probability distribution.
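The recommended order can be sketched as follows. This is a minimal standard-library example under a few assumptions: simple random oversampling by duplication, a plain (non-stratified) random split, and a hypothetical helper name `split_then_oversample`. It returns row indices, so the validation indices always point at original, un-duplicated rows.

```python
import random
from collections import Counter


def split_then_oversample(labels, val_ratio=0.2, seed=0):
    """Split first, then oversample the training part only.

    Returns (train_indices, val_indices). Training indices may repeat;
    validation indices are left untouched and keep the original imbalance.
    """
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    rng.shuffle(indices)

    # 1. Hold out the validation rows before any resampling happens.
    n_val = int(len(indices) * val_ratio)
    val_idx, train_idx = indices[:n_val], indices[n_val:]

    # 2. Oversample every smaller class in the training part up to the
    #    size of the largest training class.
    counts = Counter(labels[i] for i in train_idx)
    target = max(counts.values())
    resampled = list(train_idx)
    for cls, n in counts.items():
        pool = [i for i in train_idx if labels[i] == cls]
        resampled += [rng.choice(pool) for _ in range(target - n)]

    rng.shuffle(resampled)
    return resampled, val_idx


if __name__ == "__main__":
    labels = [0] * 90 + [1] * 10
    train_idx, val_idx = split_then_oversample(labels)
    print("train:", Counter(labels[i] for i in train_idx))
    print("val:  ", Counter(labels[i] for i in val_idx))
```

When cross-validating, the same function would be applied separately inside each fold, so that no fold's validation rows are ever duplicated into its training rows.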