Many machine learning problems involve an imbalanced dataset. The imbalance may be a property of the problem itself or a result of how the data was collected. For example, in applications like fraud detection, fraudulent transactions are far rarer than normal ones; such problems fall into the first category. In other cases, data for one of the classes was difficult to collect, which led to the imbalance.
Any kind of imbalance poses several challenges:
- What metric to choose – Accuracy is a poor choice for an imbalanced dataset, because a model that always predicts the majority class can still score highly. Hence, one should choose the evaluation metric carefully.
- Bias – Classifiers tend to favor the majority class, which leads to biased classification output.
- Difficulty in getting more data – In applications like fraud detection or faulty machine detection, frauds or faults occur rarely, so little data is available for that class.
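To make the metric point concrete, here is a minimal sketch in plain Python. The data is made up for illustration (990 normal transactions and 10 frauds), and the `accuracy` and `recall` helpers are hypothetical names written for this example, not part of any library. The "classifier" simply predicts the majority class every time.

```python
# Hypothetical toy labels: 0 = normal transaction, 1 = fraud.
y_true = [0] * 990 + [1] * 10

# A useless baseline that always predicts the majority class.
y_pred = [0] * len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that were predicted as positive."""
    true_pos = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    false_neg = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    total_pos = true_pos + false_neg
    return true_pos / total_pos if total_pos else 0.0


print("accuracy:", accuracy(y_true, y_pred))      # 990 of 1000 correct
print("fraud recall:", recall(y_true, y_pred))    # none of the 10 frauds caught
```

The baseline is right on 990 of 1000 examples, so its accuracy looks excellent, yet it catches none of the frauds. A metric that focuses on the minority class, such as recall here, exposes the problem immediately.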
Feel free to comment with any other issues you have faced.