How do you deal with dataset imbalance in a problem like spam filtering?

Class imbalance is a very common problem when applying ML algorithms, and spam filtering is one application where it is readily apparent: a typical inbox contains far more non-spam (ham) emails than spam emails. The following approaches can be used to address the class imbalance problem.

  1. Designing an asymmetric cost function, where misclassifying a minority-class example costs more than misclassifying a majority-class example. Typically a scaling factor is applied to the loss terms belonging to the minority class and adjusted during hyperparameter tuning. Note that the evaluation metric needs to be aligned as well for the tuning to be meaningful – for instance, F1 score or AUC is a better measure than plain accuracy.
  2. Under-sampling the majority class:
    1. Randomly remove data points from the majority class.
    2. Cluster the data points and randomly remove points from large clusters.
  3. Over-sampling the minority class:
    1. SMOTE (Synthetic Minority Over-sampling Technique) is a popular method for oversampling the minority class. For a given minority sample s, one of its k nearest minority-class neighbours n is chosen at random. The difference vector n − s is multiplied by a random number between 0 and 1, and the result is added to s to give a new synthetic data point lying on the line segment between s and n. In effect, this densifies the minority class along the segments connecting nearby minority samples.
    2. Randomly duplicating minority data points. But remember that duplication does not create enough independent data points to learn complex functions; its effect is equivalent to assigning higher weight to some minority-class data points.
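The asymmetric-cost idea in approach 1 can be sketched with scikit-learn's `class_weight` parameter, which scales the loss terms per class. The dataset below is a toy example invented for illustration; in practice the weight ratio would be tuned as a hyperparameter against F1 or AUC.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, recall_score

rng = np.random.default_rng(0)
# Toy imbalanced data: 950 "ham" (label 0) vs 50 "spam" (label 1)
X = np.vstack([rng.normal(0.0, 1.0, (950, 2)),
               rng.normal(2.0, 1.0, (50, 2))])
y = np.array([0] * 950 + [1] * 50)

# class_weight="balanced" sets per-class weights inversely
# proportional to class frequency, which up-weights the loss
# terms of the minority (spam) class.
clf_plain = LogisticRegression().fit(X, y)
clf_weighted = LogisticRegression(class_weight="balanced").fit(X, y)

# Evaluate with F1 / recall on the minority class rather than accuracy.
rec_plain = recall_score(y, clf_plain.predict(X))
rec_weighted = recall_score(y, clf_weighted.predict(X))
f1_weighted = f1_score(y, clf_weighted.predict(X))
print(rec_plain, rec_weighted, f1_weighted)
```

The weighted model typically recovers more of the minority class (higher recall), usually at some cost in precision, which is why an aligned metric such as F1 matters when tuning the weight.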
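The SMOTE interpolation step described in approach 3 can be sketched in a few lines of NumPy. This is a minimal illustration of a single synthetic point, not the full algorithm (no nearest-neighbour search); in practice one would use a library implementation such as imbalanced-learn's `SMOTE`. The sample and neighbour coordinates are made-up toy values.

```python
import numpy as np

def smote_point(s, neighbours, rng):
    """Create one synthetic point from minority sample s and an array
    of its k nearest minority-class neighbours (precomputed here)."""
    n = neighbours[rng.integers(len(neighbours))]  # pick one neighbour at random
    lam = rng.random()                             # random number in [0, 1)
    return s + lam * (n - s)                       # point on the segment from s to n

rng = np.random.default_rng(42)
# A minority sample and two of its nearest minority-class neighbours (toy values)
s = np.array([1.0, 1.0])
neighbours = np.array([[2.0, 1.0],
                       [1.0, 3.0]])
synthetic = smote_point(s, neighbours, rng)
print(synthetic)  # lies on the segment between s and one of the neighbours
```

Note that the random factor scales the difference vector n − s, so the synthetic point always falls between the original sample and its neighbour, never outside the minority region they span.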
