How do you handle missing data in an ML algorithm?

There is no fixed rule for dealing with missing data, but the heuristics below are commonly used.

  1. The most common way of dealing with missing data is to remove every row that contains a missing value, provided there are not too many such rows.
  2. If more than 50-60% of the values in a specific column are missing, it is common to remove the column. The main problem with removing data in either way is that it can introduce substantial bias, particularly when the values are not missing at random.
  3. Imputation is another common technique, in which each missing value is replaced with a best guess.
    1. Imputation with mean: Missing data is replaced by the mean of the column.
    2. Imputation with median: Missing data is replaced by the median of the column, which is less sensitive to outliers than the mean.
    3. Imputation with mode: Missing data is replaced by the mode of the column. This also works for categorical columns.
    4. Imputation with linear regression: This is a common technique for real-valued data. The missing value is replaced by the prediction of a linear regression fitted on the other feature values.
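
The heuristics above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not a reference implementation: the toy dataset, the column layout (age, income, city) and the helper names `drop_incomplete_rows`, `missing_fraction`, `impute_column` and `impute_by_regression` are all made up for this example, missing values are assumed to be encoded as `None`, and the regression uses a single predictor column for brevity.

```python
import statistics

# Hypothetical toy dataset: (age, income, city); None marks a missing value.
rows = [
    (25, 40.0, "NY"),
    (30, 50.0, "NY"),
    (35, None, "SF"),
    (40, 70.0, None),
    (None, 80.0, "NY"),
]

def drop_incomplete_rows(rows):
    """Heuristic 1: keep only rows that have no missing values."""
    return [r for r in rows if all(v is not None for v in r)]

def missing_fraction(rows, col):
    """Heuristic 2: fraction of missing values in a column.
    A caller might drop the column when this exceeds roughly 0.5-0.6."""
    return sum(r[col] is None for r in rows) / len(rows)

def impute_column(rows, col, strategy=statistics.mean):
    """Heuristic 3.1-3.3: fill missing values with a statistic of the
    observed values (statistics.mean, statistics.median or statistics.mode)."""
    observed = [r[col] for r in rows if r[col] is not None]
    fill = strategy(observed)
    return [
        r[:col] + (fill,) + r[col + 1:] if r[col] is None else r
        for r in rows
    ]

def impute_by_regression(rows, x_col, y_col):
    """Heuristic 3.4: fit y = intercept + slope * x by least squares on the
    rows where both values are present, then predict the missing y values."""
    pairs = [
        (r[x_col], r[y_col])
        for r in rows
        if r[x_col] is not None and r[y_col] is not None
    ]
    x_mean = statistics.mean([x for x, _ in pairs])
    y_mean = statistics.mean([y for _, y in pairs])
    slope = (
        sum((x - x_mean) * (y - y_mean) for x, y in pairs)
        / sum((x - x_mean) ** 2 for x, _ in pairs)
    )
    intercept = y_mean - slope * x_mean
    filled = []
    for r in rows:
        if r[y_col] is None and r[x_col] is not None:
            prediction = intercept + slope * r[x_col]
            r = r[:y_col] + (prediction,) + r[y_col + 1:]
        filled.append(r)
    return filled

complete_rows = drop_incomplete_rows(rows)                # listwise deletion
income_missing = missing_fraction(rows, 1)                # share of missing incomes
age_filled = impute_column(rows, 0)                       # mean imputation
city_filled = impute_column(rows, 2, statistics.mode)     # mode imputation
income_filled = impute_by_regression(rows, 0, 1)          # regression imputation
```

Each helper returns a new list and leaves `rows` untouched, so the strategies can be compared side by side on the same data. In practice, libraries such as pandas and scikit-learn provide equivalents of these operations, so hand-written helpers like these are mainly useful for seeing what each heuristic does.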
