There is no fixed rule for dealing with missing data; use the heuristics mentioned above as a guide.
- The most common approach is to remove every row that contains missing data, provided there are not too many such rows.
- If more than 50-60% of the values in a specific column are missing, it is common to remove the column. The main problem with removing missing data, whether rows or columns, is that it can introduce substantial bias, particularly when the values are not missing at random.
- Imputation is another common technique for dealing with missing data: each missing value is substituted with a best guess.
  - Imputation with mean: missing data is replaced by the mean of the column.
  - Imputation with median: missing data is replaced by the median of the column.
  - Imputation with mode: missing data is replaced by the mode of the column.
  - Imputation with linear regression: a common technique for real-valued data. The missing value is replaced by the prediction of a linear regression fitted on the other feature values.
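
The options listed above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the toy table, its column names, and every helper function are hypothetical; `None` stands in for a missing value; the 0.5 threshold mirrors the 50-60% heuristic; and the regression uses a single predictor column for brevity rather than all other features.

```python
from statistics import mean, median, mode

# Hypothetical toy table: one dict per row, None marks a missing value.
rows = [
    {"age": 25, "income": 50.0, "city": "A"},
    {"age": 30, "income": None, "city": "B"},
    {"age": None, "income": 70.0, "city": "A"},
    {"age": 40, "income": 80.0, "city": None},
    {"age": 35, "income": 70.0, "city": "A"},
]

def drop_incomplete_rows(rows):
    """Keep only the rows that have no missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]

def missing_fraction(rows, column):
    """Fraction of rows whose value in `column` is missing."""
    return sum(r[column] is None for r in rows) / len(rows)

def drop_sparse_columns(rows, threshold=0.5):
    """Remove columns whose missing fraction exceeds `threshold`."""
    keep = [c for c in rows[0] if missing_fraction(rows, c) <= threshold]
    return [{c: r[c] for c in keep} for r in rows]

def impute(rows, column, statistic):
    """Fill missing values in `column` with statistic(observed values)."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = statistic(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]

def impute_by_regression(rows, target, predictor):
    """Fill missing `target` values with a least-squares line fitted on `predictor`.

    Assumes the observed predictor values are not all identical.
    """
    pairs = [(r[predictor], r[target]) for r in rows
             if r[predictor] is not None and r[target] is not None]
    x_bar = mean(x for x, _ in pairs)
    y_bar = mean(y for _, y in pairs)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in pairs)
             / sum((x - x_bar) ** 2 for x, _ in pairs))
    intercept = y_bar - slope * x_bar
    filled = []
    for r in rows:
        if r[target] is None and r[predictor] is not None:
            r = {**r, target: intercept + slope * r[predictor]}
        filled.append(r)
    return filled

complete = drop_incomplete_rows(rows)         # row removal
dense = drop_sparse_columns(rows, 0.5)        # column removal
by_mean = impute(rows, "age", mean)           # mean imputation
by_median = impute(rows, "income", median)    # median imputation
by_mode = impute(rows, "city", mode)          # mode imputation (works for categories)
by_regression = impute_by_regression(rows, "income", "age")
```

Each helper returns a new list and leaves `rows` untouched, so the strategies can be compared side by side on the same data. In practice a data-frame or machine-learning library would normally be used for these steps; the hand-written versions are only meant to make each one explicit.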