How do you detect outliers in data? How do you deal with them?

Outliers come up in almost every machine learning problem. There is no single fixed solution, only heuristics that depend on the problem and the data. There are two types of outliers: univariate and multivariate. A univariate outlier is a data point whose value on a single feature deviates from the values of the other data points on that feature. When the data has few dimensions, univariate outliers can be found by plotting the data and looking for points that lie far from the bulk of it. One such visualization is the box plot, where outliers appear as individual points beyond the whiskers while the middle half of the data lies inside the box. A multivariate outlier is unusual only as a combination of values across several features, so finding one means looking at the n-dimensional feature space, which is difficult for humans. Bivariate outliers, however, can still be detected with scatter plots. Automated methods for detecting outliers include the Z-score, probabilistic modeling, clustering, and linear regression models.
The simplest method is the Z-score, which indicates how many standard deviations a data point lies from the mean, assuming a Gaussian distribution. A common rule of thumb flags points whose absolute Z-score exceeds 3. The Z-score is useful for parametric distributions in low-dimensional feature spaces.
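As a rough illustration, the Z-score rule can be sketched in a few lines of standard-library Python. The function name `zscore_outliers`, the threshold of 3 and the sample readings are assumptions made for this example, not part of any particular library.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=3.0):
    """Return (index, value, z) for every point with |z| above threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, nothing can be flagged
    flagged = []
    for i, v in enumerate(values):
        z = (v - mu) / sigma
        if abs(z) > threshold:
            flagged.append((i, v, z))
    return flagged


# Hypothetical sensor readings with one suspicious value at the end.
readings = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7,
            10.1, 9.9, 10.0, 10.2, 9.8, 25.0]
flagged = zscore_outliers(readings)
for index, value, z in flagged:
    print(f"index={index} value={value} z={z:.2f}")
```

Note that the outlier itself inflates both the mean and the standard deviation, so on very small samples a threshold of 3 may never be reached.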
DBSCAN is a density-based clustering method that is useful for outlier detection. Points that are not assigned to any cluster (DBSCAN labels them as noise), or that end up in very small clusters of their own, are treated as outliers.
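To make the idea concrete, here is a minimal pure-Python sketch of DBSCAN that marks noise points with the label -1. It is a simplified, brute-force version written for illustration only. The parameter names `eps` and `min_samples` and the toy points are assumptions of this example, and a real project would more likely rely on an existing library implementation.

```python
from math import dist


def dbscan(points, eps, min_samples):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    NOISE = -1
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbours(i):
        # Brute-force search: every point within eps of point i (itself included).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1  # point i is a core point, so it starts a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_samples:
                queue.extend(reachable)  # j is a core point, keep expanding
    return labels


# Two tight groups plus one isolated point (made-up data).
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2),
          (5.0, 5.0), (5.1, 4.9), (4.9, 5.2), (5.2, 5.1),
          (9.0, 0.5)]
labels = dbscan(points, eps=0.5, min_samples=3)
outliers = [p for p, label in zip(points, labels) if label == -1]
print(labels)
print(outliers)
```

The result depends heavily on `eps` and `min_samples`. A radius that is too small turns most points into noise, while one that is too large absorbs genuine outliers into a cluster.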
Isolation Forest is designed specifically for outlier detection. It is an ensemble of randomly built decision trees, similar in spirit to a random forest. Each tree splits the dataset on a randomly chosen feature at a randomly chosen value, and the split is then repeated on each resulting partition with further random features until the points are isolated. The number of splits needed to isolate a given data point is its path length. Outliers are expected to have shorter path lengths and to sit closer to the root. One-Class SVM, a variant of SVM, is another outlier detection method. It is trained on data assumed to be mostly normal, learns a boundary around that data, and flags points that fall outside the boundary as outliers.
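The path-length idea behind Isolation Forest can be sketched in plain Python as follows. This is a deliberately simplified version: every tree is grown on the full dataset until points are isolated, and the raw average path length is reported instead of a normalized anomaly score. The function names, the number of trees, the seed and the toy data are all assumptions of this example.

```python
import random


def build_tree(rows, rng):
    """Recursively split rows on a random feature at a random value."""
    if len(rows) <= 1:
        return None  # leaf: the point is isolated
    # Only features on which the rows still differ can separate them.
    usable = [f for f in range(len(rows[0]))
              if min(r[f] for r in rows) < max(r[f] for r in rows)]
    if not usable:
        return None  # leaf: remaining rows are identical
    feature = rng.choice(usable)
    lo = min(r[feature] for r in rows)
    hi = max(r[feature] for r in rows)
    split = rng.uniform(lo, hi)
    left = [r for r in rows if r[feature] < split]
    right = [r for r in rows if r[feature] >= split]
    return (feature, split, build_tree(left, rng), build_tree(right, rng))


def path_length(tree, row):
    """Count the splits passed before the row reaches a leaf."""
    depth = 0
    while tree is not None:
        feature, split, left, right = tree
        tree = left if row[feature] < split else right
        depth += 1
    return depth


def mean_path_lengths(rows, n_trees=100, seed=0):
    """Average path length of every row over n_trees random trees."""
    rng = random.Random(seed)
    trees = [build_tree(rows, rng) for _ in range(n_trees)]
    return [sum(path_length(t, r) for t in trees) / n_trees for r in rows]


# Twelve points in the unit square plus one far-away point (made-up data).
data = [(0.1, 0.2), (0.3, 0.1), (0.2, 0.4), (0.5, 0.5), (0.4, 0.3),
        (0.6, 0.2), (0.7, 0.6), (0.8, 0.4), (0.3, 0.7), (0.9, 0.8),
        (0.5, 0.9), (0.6, 0.7), (10.0, 10.0)]
lengths = mean_path_lengths(data)
most_isolated = min(range(len(data)), key=lengths.__getitem__)
print(data[most_isolated], lengths[most_isolated])
```

The far-away point is usually cut off by the very first split, so its average path length is much shorter than that of the points inside the dense group.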

Please visit this page for more explanation on DBSCAN and Isolation forest.  
