Machine Learning Evaluation Metrics

The purpose of any machine learning algorithm is to predict the right value or class for unseen data. This is called generalization, and ensuring it can be very tricky. How it is assessed depends on the algorithm being used, for both supervised and unsupervised learning tasks. There are two things to consider in this process: the dataset to test on and the evaluation metric.

Usually some portion of the entire dataset, called the test set, is held out to evaluate how well the model generalizes. The evaluation metric is then chosen depending on the problem at hand and the model being built. It will soon be clear how the problem, and not just the model, can determine the metric.

Evaluation metric for Regression:

Root mean square error (RMSE) is the most common evaluation metric for regression models. It is calculated by taking the square root of the mean of the squared differences between the actual and predicted values. The lower the RMSE, the better the model.

    \[RMSE = \sqrt{\frac{\sum_{i=1}^{n} (y^{actual}_i - y^{predicted}_i)^2}{n}}\]

Mean Absolute Error (MAE) is an alternative to RMSE, defined as the mean of the absolute differences between the actual and predicted values. Because the errors are not squared, MAE is less sensitive to outliers than RMSE.

    \[MAE = \frac{\sum_{i=1}^{n} |y^{actual}_i - y^{predicted}_i |}{n}\]
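The two formulas above translate almost directly into code. The following is a minimal sketch in plain Python; the function names `rmse` and `mae` and the sample values are made up for illustration and are not taken from any library.

```python
import math

def rmse(actual, predicted):
    """Root mean square error: square root of the mean squared difference."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def mae(actual, predicted):
    """Mean absolute error: mean of the absolute differences."""
    n = len(actual)
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / n

# Hypothetical values, chosen only for illustration.
y_actual = [3.0, 5.0, 2.0, 7.0]
y_predicted = [2.0, 5.0, 4.0, 3.0]

print("RMSE:", rmse(y_actual, y_predicted))  # sqrt(21 / 4)
print("MAE: ", mae(y_actual, y_predicted))   # 7 / 4
```

In this toy example the single large error (7 predicted as 3) pulls RMSE above MAE, because squaring gives large errors more weight.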

For more metrics in regression models, please visit here.

Evaluation metric for Classification:

Several metrics are available for classification models, and the choice is usually dictated by the problem at hand. For example, accuracy should not be used when the dataset is highly skewed, i.e., when one class has many more examples (say 70% of all) than the other (30% of all).

Let us understand these metrics for binary classification, calling one class positive and the other negative. Then,

  1. TP represents the true positives, which is the number of positive instances correctly classified as positive.
  2. FP represents the false positives, which is the number of negative instances incorrectly classified as positive, i.e. they were identified as positive though they were from the negative class.
  3. TN represents the true negatives, which is the number of negative instances correctly classified as negative.
  4. FN represents the false negatives, which is the number of positive instances incorrectly classified as negative, i.e. they were identified as negative but in reality they were from the positive class.

Given the above definitions of the four counts, the following metrics can be used for evaluation.

Confusion Matrix is just a way to lay out the four counts defined above. Figure 1 shows the confusion matrix for binary classification, but it extends to more classes: its size becomes k by k for a k-class problem. The diagonal entries are the correct predictions, so a better model has larger diagonal entries and smaller off-diagonal ones. In practice, the off-diagonal entries will be non-zero, and depending on the problem one would prioritize minimizing either false positives or false negatives. For example, if the problem is to predict whether a patient has cancer (positive class) or not (negative class), minimizing FN is more important than minimizing FP, because FN is the case where a patient had cancer but was predicted not to have it! Another example where FN matters more than FP is predicting faulty machines (or machine parts).

An example where minimizing FP is more important than minimizing FN is anomaly detection on an e-commerce website, where an anomaly belongs to the positive class. FP is the case where a genuine user was blocked from buying products because they were labelled as an anomaly.

It is interesting to note that in credit card fraud detection (another anomaly detection problem), FN is more important, as a single missed fraudulent transaction can bankrupt the genuine user!

These examples should convince you how important it is to know the problem before deciding the metric.

                         Predicted Class
    Actual Class         Negative Class (N)    Positive Class (P)
    Negative Class (N)   TN                    FP
    Positive Class (P)   FN                    TP

Fig 1. Confusion Matrix
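The four counts in the confusion matrix can be tallied with a single pass over the labels. This is a minimal sketch, assuming labels are encoded as 1 (positive) and 0 (negative); the helper name `confusion_counts` and the sample labels are hypothetical.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (TP, FP, TN, FN) for binary labels."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1   # predicted positive, actually positive
            else:
                fp += 1   # predicted positive, actually negative
        else:
            if a == positive:
                fn += 1   # predicted negative, actually positive
            else:
                tn += 1   # predicted negative, actually negative
    return tp, fp, tn, fn

# Hypothetical labels: 1 = positive class, 0 = negative class.
y_actual    = [1, 0, 1, 1, 0, 0, 1, 0]
y_predicted = [1, 0, 0, 1, 1, 0, 1, 0]

tp, fp, tn, fn = confusion_counts(y_actual, y_predicted)

# Rows are actual classes, columns are predicted classes, as in Fig 1.
matrix = [[tn, fp],
          [fn, tp]]
print(matrix)
```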

Accuracy is the proportion of correctly labelled examples, i.e.,

    \[Accuracy\ =\ \frac{TP + TN}{TP + TN + FP + FN}\]


    \[Accuracy\ =\ \frac{ \sum_{i=1}^{n} I_{ (y_{i}^{actual}= y_{i}^{predicted}) } }{n}\]

where \(I_{(y_{i}^{actual} = y_{i}^{predicted})}\) is 1 if the prediction for example i is correct and 0 otherwise.

This measure should be used when the classes are present in roughly equal proportions, and should not be used for a skewed dataset.
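To see why accuracy misleads on skewed data, consider the 70%/30% split mentioned earlier. The sketch below uses a hypothetical "model" that ignores its input and always predicts the majority class; the function name `accuracy` and the data are made up for illustration.

```python
def accuracy(actual, predicted):
    """Fraction of examples whose predicted label equals the actual label."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)

# Hypothetical skewed dataset: 70 negatives (0) and 30 positives (1).
y_actual = [0] * 70 + [1] * 30

# A "model" that always predicts the majority (negative) class.
y_always_negative = [0] * 100

acc = accuracy(y_actual, y_always_negative)
positives_found = sum(
    1 for a, p in zip(y_actual, y_always_negative) if a == 1 and p == 1
)

print("accuracy:", acc)
print("positives correctly identified:", positives_found)
```

The accuracy comes out at 70% even though not a single positive example is identified, which is exactly the situation the other metrics in this section are meant to expose.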

Precision is the fraction of correct positives among the total predicted positives. It is also called the accuracy of positive predictions.

    \[Precision\ =\ \frac{TP}{TP + FP}\]

Recall is the fraction of correct positives among the total positives in the dataset. It indicates how many of the actual positives in the dataset were covered (classified correctly) by the predictions.

    \[Recall\ =\ \frac{TP}{TP + FN}\]
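Both formulas need only the counts from the confusion matrix. Here is a minimal sketch; the function names and the counts are hypothetical, and returning 0.0 when the denominator is zero is a convention assumed for this example rather than part of the definitions above.

```python
def precision(tp, fp):
    """TP / (TP + FP). Returns 0.0 if nothing was predicted positive (assumed convention)."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0

def recall(tp, fn):
    """TP / (TP + FN). Returns 0.0 if there are no actual positives (assumed convention)."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0

# Hypothetical counts: 50 predicted positives, of which 40 are correct,
# and 60 actual positives, of which 20 were missed.
tp, fp, fn = 40, 10, 20

print("precision:", precision(tp, fp))  # 40 / 50
print("recall:   ", recall(tp, fn))     # 40 / 60
```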

Usually there is a precision-recall tradeoff, as is evident from Figure 2. The figure shows how precision decreases when one tries to increase recall, and vice versa.

Some problems require higher precision and others prefer higher recall. More often, the right balance between the two needs to be found, and the F1 Score summarizes that balance in a single number.

Fig 2. Precision vs Recall (the precision-recall tradeoff)

F1 Score is defined as the harmonic mean of precision and recall. It is useful for finding the right balance between the two, because it is high only when both precision and recall are high.

    \[F1\ Score\ =\ \frac{2*Precision*Recall}{Precision+Recall}\]
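A quick calculation shows how the harmonic mean behaves when precision and recall are unbalanced. This is only a sketch; the function name `f1` and the input values are hypothetical, and returning 0.0 when both inputs are zero is an assumed convention.

```python
def f1(p, r):
    """Harmonic mean of precision p and recall r."""
    if p + r == 0:
        return 0.0  # assumed convention to avoid division by zero
    return 2 * p * r / (p + r)

balanced = f1(0.8, 0.8)   # both metrics equally good
lopsided = f1(1.0, 0.1)   # perfect precision, very poor recall

print("balanced:", balanced)
print("lopsided:", lopsided)
print("arithmetic mean of lopsided case:", (1.0 + 0.1) / 2)
```

In the lopsided case the arithmetic mean would be 0.55, while the F1 Score is 2/11 (about 0.18): the harmonic mean stays close to the weaker of the two values.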

ROC is the Receiver Operating Characteristic curve, which is a graphical (non-numeric) way of evaluating a classification algorithm, just like the confusion matrix. The ROC curve is plotted with False Positive Rate (FPR) on the x-axis and True Positive Rate (TPR) on the y-axis, as shown in Figure 3. Each point on the curve corresponds to one decision threshold of the classifier.

    \[TPR\ =\ \frac{TP}{TP + FN}\]

 TP + FN denotes the total number of actual positive records in the dataset. 

    \[FPR\ =\ \frac{FP}{TN + FP}\]

TN + FP denotes the total number of actual negative records in the dataset. Note that Recall is the same as TPR and is also called Sensitivity. Specificity is defined as 1 - FPR.

Fig 3. ROC Curve (TPR plotted against FPR)

In the above figure, the dashed line represents a random classifier, whose area under the curve is 0.5. Any trained model should lie above this dashed line and closer to the top left corner of the plot. The closer the curve is to the top left corner, the better the classification model. To quantify this, a numeric metric called AUC is defined on the ROC curve.

AUC is a numeric metric equal to the Area Under the ROC Curve. The higher the area, the better the model. As mentioned earlier, AUC for a random classifier is 0.5, so any useful trained model should have an AUC greater than 0.5.
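The ROC curve and its AUC can be sketched in a few lines by sweeping the decision threshold over the model's scores and applying the TPR and FPR formulas at each step. The code below is an illustrative implementation, not a library routine: the names `roc_points` and `auc` and the sample data are hypothetical, labels are assumed to be 0/1, and the dataset is assumed to contain at least one positive and one negative example.

```python
def roc_points(actual, scores):
    """Return (FPR, TPR) points obtained by sweeping the decision threshold.

    An instance is predicted positive when its score >= threshold.
    """
    n_pos = sum(1 for a in actual if a == 1)
    n_neg = len(actual) - n_pos
    # Threshold above every score: nothing is predicted positive.
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for a, s in zip(actual, scores) if s >= threshold and a == 1)
        fp = sum(1 for a, s in zip(actual, scores) if s >= threshold and a == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under a piecewise-linear curve, using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical labels and predicted scores for four examples.
y_actual = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_actual, y_scores)
print(points)
print("AUC:", auc(points))
```

For this toy data the AUC is 0.75, above the 0.5 of a random classifier but short of the 1.0 of a perfect one, because one negative example (score 0.4) is ranked above one positive example (score 0.35).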

As a rule of thumb, prefer the PR curve (Fig 2) when the positive class is rare or when you care more about false positives than false negatives, and the ROC curve (Fig 3) otherwise.

Evaluation metric for Clustering: 

When true labels are available, the classification metrics above can also be applied to clustering, since a clustering problem has a fixed number of clusters. Instead of a prediction, each instance receives an assignment to one of the clusters, and treating this assignment as a prediction makes the classification metrics easier to carry over. However, clustering is an unsupervised task, so true labels are usually not available to measure model performance. The metrics below are used for clustering with K-means.

Inertia measures how far the instances are from their centroids. Mathematically, inertia is the sum of the squared distances between each training instance and its closest centroid. The lower the inertia, the tighter the clusters. However, inertia keeps decreasing as K increases, so it cannot simply be minimized. Instead, the right value of K in K-means is chosen by plotting inertia on the y-axis against different values of K on the x-axis. The curve is expected to have an elbow shape, and K is picked at the elbow, beyond which increasing K brings only small reductions in inertia. This method of choosing K is called the elbow method.

Fig 4. Optimal value of K using the elbow method
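Inertia itself is simple to compute once the centroids are known. The sketch below does not run K-means; the centroids are hand-picked as the means of the obvious groups, and the function names and data are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def inertia(points, centroids):
    """Sum of squared distances from each point to its closest centroid."""
    return sum(min(squared_distance(p, c) for c in centroids) for p in points)

# Hypothetical 2-D data forming two tight groups.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

# Hand-picked centroids (group means), standing in for K-means output.
one_centroid = [(5.0, 1.0)]
two_centroids = [(0.0, 1.0), (10.0, 1.0)]

print("K=1 inertia:", inertia(points, one_centroid))
print("K=2 inertia:", inertia(points, two_centroids))
```

Going from one centroid to two drops the inertia from 104 to 4, the kind of sharp fall that appears before the elbow in a plot like Fig 4.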

Silhouette score is the mean silhouette coefficient over all the instances. An instance's silhouette coefficient is equal to (b−a)/max(a,b), where a is the mean distance to the other instances in the same cluster (the mean intra-cluster distance) and b is the mean nearest-cluster distance, that is, the mean distance to the instances of the next closest cluster (defined as the one that minimizes b, excluding the instance's own cluster). The silhouette coefficient can vary between -1 and +1. A coefficient close to +1 means that the instance is well inside its own cluster and far from other clusters, a coefficient close to 0 means that it is close to a cluster boundary, and a coefficient close to -1 means that the instance may have been assigned to the wrong cluster.

Fig 5. Silhouette score at different values of K
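The definition of the silhouette coefficient can be followed step by step in code. This is a minimal sketch under a few assumptions: Euclidean distance is used, there are at least two clusters, and an instance alone in its cluster gets a coefficient of 0.0 (a convention chosen for this example). The function names and the data are hypothetical.

```python
import math

def mean_distance(point, others):
    """Mean Euclidean distance from point to each point in others."""
    return sum(math.dist(point, o) for o in others) / len(others)

def silhouette_coefficient(index, points, labels):
    """(b - a) / max(a, b) for the instance at the given index."""
    point, own = points[index], labels[index]
    same = [p for i, p in enumerate(points) if labels[i] == own and i != index]
    if not same:
        return 0.0  # assumed convention for a single-instance cluster
    a = mean_distance(point, same)  # mean intra-cluster distance
    b = min(                        # mean distance to the nearest other cluster
        mean_distance(point, [p for p, l in zip(points, labels) if l == other])
        for other in set(labels) if other != own
    )
    return (b - a) / max(a, b)

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all instances."""
    n = len(points)
    return sum(silhouette_coefficient(i, points, labels) for i in range(n)) / n

# Hypothetical 1-D data with two obvious groups.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]

good = silhouette_score(points, [0, 0, 1, 1])  # sensible assignment
bad = silhouette_score(points, [0, 1, 0, 1])   # groups mixed up

print("good clustering:", good)
print("bad clustering: ", bad)
```

The sensible assignment scores close to +1, while the mixed-up assignment scores below 0, matching the interpretation of the coefficient given above.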

Implementing the above metrics for the problem you're working on might look like a difficult task. Fortunately, scikit-learn (sklearn) has built-in functions for these metrics, mostly in its sklearn.metrics module, and a fitted KMeans model exposes inertia through its inertia_ attribute.

So before accepting your machine learning model, do not forget to measure its performance by either plotting or calculating a numeric metric. Good luck!
