How to do error analysis efficiently in machine learning?

Error analysis is needed to improve the performance of a machine learning model. Quite often, a model falls short of its best achievable performance, even after several iterations of cross-validation. In that case, error analysis should be performed to find the root cause or causes of the poor performance.

Consider a sample application such as building a model that classifies inputs into two classes. If the model doesn’t perform well, in other words, if a metric such as accuracy is poor, one possible solution is to collect more data. However, collecting more data might take several months, which delays delivery of the project. That is not efficient. Instead, the following manual error analysis approach should be used:

  1. Select 100-200 samples from the dev set that the model mislabeled. The dev set is the held-out data used to evaluate and tune the model during development.
  2. Do manual error analysis with the help of the table shown below. This tabular analysis is more structured and efficient because it helps determine what kind of data to collect to improve model performance.
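
Step 1 can be sketched in a few lines of Python. This is only an illustration, assuming the dev-set labels and model predictions are available as two plain lists; the function name `sample_mislabeled` and the toy data are hypothetical and not part of any library.

```python
import random

def sample_mislabeled(y_true, y_pred, k=100, seed=0):
    """Return up to k indices of dev-set samples the model mislabeled."""
    # Indices where the prediction disagrees with the ground-truth label.
    wrong = [i for i, (t, p) in enumerate(zip(y_true, y_pred)) if t != p]
    # Sample without replacement; a fixed seed keeps the review set reproducible.
    rng = random.Random(seed)
    return sorted(rng.sample(wrong, min(k, len(wrong))))

# Toy dev set (assumed data): labels and predictions for a two-class problem.
y_true = [0, 1, 1, 0, 1, 0, 0, 1]
y_pred = [0, 0, 1, 1, 1, 0, 1, 1]

picked = sample_mislabeled(y_true, y_pred, k=2)
print(picked)
```

Sampling at random, rather than taking the first mislabeled samples in file order, helps keep the reviewed set representative of the errors in the whole dev set.
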

Error analysis using structured analysis of mislabeled samples:

The table below is built from the 100-200 mislabeled samples taken from the dev set. Each sample is examined to identify the issues behind the error. Instead of collecting data blindly, which can be costly and time-consuming, one can use this analysis to find the right and most impactful direction.

[Image: error analysis for machine learning model]
Structured and efficient error analysis. Assume the data samples are images. Each row is the analysis of one mislabeled sample. Each column is one of the candidate issues whose fix could improve model performance.
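
A table of this kind can also be kept in code and tallied automatically. The sketch below is a minimal illustration under stated assumptions: the sample names and the issue categories (`blurry`, `multiple_objects`) are made up for the example, and the reviewer's notes are assumed to be stored as a list of dictionaries.

```python
from collections import Counter

# Hypothetical review notes: one row per mislabeled image, listing the
# issues a human reviewer ticked for it (category names are illustrative).
rows = [
    {"sample": "img_001", "issues": ["blurry"]},
    {"sample": "img_002", "issues": ["multiple_objects"]},
    {"sample": "img_003", "issues": ["blurry", "multiple_objects"]},
    {"sample": "img_004", "issues": []},  # no obvious issue found
    {"sample": "img_005", "issues": ["blurry"]},
]

def tally(rows):
    """Return {issue: share of reviewed samples showing that issue}."""
    counts = Counter(issue for row in rows for issue in row["issues"])
    # most_common() orders the issues from most to least frequent.
    return {issue: n / len(rows) for issue, n in counts.most_common()}

shares = tally(rows)
for issue, share in shares.items():
    print(f"{issue:<18}{share:.0%}")
```

Because one sample can show several issues at once, the shares may add up to more than 100%.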

Therefore, one should not blindly collect more data to improve model performance. Instead, perform the error analysis described above by building a table from samples the model mislabeled in the dev set. The analysis will usually surface multiple ideas that can lead to an improved model, and these ideas or strategies can be executed in parallel. For example, collecting clear images to address blurriness and collecting single-object images to address images with multiple objects can happen simultaneously.
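
When several ideas compete for effort, a rough way to compare them is to estimate how much dev error each one could remove at best. The sketch below assumes a dev error of 10% and the issue shares shown; both are invented numbers for illustration, and the helper `max_error_reduction` is hypothetical. The result is only an upper bound, since fixing an issue rarely corrects every affected sample.

```python
def max_error_reduction(dev_error, issue_share):
    """Upper bound on the absolute dev error removed if every mislabeled
    sample showing the issue were classified correctly."""
    return dev_error * issue_share

dev_error = 0.10  # assumed: the model mislabels 10% of the dev set
shares = {"blurry": 0.60, "multiple_objects": 0.40}  # assumed tallies

for issue, share in shares.items():
    gain = max_error_reduction(dev_error, share)
    print(f"{issue}: at most {gain:.1%} absolute error reduction")
```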

This may not be the only way to do the analysis, but it can prove more efficient than shooting in the dark.
