What is the problem with random (uniform) sampling of the test set from the entire dataset?

  • Purely random sampling of the test set can introduce sampling bias, because the test set might not be representative of the entire population. For example, when predicting the winning party in an election, suppose 30% of voters are rural and 70% are urban. A uniformly sampled test set matches these proportions only on average; any single draw, especially a small one, can drift away from them (say, towards 50% for each class), and such a test set is not representative of the entire population. To avoid this, use stratified sampling, which draws from each class (stratum) in proportion to its share of the population.
  • sklearn has a class for stratified sampling called StratifiedShuffleSplit, imported in Python with “from sklearn.model_selection import StratifiedShuffleSplit”. For more on its usage and parameters, visit here.
  • Read this question too for more understanding.
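
As a minimal sketch of the election example above, the snippet below builds a hypothetical toy dataset of 1000 voters (30% rural, 70% urban) and splits it with StratifiedShuffleSplit. The feature matrix, the variable names, and the 20% test size are assumptions made for illustration only; they are not from the original post.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical toy data: 1000 voters, 300 rural (label 0) and 700 urban (label 1).
rng = np.random.default_rng(0)
stratum = np.array([0] * 300 + [1] * 700)
X = rng.normal(size=(1000, 3))  # placeholder features, assumed for illustration

# One stratified split that holds out 20% of the rows as the test set.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X, stratum))

# Share of rural voters in each part; both should stay close to 30%.
rural_share_train = float(np.mean(stratum[train_idx] == 0))
rural_share_test = float(np.mean(stratum[test_idx] == 0))

print("rural share in train:", rural_share_train)
print("rural share in test:", rural_share_test)
```

The second argument to split() is the array of labels to stratify on. Here it is the rural/urban label, but any categorical column that must stay balanced between train and test could be used instead.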
