An imbalanced dataset, or the imbalanced class problem, poses various challenges. One of the many possible ways of solving this problem is oversampling the minority class. However, oversampling without addressing the following issues can be dangerous: usually, we begin by splitting the entire dataset into a training and a test set. The training set is further split into training…
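A minimal sketch of the safe ordering described above: split first, then oversample only the training set, so duplicated minority samples can never leak into the test set. The function names and the random-duplication strategy are illustrative assumptions, not a specific library API.

```python
import random

def train_test_split(samples, labels, test_fraction=0.2, seed=0):
    """Shuffle and split BEFORE any resampling, so oversampled
    copies of minority samples cannot leak into the test set."""
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    rng.shuffle(indices)
    cut = int(len(indices) * (1 - test_fraction))
    train_idx, test_idx = indices[:cut], indices[cut:]
    return ([samples[i] for i in train_idx], [labels[i] for i in train_idx],
            [samples[i] for i in test_idx], [labels[i] for i in test_idx])

def oversample_minority(X, y, minority_label, seed=0):
    """Randomly duplicate minority-class samples until the two classes
    are balanced. Apply this to the training set only."""
    rng = random.Random(seed)
    minority = [(x, lbl) for x, lbl in zip(X, y) if lbl == minority_label]
    majority = [(x, lbl) for x, lbl in zip(X, y) if lbl != minority_label]
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    combined = minority + majority + extra
    rng.shuffle(combined)
    X_out, y_out = zip(*combined)
    return list(X_out), list(y_out)
```

Running the split before `oversample_minority` is the whole point: reversing the order would put copies of the same minority sample on both sides of the split and inflate test accuracy.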

# Author: InterviewBuddy

## What are the challenges of an imbalanced dataset in machine learning?

Many machine learning problems come with the issue of an imbalanced dataset. This could be due either to the nature of the problem itself or to the way the data was collected. For example, in applications like fraud detection, fraudulent transactions are relatively rare compared to normal transactions. Such problems fall in the first category. In other…
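One quick way to quantify the imbalance, and a common remedy besides resampling, is inverse-frequency class weighting. The "balanced" heuristic below (weight = N / (K * count)) is the convention several ML libraries use; the function itself is a sketch, not any particular library's API.

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency ('balanced') class weights:
    weight_c = N / (K * count_c), where N is the total number of
    samples and K the number of classes. Rare classes get large
    weights, so the loss penalizes their errors more heavily."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}
```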

## Error analysis in supervised machine learning

Every supervised learning problem encounters bias error, variance error, or both. Please refer to this page if you want more intuition about bias and variance error, as it will help in understanding this post. Once you know where (bias or variance) your model is going wrong, it becomes easier to decide the next direction. This…
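The standard diagnosis compares training error and dev error: a large gap between the target (Bayes/human-level) error and the training error suggests bias, while a large gap between training and dev error suggests variance. A small sketch of that decision rule, with the suggested remedies as assumptions in the usual spirit of this analysis:

```python
def diagnose(train_error, dev_error, target_error=0.0):
    """Rough bias/variance diagnosis from error rates.
    target_error approximates Bayes (or human-level) error."""
    avoidable_bias = train_error - target_error
    variance = dev_error - train_error
    if avoidable_bias >= variance:
        return "high bias: try a bigger model or train longer"
    return "high variance: try more data or regularization"
```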

## How to handle incorrectly labeled samples in the training or dev set?

While doing error analysis, you might find that your dataset has incorrectly labeled samples. These incorrectly labeled samples can be present in the training set, dev set, or test set. Note that the dev set is also called the validation set. Incorrect labels in the training set: there are two possibilities when incorrect labels exist in…
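A useful first step is estimating how much of the dev error the incorrect labels actually explain, by manually inspecting a sample of misclassified examples. The sketch below assumes you have already counted mislabeled examples in such a sample; the function name and interface are illustrative.

```python
def mislabel_impact(dev_error_rate, n_errors_sampled, n_mislabeled_in_sample):
    """Estimate the share of dev error caused by incorrect labels,
    from a manually inspected sample of the model's errors.
    Returns (error due to mislabels, remaining error)."""
    frac_mislabeled = n_mislabeled_in_sample / n_errors_sampled
    error_from_mislabels = dev_error_rate * frac_mislabeled
    return error_from_mislabels, dev_error_rate - error_from_mislabels
```

If the mislabel share is small relative to the remaining error, fixing labels is usually not the highest-leverage next step.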

## How to do error analysis efficiently in machine learning?

Error analysis is required to improve a machine learning model's performance. Quite often, a model does not reach its maximum achievable performance, even after several iterations of cross-validation. Hence, error analysis should be performed to find the root cause or causes of poor performance. Consider a sample application like building a…
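In practice, efficient error analysis usually means manually tagging a sample of misclassified examples with failure categories and ranking those categories, so effort goes to the most common failure mode first. A minimal tally, where the tag names are made-up examples:

```python
from collections import Counter

def summarize_error_tags(tagged_errors):
    """Given a list of tag-lists, one per misclassified example
    (an example may carry several tags, e.g. ['blurry', 'mislabeled']),
    rank failure categories by how many examples they appear in.
    Returns (tag, count, fraction-of-errors) tuples, most common first."""
    counts = Counter(t for tags in tagged_errors for t in tags)
    total = len(tagged_errors)
    return [(tag, n, n / total) for tag, n in counts.most_common()]
```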

## What is Bayes error? What is the best approximation to Bayes error?

Bayes error is the lowest possible error one can achieve on a set of data samples. Suppose there are 10000 images of an object such as a chair, and the machine learning task is to detect those objects. We find that the best accuracy achievable by anyone in the world is 99.25%. The Bayes error in this…
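The 99.25% human-level accuracy from the example gives a 0.75% error floor that serves as a proxy for Bayes error. The worked arithmetic below uses that figure; the train and dev error values are assumed numbers purely for illustration.

```python
# Human-level error as a proxy for Bayes error: best human accuracy
# of 99.25% (from the example above) gives a 0.75% error floor.
human_error = 1.0 - 0.9925   # 0.0075, approximates Bayes error
train_error = 0.02           # assumed, for illustration
dev_error = 0.05             # assumed, for illustration

avoidable_bias = train_error - human_error  # gap the model could still close
variance = dev_error - train_error          # generalization gap
```

Error below the Bayes floor is unattainable, so `avoidable_bias`, not raw training error, is the meaningful bias measure.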

## What are the drawbacks of an n-gram language model?

An n-gram language model is a non-deep-learning method of building a language model. In a trigram (3-gram) model, the probability of a word w after a sequence of two words is given by P(w | “word_1 word_2”) = count(“word_1 word_2 w”) / count(“word_1 word_2”), where “word_1 word_2” is an ordered sequence of two…
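The count-based formula above can be sketched directly. Note how an unseen two-word context yields probability 0.0: this sparsity is exactly the kind of drawback (addressed by smoothing) that the post discusses.

```python
from collections import Counter

def trigram_prob(tokens, context, word):
    """P(word | context) by counting, where context is a tuple of the
    two preceding words: count(context + word) / count(context).
    Returns 0.0 for an unseen context -- the sparsity problem that
    smoothing techniques exist to fix."""
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    bigrams = Counter(zip(tokens, tokens[1:]))
    if bigrams[context] == 0:
        return 0.0
    return trigrams[context + (word,)] / bigrams[context]
```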

## What is the best strategy for choosing an evaluation metric?

Any machine learning model has an evaluation stage. Various metrics are possible; however, one of the best strategies is to follow the rules below: application-level tradeoffs influence the ML-level tradeoffs, which in turn lead to multiple metrics. Always have one metric to optimize, and put constraints on the rest. As…
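The "one optimizing metric, constraints on the rest" rule can be sketched as a model-selection helper. Accuracy as the optimizing metric and latency as the constrained (satisficing) one are assumed examples, not prescriptions:

```python
def pick_model(candidates, max_latency_ms=100.0):
    """One optimizing metric (accuracy) plus a satisficing constraint
    (latency). Candidates violating the constraint are dropped; the
    highest accuracy among the rest wins. Returns None if no candidate
    satisfies the constraint."""
    feasible = [c for c in candidates if c["latency_ms"] <= max_latency_ms]
    if not feasible:
        return None
    return max(feasible, key=lambda c: c["accuracy"])
```

Collapsing everything into a single weighted score is the common alternative, but hard constraints keep the tradeoff explicit and easy to revise.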

## Why is named entity recognition hard?

Named entity recognition is the problem of finding and classifying names in text. Consider the sentence “State Bank of India provides good interest rates for the National Public School”. It is hard to work out the boundaries of an entity: for example, we don’t know whether “State Bank of India” or “State Bank” is the entity. Hard…

## What is cross entropy loss in deep learning?

Cross entropy loss is a loss function heavily used in deep learning models. It is derived from information theory. To explain cross entropy, let the true probability distribution be p and the computed model probability be q. Then the cross entropy loss or error is given by H(p, q) = −Σᵢ p(i) log q(i). Cross entropy measures how the predicted probability…
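The definition H(p, q) = −Σᵢ p(i) log q(i) translates directly into code. The small epsilon guard against log(0) is an assumption for numerical safety, not part of the mathematical definition:

```python
import math

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_i p_i * log(q_i), where p is the true
    distribution and q the model's predicted distribution.
    eps guards against log(0) when the model assigns zero mass."""
    return -sum(pi * math.log(qi + eps) for pi, qi in zip(p, q))
```

With a one-hot true distribution, only the log-probability of the correct class contributes, which is why cross entropy reduces to negative log-likelihood in classification.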