Out-of-vocabulary (OOV) words are words that do not appear in the training set but do appear in the test set or in real data. The main problem is that the model assigns a probability of zero to OOV words, resulting in a zero likelihood for any sequence that contains them. This is a common problem, especially when you have not trained on a…
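To make the problem concrete, here is a minimal sketch (the toy tokens and function names are invented for this illustration, not taken from the original text) of how an unsmoothed maximum-likelihood model assigns probability 0.0 to a word it never saw in training:

```python
from collections import Counter

# Toy training data; all names here are illustrative assumptions.
train_tokens = ["the", "cat", "sat", "on", "the", "mat"]
counts = Counter(train_tokens)
total = sum(counts.values())

def unigram_prob(word):
    # Maximum-likelihood estimate: count(word) / total training tokens.
    # A word never seen in training has count 0, so its probability is 0.0.
    return counts[word] / total

print(unigram_prob("cat"))  # 1/6 ~ 0.167
print(unigram_prob("dog"))  # 0.0 -> any sentence containing "dog" gets likelihood 0
```

One common fix is to reserve a special <UNK> token during training and map unseen words to it at test time, so that OOV words still receive some probability mass.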
Given a bigram language model, in what scenarios do we encounter zero probabilities? How should we handle these situations?
Recall that the bigram model can be expressed as: $P(w_1, w_2, \ldots, w_n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1})$. The following scenarios can lead to a zero probability in the above expression: out-of-vocabulary (OOV) words – such words may not be present during training, so any probability term involving an OOV word is 0.0, which makes the entire product zero. This is solved…
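As a rough illustration of the product above, the sketch below (the toy corpus and helper names are assumptions for this example) estimates bigram probabilities by maximum likelihood and shows how a single zero factor — from an OOV word, or from two in-vocabulary words that never co-occurred in training — zeroes out the whole sentence probability:

```python
from collections import Counter

# Tiny illustrative corpus with sentence-boundary tokens.
corpus = [["<START>", "i", "like", "nlp", "<END>"],
          ["<START>", "i", "like", "dogs", "<END>"]]

bigram_counts = Counter()
context_counts = Counter()
for sentence in corpus:
    for w1, w2 in zip(sentence, sentence[1:]):
        bigram_counts[(w1, w2)] += 1
        context_counts[w1] += 1

def bigram_prob(w1, w2):
    # MLE estimate: P(w2 | w1) = count(w1, w2) / count(w1).
    # Returns 0.0 if the context w1 or the bigram (w1, w2) was never seen.
    if context_counts[w1] == 0:
        return 0.0
    return bigram_counts[(w1, w2)] / context_counts[w1]

def sentence_prob(sentence):
    p = 1.0
    for w1, w2 in zip(sentence, sentence[1:]):
        p *= bigram_prob(w1, w2)  # one zero factor makes the whole product zero
    return p

print(sentence_prob(["<START>", "i", "like", "nlp", "<END>"]))   # 0.5: all bigrams seen
print(sentence_prob(["<START>", "i", "like", "cats", "<END>"]))  # 0.0: "cats" is OOV
print(sentence_prob(["<START>", "nlp", "i", "like", "<END>"]))   # 0.0: unseen bigrams of known words
```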
Why is smoothing applied in a language model?
Smoothing is applied for the following reason: some n-grams may appear in the test set but not in the training set. For example, if the training corpus is … and you need to find the probability of a sequence like …, where <START> is the token applied…
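Below is a minimal sketch of one common smoothing technique, add-one (Laplace) smoothing; the toy corpus, vocabulary, and function names are assumptions for illustration, and a real system would also map OOV words to an <UNK> token before applying the formula:

```python
from collections import Counter

# Illustrative corpus; in practice these counts come from the real training data.
corpus = [["<START>", "i", "like", "nlp", "<END>"],
          ["<START>", "i", "like", "dogs", "<END>"]]

bigram_counts = Counter()
context_counts = Counter()
for sentence in corpus:
    for w1, w2 in zip(sentence, sentence[1:]):
        bigram_counts[(w1, w2)] += 1
        context_counts[w1] += 1

vocab = {w for sentence in corpus for w in sentence}
V = len(vocab)  # vocabulary size used in the smoothed denominator

def laplace_bigram_prob(w1, w2):
    # Add-one smoothing: (count(w1, w2) + 1) / (count(w1) + V).
    # Unseen bigrams now get a small non-zero probability instead of 0.0.
    return (bigram_counts[(w1, w2)] + 1) / (context_counts[w1] + V)

print(laplace_bigram_prob("like", "nlp"))   # seen bigram:   (1 + 1) / (2 + 6) = 0.25
print(laplace_bigram_prob("like", "cats"))  # unseen bigram: (0 + 1) / (2 + 6) = 0.125
```

The design choice is a trade-off: add-one smoothing removes zero probabilities at the cost of shifting probability mass away from frequently observed n-grams, which is why more refined schemes (add-k, interpolation, backoff) are often preferred in practice.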