Out-of-vocabulary (OOV) words are words that do not occur in the training set but appear in the test set or in real data. The main problem is that the model assigns probability zero to OOV words, so any sequence containing one gets a likelihood of zero. This is a common problem, especially when you have trained on a small data set. There are several techniques for handling OOV words:
- Typically a special OOV token is added to the language model. Often the first word in the document is treated as the OOV word, to ensure that the OOV token occurs somewhere in the training data and gets a positive probability.
- Smoothing is a common technique in language models: when estimating word probabilities, a constant is added to every count in the numerator, and the denominator is increased accordingly, so that none of the probabilities go to 0. Read this article or this one for more details. This trick can be applied to unigram as well as n-gram models.
- Another common trick, particularly when working with word-embedding-based solutions, is to replace the word with a nearby word from some form of synonym dictionary. Example: ‘I want to know what you are consuming’. Suppose ‘consuming’ is not in the vocabulary; the sentence then becomes ‘I want to know what you are eating’. Take a look at the following article to understand more about this technique.
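
The three techniques above can be sketched together for a unigram model. This is a minimal illustration under stated assumptions, not a reference implementation: the token name `<UNK>`, the helper names `build_vocab`, `map_oov` and `train_unigram`, the `min_count` threshold, and the toy synonym dictionary are all hypothetical. Note that the sketch selects OOV words with a frequency threshold rather than the first-word rule mentioned in the list, and uses add-k smoothing as one simple form of smoothing.

```python
from collections import Counter

UNK = "<UNK>"  # hypothetical name for the special OOV token


def build_vocab(tokens, min_count=1):
    """Keep words seen at least min_count times; always include the UNK token."""
    counts = Counter(tokens)
    return {w for w, c in counts.items() if c >= min_count} | {UNK}


def map_oov(tokens, vocab, synonyms=None):
    """Replace each OOV token with an in-vocabulary synonym if known, else UNK."""
    synonyms = synonyms or {}
    out = []
    for tok in tokens:
        if tok in vocab:
            out.append(tok)
        elif synonyms.get(tok) in vocab:
            out.append(synonyms[tok])
        else:
            out.append(UNK)
    return out


def train_unigram(tokens, vocab, k=1.0):
    """Add-k smoothed unigram probabilities over the closed vocabulary."""
    counts = Counter(map_oov(tokens, vocab))
    total = sum(counts.values()) + k * len(vocab)
    # Every vocabulary entry, including UNK, gets (count + k) / total > 0.
    return {w: (counts[w] + k) / total for w in vocab}


train = "i want to know what you are eating".split()
vocab = build_vocab(train)
probs = train_unigram(train, vocab)

test = "i want to know what you are consuming".split()
synonyms = {"consuming": "eating"}  # toy synonym dictionary (assumption)

print(map_oov(test, vocab))            # 'consuming' falls back to the UNK token
print(map_oov(test, vocab, synonyms))  # 'consuming' is replaced by 'eating'
print(probs[UNK])                      # positive, thanks to smoothing
```

Because the smoothing constant is added for every vocabulary entry, the probabilities still sum to 1, and the OOV token gets a non-zero share even if it never appeared in the training text.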