Data Science interview questions covering Machine Learning, Deep Learning, Natural Language Processing, and more.
Say you have built a language model using Bag of Words (BoW) with one-hot encoding, and your training set contains many sentences with the word “good” but none with the word “great”. Given the sentence “Have a great day”, p(great) = 0.0 under this training set. How can you solve this problem by leveraging the fact that “good” and “great” are similar words?
Bag of Words (BoW) with one-hot encoding doesn’t capture the meaning of words; it only captures occurrence statistics, and every word is treated as equally distant from every other. We need to build the language model using features that are representative of word meaning, such as word embeddings.
A simple solution is to cluster the word embeddings and map synonyms to a single shared token before training. Alternatively, when a word has zero probability at inference time, back off to the probability of its nearest in-vocabulary synonym instead.
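The second idea can be sketched as follows. This is a minimal illustration with hypothetical toy embeddings and made-up unigram probabilities; in practice you would use pre-trained vectors (e.g. GloVe or word2vec) and probabilities estimated from your corpus:

```python
import numpy as np

# Hypothetical toy embeddings (in practice, load pre-trained vectors).
embeddings = {
    "good":  np.array([0.90, 0.80, 0.10]),
    "great": np.array([0.85, 0.82, 0.15]),
    "bad":   np.array([-0.70, -0.60, 0.20]),
    "day":   np.array([0.10, 0.20, 0.90]),
}

# Unigram probabilities from a training set that never saw "great".
unigram_p = {"good": 0.4, "bad": 0.1, "day": 0.5}

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def backoff_prob(word):
    """Return p(word); if unseen, back off to the most similar in-vocabulary word."""
    if unigram_p.get(word, 0.0) > 0.0:
        return unigram_p[word]
    if word not in embeddings:
        return 0.0  # no embedding available, nothing to back off to
    nearest = max(unigram_p, key=lambda w: cosine(embeddings[word], embeddings[w]))
    return unigram_p[nearest]

print(backoff_prob("great"))  # "great" is unseen, so it inherits p("good") = 0.4
```

Note that this simple back-off does not renormalize the distribution; a production system would redistribute probability mass (as in smoothing schemes) rather than copy a synonym’s probability verbatim.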