Say you have built a language model using Bag of Words (BoW) with one-hot encoding, and your training set has a lot of sentences containing the word “good” but none containing the word “great”. For the sentence “Have a great day”, p(great) = 0.0 under this training set. How can you solve this problem by leveraging the fact that “good” and “great” are similar words?

  • Bag of Words (BoW) with one-hot encoding doesn’t capture the meaning of words; it only captures co-occurrence statistics. We need to build the language model using features that are representative of the meaning of the words.
  • A simple solution is to cluster the word embeddings and map each group of synonyms to a single token. Alternatively, when a word has zero probability, fall back to the probability of a synonym instead (a minimal sketch of this fallback appears after this list).
  • A more principled approach is to build the language model on distributed representations, e.g. a probabilistic neural language model (sketched below).
  • Other workarounds for the zero-probability problem involve various kinds of smoothing, e.g. add-one (Laplace) smoothing (see the last sketch below), though these do not leverage the semantic closeness of similar words.
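
As a rough illustration of the synonym-fallback idea, here is a minimal sketch that backs off to the nearest neighbour in embedding space when a word was never seen in training. The tiny `EMBEDDINGS` table and the toy corpus are hypothetical stand-ins for real pre-trained vectors (e.g. word2vec or GloVe) and a real training set.

```python
from collections import Counter
import numpy as np

# Hypothetical 3-d word vectors; "good" and "great" are deliberately close.
# In practice these would come from pre-trained embeddings.
EMBEDDINGS = {
    "have":  np.array([0.10, 0.90, 0.10]),
    "a":     np.array([0.80, 0.10, 0.10]),
    "good":  np.array([0.20, 0.10, 0.90]),
    "great": np.array([0.25, 0.12, 0.88]),
    "day":   np.array([0.50, 0.50, 0.10]),
}

# Toy training corpus: "great" never occurs.
TRAIN_TOKENS = "have a good day have a good day".split()
COUNTS = Counter(TRAIN_TOKENS)
TOTAL = sum(COUNTS.values())

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def unigram_prob(word):
    """MLE unigram probability, backing off to the closest seen word
    in embedding space when the word never occurred in training."""
    if COUNTS[word] > 0:
        return COUNTS[word] / TOTAL
    if word not in EMBEDDINGS:
        return 0.0                       # truly out of vocabulary
    # Nearest seen word by cosine similarity, e.g. great -> good.
    nearest = max(
        (w for w in COUNTS if w in EMBEDDINGS),
        key=lambda w: cosine(EMBEDDINGS[word], EMBEDDINGS[w]),
    )
    return COUNTS[nearest] / TOTAL       # borrow the synonym's probability

print(unigram_prob("good"))    # 0.25 (seen in training)
print(unigram_prob("great"))   # 0.25 (unseen, falls back to p(good))
```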
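
To make the distributed-representation idea concrete, below is a minimal sketch of a Bengio-style probabilistic neural language model. It assumes PyTorch is available and uses a toy vocabulary with two training pairs purely for illustration. The key point is that the softmax output assigns every vocabulary word a nonzero probability, and when the embeddings are trained (or pre-trained) on a large corpus, words like “great” end up near “good” and benefit from that similarity.

```python
import torch
import torch.nn as nn

VOCAB = ["have", "a", "good", "great", "day"]
IDX = {w: i for i, w in enumerate(VOCAB)}

class NeuralLM(nn.Module):
    """Bengio-style model: embed the context words, concatenate,
    pass through a hidden layer, output a softmax over the vocabulary."""
    def __init__(self, vocab_size, emb_dim=16, context=2, hidden=32):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.ff = nn.Sequential(
            nn.Linear(context * emb_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, vocab_size),
        )

    def forward(self, context_ids):          # (batch, context)
        e = self.emb(context_ids)             # (batch, context, emb_dim)
        return self.ff(e.flatten(1))          # logits over the vocabulary

# Toy training pairs: predict the next word from the two previous words.
data = [(("have", "a"), "good"), (("a", "good"), "day")]
X = torch.tensor([[IDX[w] for w in ctx] for ctx, _ in data])
y = torch.tensor([IDX[t] for _, t in data])

model = NeuralLM(len(VOCAB))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()
for _ in range(200):                          # tiny training loop
    opt.zero_grad()
    loss_fn(model(X), y).backward()
    opt.step()

# p(next word | "have a"): a full distribution, never exactly zero.
probs = torch.softmax(model(X[:1]), dim=-1)[0]
print(f"p(good)={probs[IDX['good']].item():.3f} "
      f"p(great)={probs[IDX['great']].item():.3f}")
```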
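
For comparison, here is a minimal sketch of add-one (Laplace) smoothing over an assumed closed vocabulary. Every word gets a nonzero probability, but “great” gains nothing from its similarity to “good”.

```python
from collections import Counter

VOCAB = {"have", "a", "good", "great", "day"}   # assumed closed vocabulary
TRAIN_TOKENS = "have a good day have a good day".split()
COUNTS = Counter(TRAIN_TOKENS)
TOTAL = sum(COUNTS.values())

def laplace_unigram_prob(word, k=1.0):
    """Add-k smoothed unigram probability: (count + k) / (N + k * |V|)."""
    return (COUNTS[word] + k) / (TOTAL + k * len(VOCAB))

print(laplace_unigram_prob("good"))    # (2 + 1) / (8 + 5) ~ 0.231
print(laplace_unigram_prob("great"))   # (0 + 1) / (8 + 5) ~ 0.077
```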
