Say you have built a language model using Bag of Words (BoW) with 1-hot encoding, and your training set has a lot of sentences with the word “good” but none with the word “great”. Now suppose a test sentence is “Have a great day”; then p(great) = 0 under this training set. How can you solve this problem by leveraging the fact that “good” and “great” are similar words?

Bag of Words (BoW) with 1-hot encoding doesn’t capture the meaning of sentences; it only captures co-occurrence statistics. We need to build the language model using features that are representative of the meaning of the words. A simple solution could be to cluster the word embeddings and group synonyms into a unique token, as sketched below. Alternatively, when a…
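A minimal sketch of the synonym-grouping idea, assuming pre-trained gensim word vectors are available (the model name glove-wiki-gigaword-50, the toy vocabulary, and the similarity threshold are illustrative choices, not from the original text):

```python
# Sketch: map an unseen word to its nearest in-vocabulary neighbour using
# pre-trained word embeddings, so that "great" can borrow the statistics
# of "good" instead of getting zero probability.
import gensim.downloader as api

# Illustrative choice of pre-trained vectors; any embedding model would do.
vectors = api.load("glove-wiki-gigaword-50")

train_vocab = {"have", "a", "good", "day"}  # words seen during training

def map_to_known(word, vocab, vectors, min_similarity=0.7):
    """Return the most similar in-vocabulary word, or the word itself."""
    if word in vocab or word not in vectors:
        return word
    best, best_sim = word, min_similarity
    for candidate in vocab:
        if candidate in vectors:
            sim = vectors.similarity(word, candidate)
            if sim > best_sim:
                best, best_sim = candidate, sim
    return best

print(map_to_known("great", train_vocab, vectors))  # expected: "good"
```

With this mapping in place, “great” is treated as the same token as “good”, so the language model assigns it a non-zero probability.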

What are the different ways of representing documents?

Bag of words: commonly called BoW, this representation creates a vocabulary of words and represents each document as a count vector. The number of dimensions equals the vocabulary size, and each dimension holds the number of times a specific word occurred in the document. Sometimes, TF-IDF is used to reduce the dimensionality of the…
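A short sketch of the count-vector and TF-IDF representations using scikit-learn (the example documents are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["have a good day", "have a great day", "good wine and good food"]

# Count vector: one dimension per vocabulary word, value = word count.
bow = CountVectorizer()
counts = bow.fit_transform(docs)
print(bow.get_feature_names_out())   # the learned vocabulary
print(counts.toarray())              # one count vector per document

# TF-IDF re-weights the same dimensions by how informative each word is.
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)
print(weights.toarray().round(2))
```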

Why are bigrams or other n-grams important in NLP (for tasks like sentiment classification or spam detection), and why is it important to find them explicitly?

There are mainly two reasons. First, some pairs of words occur together more often than they occur individually, so it is important to treat such co-occurring words as a single entity or a single token during training. Second, for a named entity recognition problem, tokens such as “United States”, “North America”, “Red Wine” would make sense when…
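A brief sketch of how bigrams can be extracted explicitly as features, here via scikit-learn’s ngram_range option (the sample sentences are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the united states imports red wine",
    "north america exports red wine",
]

# ngram_range=(1, 2) keeps unigrams and adds bigrams such as
# "united states", "north america", and "red wine" as separate features.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
```

Treating such bigrams as single features lets a sentiment or spam classifier learn from the phrase as a whole rather than from its words in isolation.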