Why are bigrams, or n-grams in general, important in NLP tasks such as sentiment classification or spam detection, and why is it worth finding them explicitly?

  • There are two main reasons:
    1. Some pairs of words occur together far more often than their individual frequencies would suggest. It is therefore useful to treat such co-occurring words as a single entity, i.e. a single token, during training. In a named entity recognition task, tokens such as “United States”, “North America”, or “Red Wine” only make sense when recognised as bigrams. n-grams extend the same idea from pairs to longer sequences.
    2. If we use Bag of Words features, single-word features lose the ordering of the sequence. To partially preserve ordering, n-grams can also be used as features in the BoW approach (see the sketch after this list).
  • Note that frequently co-occurring sequences of words (not only bigrams) are called collocations.
  • NLTK provides a collocations() method (on nltk.Text objects) to find frequent bigrams; a short usage sketch is given at the end of this section.
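
The sketch below illustrates point 2 above: adding bigrams alongside unigrams as Bag of Words features. scikit-learn's CountVectorizer and the two toy documents are purely illustrative assumptions; the notes above do not prescribe any particular library.

```python
# Minimal sketch, assuming scikit-learn is installed: build Bag-of-Words
# features that include bigrams as well as unigrams.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "I loved the red wine",
    "the red wine was terrible",
]

# ngram_range=(1, 2) keeps single words and adds bigrams ("red wine",
# "loved the", ...) as extra features, partially preserving the word
# order that plain unigram BoW would lose.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(X.toarray())
```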
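
And a short sketch of the collocations() call mentioned in the last bullet. The Genesis corpus is just a convenient example and must be downloaded first; BigramCollocationFinder is shown as an alternative that returns the bigrams instead of printing them.

```python
import nltk
from nltk.collocations import BigramCollocationFinder

# Example corpus; requires a one-time nltk.download('genesis')
words = nltk.corpus.genesis.words('english-web.txt')

# collocations() is a method on nltk.Text: it prints the most frequent
# bigram collocations it finds in the text.
text = nltk.Text(words)
text.collocations()

# For programmatic access, BigramCollocationFinder returns the bigrams
# instead of printing them; here ranked by the likelihood-ratio measure.
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(words)
print(finder.nbest(bigram_measures.likelihood_ratio, 10))
```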