What are the advantages and disadvantages of using a bag of words feature vector?

The Bag of Words (BoW) model is a common way of representing text data as an input feature vector for an ML model. Each document is encoded as a V-dimensional feature vector, where V is the vocabulary size. Each dimension holds the number of times the corresponding word occurs in the document. Because a typical document uses only a small fraction of the vocabulary, most dimensions are zero.

Given the above definition, here are the advantages and disadvantages of the BoW approach:

Advantages:

  1. Very simple to understand and implement.

Disadvantages:

  1. Bag of words leads to a high-dimensional feature vector because of the large vocabulary size, V.
  2. Bag of words doesn’t leverage co-occurrence statistics between words. In other words, it treats all words as independent of each other.
  3. It leads to highly sparse vectors, since only the dimensions corresponding to words that occur in the document are nonzero.

Example:

Suppose the vocabulary contains the words: { and, cat, dog, jumped, sat, over, ran, the }. You have the following sentence: “The fox jumped over the dog and the dog ran”. The bag of words representation for this toy example is [1 0 2 1 0 1 1 3]. The word “fox” is not in the vocabulary, so it does not contribute to the vector.

Nonzero values:

As the word “and” occurs only once in the sentence, the feature “and” has a value of 1. The word “dog” occurs twice in the sentence, hence a value of 2 for the feature “dog”.

Zero values:

The word “cat” does not occur in the sentence, hence a 0 for the feature “cat”.

In a real dataset, the vocabulary often contains 50K to 100K words, leading to extremely high-dimensional sparse vectors. Dimensionality reduction techniques are typically used to handle bag of words vectors.
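The toy example above can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `bow_vector` is made up for illustration, and tokenization is assumed to be plain lowercasing plus splitting on whitespace, which a real pipeline would likely replace with something more careful.

```python
from collections import Counter

# Vocabulary from the toy example, in the same order as the feature dimensions.
VOCAB = ["and", "cat", "dog", "jumped", "sat", "over", "ran", "the"]

def bow_vector(text, vocab=VOCAB):
    """Return the count of each vocabulary word in `text` (hypothetical helper)."""
    # Assumption: lowercase and split on whitespace as a stand-in for tokenization.
    counts = Counter(text.lower().split())
    # Words outside the vocabulary (such as "fox") are simply ignored;
    # Counter returns 0 for vocabulary words that never appear.
    return [counts[word] for word in vocab]

sentence = "The fox jumped over the dog and the dog ran"
vector = bow_vector(sentence)
print(vector)  # [1, 0, 2, 1, 0, 1, 1, 3]
```

With a vocabulary of 50K to 100K words, a dense list like this would be mostly zeros, which is why sparse representations are usually preferred in practice.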

Where does it fail?

It fails when we want to capture more context (which word appeared after which other word) and not just co-occurrence in the same document. Sometimes a bag of bigrams is used to capture some context, though it is much more expensive: with a vocabulary of size V, there can be up to V² distinct bigrams.
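To make the loss of word order concrete, here is a small sketch comparing unigram and bigram counts for two sentences that use the same words in a different order. The helper names `unigram_counts` and `bigram_counts` and the two sentences are illustrative assumptions, and tokenization is again just lowercasing plus whitespace splitting.

```python
from collections import Counter

def unigram_counts(text):
    """Count single words (the plain bag of words view)."""
    return Counter(text.lower().split())

def bigram_counts(text):
    """Count adjacent word pairs (a bag of bigrams)."""
    tokens = text.lower().split()
    # zip pairs each token with its successor: (t0, t1), (t1, t2), ...
    return Counter(zip(tokens, tokens[1:]))

first = "the dog chased the cat"
second = "the cat chased the dog"

# Same words, same counts: bag of words cannot tell the sentences apart.
same_unigrams = unigram_counts(first) == unigram_counts(second)

# Different adjacent pairs: bag of bigrams can.
same_bigrams = bigram_counts(first) == bigram_counts(second)

print(same_unigrams, same_bigrams)  # True False
```

The extra context comes at a cost: the feature space is now indexed by word pairs rather than single words, so the vectors become even larger and sparser.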

 
