What are the different ways of representing documents?

  1. Bag of words: commonly called BoW, this builds a vocabulary of words and represents each document as a count vector. The number of dimensions equals the vocabulary size, and each dimension holds the number of times that word occurs in the document. TF-IDF weighting is often applied on top, so each dimension reflects how important a word is to the document rather than its raw count; the dimensionality can also be reduced by keeping only the most relevant words. A minimal sketch of this follows the list.
  2. Aggregated word embeddings: look up a word embedding such as word2vec or GloVe for each word in the document, and take the document embedding to be the average of the embeddings of all its words. This works well for short documents; for long documents the averaging washes out the signal. The advantage is that pre-trained embeddings, such as the word2vec vectors trained on the Google News dataset, can be used directly (see the second sketch after the list).
  3. Phrase embeddings and document embeddings: there are many techniques for embedding an entire document. One is to feed the word sequence into an RNN with memory, such as an LSTM network, and take the final hidden state as the representation of the whole sequence. The hidden state accumulates richer and richer context as it moves along the sequence.
  4. Directly use the sequence of words as input to a deep learning model such as an LSTM trained end-to-end for the task, so the document representation is learned implicitly rather than extracted as a separate step. The third sketch after the list covers both this and the previous approach.
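To make the count-vector idea concrete, here is a minimal sketch using scikit-learn's CountVectorizer and TfidfVectorizer (assuming scikit-learn is installed); the two toy documents are illustrative only.

```python
# Minimal BoW sketch; the toy documents are illustrative only.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

# Plain count vectors: one dimension per vocabulary word.
count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs)
print(count_vec.get_feature_names_out())  # the learned vocabulary
print(counts.toarray())                   # raw term counts per document

# TF-IDF weighting: same dimensions, reweighted by word importance.
tfidf_vec = TfidfVectorizer()
weights = tfidf_vec.fit_transform(docs)
print(weights.toarray())
```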
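The averaging approach is just a mean over vectors. Below is a minimal NumPy sketch; the tiny hand-made embedding table is a hypothetical stand-in for real pre-trained vectors (e.g. word2vec trained on Google News), and the 4-dimensional size is purely illustrative.

```python
# Minimal embedding-averaging sketch; the embedding table is a
# hypothetical stand-in for pre-trained vectors (real ones are
# typically 100-300 dimensions).
import numpy as np

embeddings = {
    "good": np.array([0.8, 0.1, 0.0, 0.3]),
    "movie": np.array([0.2, 0.7, 0.1, 0.0]),
    "great": np.array([0.9, 0.2, 0.1, 0.2]),
}

def document_embedding(tokens, embeddings):
    """Average the embeddings of all in-vocabulary tokens."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:  # no known words: fall back to a zero vector
        return np.zeros(next(iter(embeddings.values())).shape)
    return np.mean(vectors, axis=0)

print(document_embedding("a great movie".split(), embeddings))
```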
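Approaches 3 and 4 share the same machinery, so one sketch covers both: a PyTorch LSTM (assuming PyTorch is available) whose final hidden state serves as the document embedding, with a linear classifier on top so the whole model trains end-to-end. The vocabulary size, dimensions, and class count are arbitrary placeholders.

```python
# Minimal PyTorch sketch: the LSTM's final hidden state is the
# document embedding (approach 3); adding a classifier head and
# training the whole thing end-to-end gives approach 4.
import torch
import torch.nn as nn

class LSTMDocClassifier(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=64,
                 hidden_dim=128, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer word indices
        embedded = self.embed(token_ids)
        # h_n is the last hidden state, enriched by the full sequence.
        _, (h_n, _) = self.lstm(embedded)
        doc_embedding = h_n[-1]            # (batch, hidden_dim)
        return self.classifier(doc_embedding)

model = LSTMDocClassifier()
dummy_batch = torch.randint(0, 1000, (2, 10))  # 2 docs, 10 tokens each
print(model(dummy_batch).shape)                # torch.Size([2, 2])
```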
