One of the simplest ways to represent a document as a vector is bag of words (BoW). Although it is simple, it produces high-dimensional, sparse vectors, because there is one dimension per vocabulary word. Some common ways of performing dimensionality reduction in NLP are:
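- TF-IDF: Term frequency-inverse document frequency is a common alternative to bag of words for constructing feature vectors for sentences or documents. It reweights term counts so that words appearing in many documents contribute less. Note that a TF-IDF vector still has one dimension per vocabulary term, so it is as high-dimensional as BoW unless the vocabulary is pruned (for example by dropping very rare or very common terms).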
- Word2Vec / GloVe: Popular dense word embeddings learned from word co-occurrence. Word2Vec trains a shallow neural network to predict a word from its context (or the context from a word), while GloVe fits word vectors to global co-occurrence counts. A document embedding can be obtained by averaging the embeddings of all words in the document.
- ELMo: Deep contextual embeddings. Unlike Word2Vec or GloVe, ELMo produces a different embedding for a word depending on the context it occurs in.
- LSI: Latent semantic indexing applies singular value decomposition (SVD) to the term-document matrix and keeps only the top singular components. It rests on the same principle as Word2Vec, i.e. words used in the same contexts tend to have similar meanings.
- Topic modeling: Techniques such as latent Dirichlet allocation (LDA) discover topics in a document collection. LDA represents each document as a low-dimensional vector of topic strengths.
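
To make the difference in dimensionality concrete, here is a minimal pure-Python sketch that builds BoW and TF-IDF vectors for a toy corpus and compares them with an averaged word-embedding vector. The corpus, the helper names (`bow_vector`, `tfidf_vectors`, `mean_embedding`) and the 3-dimensional `toy_embeddings` table are all made up for illustration; real Word2Vec or GloVe vectors would be loaded from a pretrained model, and the TF-IDF formula used here (`count * log(N / df)`) is only one of several common variants.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, for illustration only).
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

def bow_vector(tokens, vocab):
    """Raw term counts, one dimension per vocabulary word."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

def tfidf_vectors(tokenized, vocab):
    """TF-IDF with idf = log(N / df); same length as the BoW vectors."""
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each word.
    df = Counter(w for doc in tokenized for w in set(doc))
    idf = {w: math.log(n_docs / df[w]) for w in vocab}
    return [
        [count * idf[w] for count, w in zip(bow_vector(doc, vocab), vocab)]
        for doc in tokenized
    ]

# Hypothetical 3-dimensional word vectors standing in for pretrained
# Word2Vec/GloVe embeddings (assumption: real ones are much larger).
toy_embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "mat": [0.1, 0.9, 0.2],
    "log": [0.2, 0.8, 0.3],
}

def mean_embedding(tokens, embeddings, dim=3):
    """Average the vectors of known words; zero vector if none are known."""
    known = [embeddings[w] for w in tokens if w in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(col) / len(known) for col in zip(*known)]

bow = [bow_vector(doc, vocab) for doc in tokenized]
tfidf = tfidf_vectors(tokenized, vocab)
dense = [mean_embedding(doc, toy_embeddings) for doc in tokenized]

# BoW and TF-IDF share the vocabulary-sized dimension; the averaged
# embedding has the fixed embedding dimension regardless of vocabulary.
print(len(vocab), len(bow[0]), len(tfidf[0]), len(dense[0]))
```

The BoW and TF-IDF vectors grow with the vocabulary, whereas the averaged embedding keeps a fixed size no matter how many distinct words the corpus contains. LSI and LDA reduce dimensionality in a different way, by factorizing or modeling the term-document matrix, and are not covered by this sketch.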