Latent Dirichlet Allocation (LDA) is a probabilistic model that models a document as a multinomial mixture of topics and each topic as a multinomial mixture of words. Each of these multinomials has a Dirichlet prior. The goal is to learn these multinomial proportions using probabilistic inference techniques based on the observed data, which is the words/content of the documents.
Let there be $D$ documents, $V$ words in the vocabulary, and let $K$ be the number of topics we want to find. LDA can be defined by the following generative process:
→ For each topic $k$, $\beta_k$ is a topic-specific multinomial distribution over words in the vocabulary. Note that $\beta_{k,w}$ represents the weight given to word $w$ in topic $k$. The multinomial is generated as $\beta_k \sim \text{Dirichlet}(\eta)$.
→ For each document $d$, $\theta_d$ is a document-specific multinomial where each component $\theta_{d,k}$ represents the weight of topic $k$ in document $d$. The multinomial is generated from a Dirichlet distribution as $\theta_d \sim \text{Dirichlet}(\alpha)$.
→ For each word position $n$ in document $d$, a topic $z_{d,n} \sim \text{Multinomial}(\theta_d)$ is generated from the document-specific multinomial. Then, depending on the topic, a word is generated as $w_{d,n} \sim \text{Multinomial}(\beta_{z_{d,n}})$.
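The generative process above can be simulated directly. The following is a minimal sketch using numpy; the sizes ($D$, $V$, $K$, words per document) and hyperparameter values are illustrative choices, not taken from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: D documents, V vocabulary words, K topics,
# N_d words per document (all values are assumptions for the demo).
D, V, K, N_d = 5, 30, 3, 20
alpha, eta = 0.5, 0.1  # symmetric Dirichlet hyperparameters

# Topic-specific word distributions: beta_k ~ Dirichlet(eta)
beta = rng.dirichlet(np.full(V, eta), size=K)  # shape (K, V)

docs = []
for d in range(D):
    # Document-specific topic proportions: theta_d ~ Dirichlet(alpha)
    theta_d = rng.dirichlet(np.full(K, alpha))  # shape (K,)
    words = []
    for n in range(N_d):
        z = rng.choice(K, p=theta_d)   # draw a topic for this position
        w = rng.choice(V, p=beta[z])   # draw a word from that topic
        words.append(w)
    docs.append(words)

print(len(docs), len(docs[0]))  # D documents, N_d word ids each
```

Each document is just a bag of word ids; the latent variables $\theta_d$ and $z_{d,n}$ are what inference must recover.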
Now, the learning happens through inference (typically via Gibbs sampling or variational inference) to estimate the $\theta_d$s and $\beta_k$s from the observed data, i.e., the words in the documents.
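To make the inference step concrete, here is a minimal collapsed Gibbs sampler sketch. It is one possible implementation under simplifying assumptions (symmetric priors, a fixed number of sweeps, no convergence checks); all function and variable names are our own:

```python
import numpy as np

def gibbs_lda(docs, V, K, alpha=0.5, eta=0.1, iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA (illustrative sketch).

    docs: list of lists of word ids in [0, V).
    Returns point estimates of theta (D x K) and beta (K x V).
    """
    rng = np.random.default_rng(seed)
    D = len(docs)
    ndk = np.zeros((D, K))  # doc-topic counts
    nkw = np.zeros((K, V))  # topic-word counts
    nk = np.zeros(K)        # words assigned to each topic
    z = []                  # topic assignment per word token
    for d, doc in enumerate(docs):
        zd = rng.integers(K, size=len(doc))  # random initialization
        z.append(zd)
        for w, t in zip(doc, zd):
            ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                t = z[d][n]
                # Remove the current assignment from the counts.
                ndk[d, t] -= 1; nkw[t, w] -= 1; nk[t] -= 1
                # Full conditional:
                # p(z=k | rest) ∝ (ndk + alpha) * (nkw + eta) / (nk + V*eta)
                p = (ndk[d] + alpha) * (nkw[:, w] + eta) / (nk + V * eta)
                t = rng.choice(K, p=p / p.sum())
                z[d][n] = t
                ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    # Posterior mean estimates of the two sets of multinomials.
    theta = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)
    beta = (nkw + eta) / (nkw + eta).sum(axis=1, keepdims=True)
    return theta, beta
```

The key idea is that the multinomials $\theta_d$ and $\beta_k$ are integrated out analytically, so the sampler only resamples the per-word topic assignments from their full conditional, then reads the multinomials off the final counts.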
The outcome of LDA is one multinomial $\theta_d$ for each document, which is a low-dimensional representation of the document, and one multinomial $\beta_k$ for each topic. One can visualize a topic as the combination of all words whose weight is high in the corresponding multinomial.
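Visualizing a topic by its highest-weight words can be done with a simple argsort over the fitted topic-word matrix. The matrix and vocabulary below are made-up toy values for illustration:

```python
import numpy as np

# Hypothetical fitted topic-word matrix (K=2 topics, V=5 words)
# and a toy vocabulary; values are illustrative only.
beta = np.array([[0.40, 0.30, 0.10, 0.10, 0.10],
                 [0.05, 0.05, 0.50, 0.30, 0.10]])
vocab = ["game", "team", "gene", "dna", "cell"]

top_n = 2
for k, topic in enumerate(beta):
    top = np.argsort(topic)[::-1][:top_n]  # indices of top-weight words
    print(f"topic {k}: " + ", ".join(vocab[w] for w in top))
# → topic 0: game, team
# → topic 1: gene, dna
```

Reading off the top words this way is how topic lists like "sports" or "genetics" are typically presented.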