Explain Latent Dirichlet Allocation – where is it typically used?

Latent Dirichlet Allocation (LDA) is a probabilistic model that represents each document as a multinomial mixture over topics and each topic as a multinomial distribution over words. Each of these multinomials has a Dirichlet prior. The goal is to learn these multinomial proportions using probabilistic inference techniques based on the observed data, which is the words in the documents.

Let there be M documents, V words in the vocabulary, and K topics we want to find. LDA can be defined by the following generative process:

→ For each topic k, \phi_k = (\phi_{k1}, \ldots, \phi_{kV}) is a topic-specific multinomial distribution over the V words in the vocabulary. Note that \phi_{kv} represents the weight given to word v in topic k. The multinomial \phi_k is generated as \phi_k \sim Dir(\beta).

→ For each document j, \theta_j = (\theta_{j1}, \ldots, \theta_{jK}) is a document-specific multinomial where each component \theta_{jk} represents the weight of topic k in document j. The multinomial \theta_j is generated from a Dirichlet distribution as \theta_j \sim Dir(\alpha).

→ For each word i in document j, a topic z_{ji} \sim \theta_j is drawn from the document-specific multinomial \theta_j. Then, conditioned on that topic, the word is drawn as w_{ji} \sim \phi_{z_{ji}}. (A small simulation of the full process is sketched below.)
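To make the three steps concrete, here is a minimal sketch in Python/NumPy that samples a tiny synthetic corpus from this generative process. The corpus sizes, the fixed document length N, and the hyperparameter values are arbitrary choices for illustration, not part of the model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumed for this sketch).
M, V, K = 5, 50, 3        # documents, vocabulary size, topics
N = 20                    # words per document (fixed here for simplicity)
alpha = np.full(K, 0.1)   # Dirichlet prior on document-topic proportions
beta = np.full(V, 0.01)   # Dirichlet prior on topic-word proportions

# Step 1: draw a word distribution phi_k ~ Dir(beta) for each topic k.
phi = rng.dirichlet(beta, size=K)      # shape (K, V)

docs = []
for j in range(M):
    # Step 2: draw topic proportions theta_j ~ Dir(alpha) for document j.
    theta = rng.dirichlet(alpha)       # shape (K,)
    words = []
    for i in range(N):
        # Step 3: draw a topic z_ji from theta_j, then a word w_ji
        # from the chosen topic's word distribution phi_{z_ji}.
        z = rng.choice(K, p=theta)
        w = rng.choice(V, p=phi[z])
        words.append(w)
    docs.append(words)

print(docs[0])  # word ids of the first synthetic document
```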

Now, the learning happens through inference (typically Gibbs sampling or variational inference) to estimate the \theta_j's and \phi_k's based on the observed data, i.e., the words in the documents.
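As one hedged illustration of such inference, scikit-learn's LatentDirichletAllocation (which uses variational inference) recovers exactly these two sets of multinomials from a document-term count matrix. The toy count matrix X below is randomly generated purely for demonstration:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Toy document-term count matrix of shape (M, V); in practice this would
# come from a vectorizer such as sklearn's CountVectorizer.
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(20, 100))

lda = LatentDirichletAllocation(
    n_components=3,          # number of topics K
    doc_topic_prior=0.1,     # symmetric Dirichlet prior alpha
    topic_word_prior=0.01,   # symmetric Dirichlet prior beta
    random_state=0,
)

theta = lda.fit_transform(X)   # (M, K): per-document topic proportions theta_j
phi = lda.components_          # (K, V): unnormalized topic-word weights
phi = phi / phi.sum(axis=1, keepdims=True)   # normalize rows to get phi_k
```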

The outcome of LDA is one multinomial for each document, which is a low-dimensional representation of that document, and one multinomial for each topic. One can visualize a topic as the combination of words whose weight is high in the corresponding multinomial. This is where LDA is typically used: discovering topics in large text corpora and producing compact document representations for tasks such as information retrieval, document classification, and recommendation.
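For example, continuing the scikit-learn sketch above (reusing its `np` and normalized `phi`), one can list each topic's highest-weight words. The `vocab` list here is a placeholder; with real text it would come from the vectorizer, e.g. `CountVectorizer.get_feature_names_out()`:

```python
vocab = [f"word_{v}" for v in range(phi.shape[1])]  # placeholder vocabulary
for k, phi_k in enumerate(phi):
    top = np.argsort(phi_k)[::-1][:10]   # indices of the 10 heaviest words
    print(f"topic {k}:", ", ".join(vocab[v] for v in top))
```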
