What is negative sampling when training the skip-gram model?

Recap: The skip-gram model represents each word in a large text as a lower-dimensional vector in a space of K dimensions, such that similar words end up closer to each other. This is achieved by training a feed-forward network to predict the context words given a specific word, i.e.,

    \[p(w_{i-h},\dots,w_{i+h} \mid w_i) = \prod_{-h \le k \le h,\, k \ne 0} p(w_{i+k} \mid w_i)\]

modelled as

    \[p(u \mid v) = \frac{\exp(\langle \phi_u, \theta_v \rangle)}{\sum_{u' \in W} \exp(\langle \phi_{u'}, \theta_v \rangle)}\]

where u is a context word, v is the given word, \(\phi_u\) and \(\theta_v\) are their respective embedding vectors, and W is the vocabulary.

Why is it slow: In this architecture, a softmax function (the expression on the RHS above) is used to predict each context word. Its denominator sums over the entire vocabulary W, so computing even one probability takes time proportional to the vocabulary size, which makes training very slow for large vocabularies.
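To make the cost concrete, here is a minimal NumPy sketch of the full-softmax computation. The matrices phi and theta, the sizes V and K, and the word indices are illustrative stand-ins, not the original word2vec implementation:

    import numpy as np

    rng = np.random.default_rng(0)

    V, K = 50_000, 100                          # vocabulary size |W| and embedding dimension
    phi = rng.normal(scale=0.1, size=(V, K))    # context-word embeddings phi_u
    theta = rng.normal(scale=0.1, size=(V, K))  # given-word embeddings theta_v

    def softmax_prob(u, v):
        """p(u|v) via the full softmax: one dot product per vocabulary word."""
        scores = phi @ theta[v]                 # <phi_{u'}, theta_v> for every u' in W
        scores -= scores.max()                  # stabilise exp numerically
        exp_scores = np.exp(scores)
        return exp_scores[u] / exp_scores.sum() # O(V * K) work for a single probability

    print(softmax_prob(42, 7))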

Resolution:

  • The objective function is reconstructed to treat the problem as a classification problem in which each pair of a word and a context word forms a single training example. A given word paired with one of its actual context words is a positive example; the same word paired with a non-context word is a negative example.
  • While the positive examples are limited to the pairs actually observed in the text, almost every other word in the vocabulary yields a negative example. Hence, a small, randomly sampled set of negative examples is drawn for each word when constructing the objective function, as in the sketch below.
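A minimal sketch of the resulting per-pair loss, reusing the illustrative names from above. Note that the original word2vec draws negatives from a smoothed unigram distribution (unigram frequency raised to the power 0.75); this sketch samples them uniformly for simplicity:

    import numpy as np

    rng = np.random.default_rng(0)

    V, K, NEG = 50_000, 100, 5                  # vocabulary size, dimension, negatives per pair
    phi = rng.normal(scale=0.1, size=(V, K))    # context-word embeddings
    theta = rng.normal(scale=0.1, size=(V, K))  # given-word embeddings

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sgns_loss(u, v):
        """Negative log-likelihood of one positive (u, v) pair plus NEG sampled negatives."""
        negatives = rng.integers(0, V, size=NEG)      # uniform here; word2vec uses unigram^0.75
        pos = np.log(sigmoid(phi[u] @ theta[v]))      # label 1: an observed context word
        neg = np.log(sigmoid(-(phi[negatives] @ theta[v]))).sum()  # label 0: sampled non-context words
        return -(pos + neg)                           # O(NEG * K) instead of O(V * K)

    print(sgns_loss(42, 7))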

This algorithm/model is called Skip-Gram with Negative Sampling (SGNS).
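In practice you rarely need to implement SGNS by hand. As one example, the gensim library (assuming gensim 4.x) exposes it directly, where sg=1 selects the skip-gram architecture and negative sets the number of negative samples per positive pair:

    from gensim.models import Word2Vec

    sentences = [["the", "quick", "brown", "fox", "jumps"],
                 ["the", "lazy", "dog", "sleeps"]]

    model = Word2Vec(sentences, vector_size=50, window=2,
                     sg=1, negative=5, min_count=1)
    print(model.wv["fox"][:5])                  # first 5 entries of the learned embedding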
