Word2Vec is a feed-forward neural network model for learning word embeddings. The Skip-gram variant is modelled as predicting the context given a specific word: each word in the corpus is fed as input, passed through a hidden layer (the embedding layer), and from there the network predicts the surrounding context words. Once trained, the embedding for a particular word is obtained by feeding that word as input and taking the hidden-layer activation as its embedding vector.
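A tiny NumPy sketch (with assumed vocabulary and embedding sizes) makes the last point concrete: because the input is one-hot, the hidden-layer activation is simply the corresponding row of the input weight matrix, which is exactly the word's embedding.

```python
import numpy as np

# Toy illustration: vocabulary of 5 words, 3-dimensional embedding layer.
rng = np.random.default_rng(0)
vocab_size, embed_dim = 5, 3
W = rng.standard_normal((vocab_size, embed_dim))  # input -> hidden weights

word_idx = 2
one_hot = np.zeros(vocab_size)
one_hot[word_idx] = 1.0

hidden = one_hot @ W       # forward pass to the hidden layer
embedding = W[word_idx]    # equivalent lookup: row word_idx of W

assert np.allclose(hidden, embedding)
```

This is why, in practice, the "embedding layer" is implemented as a table lookup rather than a matrix multiplication.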
GloVe (Global Vectors) is based on matrix factorization of the word-context matrix. It first constructs a large (words x contexts) co-occurrence matrix: for each word (the rows), we count how frequently it appears in each context (the columns) across a large corpus. The number of possible contexts is, of course, large, since it is essentially combinatorial in size. This matrix is then factorized to yield a lower-dimensional (words x features) matrix, where each row is the vector representation of a word. In general, this is done by minimizing a reconstruction loss, which seeks the lower-dimensional representations that explain most of the variance in the high-dimensional data.
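A minimal sketch of the count-then-factorize idea (not GloVe itself): build a word-by-context co-occurrence matrix from a toy corpus with a symmetric window of size 1, then factorize it with truncated SVD to get low-dimensional word vectors.

```python
import numpy as np

# Toy corpus; the vocabulary and window size are assumptions for illustration.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
vocab = sorted({w for sent in corpus for w in sent})
index = {w: i for i, w in enumerate(vocab)}

# Co-occurrence counts with a symmetric window of size 1.
X = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in (i - 1, i + 1):
            if 0 <= j < len(sent):
                X[index[w], index[sent[j]]] += 1

# Truncated SVD: keep the top k singular values to get a
# lower-dimensional (words x features) matrix.
k = 2
U, S, Vt = np.linalg.svd(X)
word_vectors = U[:, :k] * S[:k]   # one row per word
```

GloVe's actual objective differs from plain SVD, but the shape of the computation is the same: counts in, low-rank word vectors out.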
The loss function is given by

$$J = \sum_{i,j=1}^{V} f(X_{ij})\left(w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^2$$

As you can see, this is a squared loss, weighted by the function $f(X_{ij})$ shown in the figure below. This ensures that very frequent words, such as stop-words, do not get too much weight. $X_{ij}$ is the number of times words $i$ and $j$ occur together. $w_i$ and $\tilde{w}_j$ are the word and context vectors from the matrix factorization task, and $b_i$ and $\tilde{b}_j$ are the bias terms. Note that instead of $\log X_{ij}$, one can also use the PMI between $i$ and $j$.
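In code, the weighting function $f$ (with the paper's hyperparameters $x_{max} = 100$ and $\alpha = 0.75$) and the weighted squared loss for a single word-context pair look like this:

```python
import numpy as np

# GloVe weighting function f, with the paper's hyperparameters
# x_max = 100 and alpha = 0.75. It grows sublinearly up to x_max
# and is capped at 1 so frequent pairs are not over-weighted.
def f(x, x_max=100.0, alpha=0.75):
    return (x / x_max) ** alpha if x < x_max else 1.0

# Weighted squared loss for one (word i, context j) pair:
# w_i, w_j are the factor vectors, b_i and b_j the bias terms,
# x_ij the co-occurrence count.
def pair_loss(w_i, w_j, b_i, b_j, x_ij):
    return f(x_ij) * (w_i @ w_j + b_i + b_j - np.log(x_ij)) ** 2
```

The full objective sums `pair_loss` over all word-context pairs with a nonzero count, so the $\log 0$ case never arises.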
The original GloVe paper (Pennington, Socher, and Manning, 2014) can be found here.
The word vectors, in an abstract way, represent different facets of a word's meaning. Some notable properties are:
- Such word vectors are good at answering analogy questions. The relationship between two words is captured by the vector offset (difference) between them.
- We can also use element-wise addition of vectors to ask questions such as 'German + airlines': the sum of the two vectors lies close to the vectors of German carriers such as 'Lufthansa'.
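The same vector arithmetic can be sketched on hand-crafted toy vectors (an assumption for illustration: one axis encodes "royalty", the other "gender", so the famous king - man + woman analogy holds by construction):

```python
import numpy as np

# Hand-crafted 2-D toy vectors, not real learned embeddings.
vectors = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),
}

def nearest(v, exclude=()):
    # Rank the vocabulary by cosine similarity to v,
    # excluding the query words as is standard practice.
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in vectors if w not in exclude),
               key=lambda w: cos(vectors[w], v))

query = vectors["king"] - vectors["man"] + vectors["woman"]
print(nearest(query, exclude={"king", "man", "woman"}))  # prints "queen"
```

With real embeddings the arithmetic is the same; only the vectors come from training rather than by construction.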