
The Magic Behind Embedding Models

source link: https://towardsdatascience.com/the-magic-behind-embedding-models-c3af62f71fb?gi=1c2fd6558a9c

Exploring the implementation of Word2vec and GloVe


Image by Natalia Y — Unsplash

What are Embeddings?

Embeddings are a type of knowledge representation in which each textual unit is represented by a vector (think of it as a list of numbers for now). A textual unit could be a word, a node in a graph, or a relation between two nodes in a knowledge graph. These vectors go by different names, such as space vectors, latent vectors, or embedding vectors. Together they form a multidimensional feature space on which machine learning methods can be applied. Therefore, we need to shift how we think about language: from a sequence of words to points that occupy a high-dimensional semantic space, where points can be close together or far apart.

Why do we need Embeddings?

The purpose of this representation is for words with similar meanings (semantically related words) to have similar representations and to lie close to each other when plotted in a space. Why is that important? Well, for many reasons, mainly:

  1. Computers do not understand text or the relations between words, so we need a way to represent words with numbers, which is what computers do understand.
  2. Embeddings can be used in many applications such as question answering systems, recommendation systems, sentiment analysis, and text classification; they also make it easier to search and to return synonyms. Let us take a simple example to understand how embeddings help with all of that.


Image Source: (Embeddings: Translating to a Lower-Dimensional Space) by Google.

Simple Embeddings Example

For the sake of simplicity, let us start with this example: consider the words “king”, “queen”, “man”, and “woman”, represented by the vectors [9, 8, 7], [5, 6, 4], [5, 5, 5], and [1, 3, 2] respectively. Figure 1 depicts these vector representations. Notice that the word “king” and the word “man” are semantically related in that both represent a male human. However, the word “king” has an extra feature, which is royalty. Similarly, the word “queen” is similar to “woman” but also carries that extra royalty feature.

Since the relation between “king” and “queen” (male royalty - female royalty) is similar to the relation between “man” and “woman” (male human - female human), subtracting them from each other gives us this famous equation: (king - queen = man - woman). By the way, when we subtract two words from each other, we subtract their vectors.

The magic behind the embeddings

Suppose we do not know the female counterpart of “king”; how can we get it? Well, since we know that (king - queen = man - woman), we can rearrange the formula to (queen = king - man + woman), which makes sense. The formula says: remove the male gender from “king” (royalty is the remainder), then add the female gender to that royalty, and we get what we are looking for, which is “queen”.
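
As a quick sanity check, here is a minimal sketch (in Python with NumPy, using the toy vectors from the example above) of this arithmetic. With real embeddings the result only lands near the target vector, so in practice the closest vector is retrieved with a similarity measure.

```python
import numpy as np

# Toy embedding vectors from the example above
king  = np.array([9, 8, 7])
queen = np.array([5, 6, 4])
man   = np.array([5, 5, 5])
woman = np.array([1, 3, 2])

# king - man + woman should land on (or near) the vector for queen
result = king - man + woman
print(result)                          # [5 6 4]
print(np.array_equal(result, queen))   # True
```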

Image taken from Kawin Ethayarajh, “Why does King - Man + Woman = Queen? Understanding Word Analogies”

Now we know embeddings can be helpful in question answering systems. Other examples work similarly: (USA - English = France - French), (Germany - Berlin = France - Paris). Moreover, embeddings are also helpful in simple recommendation tasks. For example, if someone likes “orange”, we look for the vectors most similar to the vector that represents “orange” and get the vectors for “apple”, “cherry”, and “banana”. As we can see, the better the representation (list of numbers) we get for each word, the better the accuracy of our recommendation system. So the remaining question is: how do we come up with this list of numbers (called the embedding, latent, or space vector) for each word?
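
As a small illustration of the recommendation idea, here is a sketch that ranks words by cosine similarity; the vectors below are invented purely for this example.

```python
import numpy as np

# Made-up toy embeddings, purely for illustration
embeddings = {
    "orange": np.array([0.9, 0.8, 0.1]),
    "apple":  np.array([0.8, 0.9, 0.2]),
    "cherry": np.array([0.7, 0.7, 0.3]),
    "banana": np.array([0.9, 0.6, 0.2]),
    "car":    np.array([0.1, 0.2, 0.9]),
}

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def most_similar(word, k=3):
    query = embeddings[word]
    scores = {other: cosine_similarity(query, vec)
              for other, vec in embeddings.items() if other != word}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]

print(most_similar("orange"))   # the fruits rank well above "car"
```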

Embedding categories

There are three main categories and we will discuss them one by one:

  1. Word Embeddings (Word2vec, GloVe, FastText, …)
  2. Graph Embeddings (DeepWalk, LINE, Node2vec, GEMSEC, …)
  3. Knowledge Graph Embeddings (RESCAL and its extensions, TransE and its extensions, …).

Word2vec

Word2vec is one of the earliest embedding models, and it is designed mainly to embed words rather than sentences or documents. Moreover, the dimensionality of Word2vec vectors is not tied to the number of words in the training data, since the model reduces the dimensions to a fixed, chosen size (50, 100, 300, etc.). Word2vec falls under prediction-based embeddings, which learn by predicting a word in a given context. Word2vec comes in two flavors: the Continuous Bag Of Words (CBOW) model and the Skip-Gram model. CBOW predicts the probability of a word given its context, whereas Skip-Gram uses the opposite architecture to CBOW (it predicts the context given a single word).
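
For reference, training both flavors takes only a few lines with the gensim library. This is a minimal sketch assuming gensim 4.x, where the sg flag switches between CBOW and Skip-Gram.

```python
from gensim.models import Word2Vec

# A toy corpus: a list of tokenized sentences
sentences = [
    ["i", "like", "driving", "fast", "cars"],
    ["i", "like", "fast", "cars"],
]

# sg=0 selects CBOW, sg=1 selects Skip-Gram
cbow      = Word2Vec(sentences, vector_size=100, window=1, min_count=1, sg=0)
skip_gram = Word2Vec(sentences, vector_size=100, window=1, min_count=1, sg=1)

print(cbow.wv["driving"])                  # the learned 100-dimensional vector
print(skip_gram.wv.most_similar("cars"))   # nearest neighbors by cosine similarity
```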

CBOW

We start by specifying a context window size, which marks the beginning and end of each context. Then we build the One Hot Encoding vector for each word. Given the corpus “I like driving fast cars”, a window size of 1 (one word before and one word after the target word), and a vector dimension of 3, we want to predict the middle word “driving” from the context “like ……. fast”. Notice that we have only one hidden layer, whose size equals the required vector dimension; because there is only this one hidden layer, the technique amounts to simply learning the vector representations. Below is the architecture. Note that the inputs are the words inside the context window and the output is the learned representation of the target word. Also note that no activation function is applied to the hidden layer; the output layer, however, uses Softmax.

[Figure: CBOW architecture with one-hot context words as input, a single hidden layer, and a Softmax output layer]
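
To make the shapes concrete, here is an illustrative sketch of a single CBOW forward pass for this example, written in NumPy with randomly initialized (untrained) weights.

```python
import numpy as np

vocab = ["i", "like", "driving", "fast", "cars"]
word_to_idx = {w: i for i, w in enumerate(vocab)}
V, N = len(vocab), 3            # vocabulary size and embedding dimension

def one_hot(word):
    v = np.zeros(V)
    v[word_to_idx[word]] = 1.0
    return v

# Randomly initialized weights; training would adjust these
W_in  = np.random.rand(V, N)    # input-to-hidden weights: one N-dimensional row per word
W_out = np.random.rand(N, V)    # hidden-to-output weights

# CBOW averages the context word vectors; no activation on the hidden layer
context = ["like", "fast"]      # window of 1 around the target word "driving"
hidden  = np.mean([one_hot(w) @ W_in for w in context], axis=0)

# The output layer applies Softmax to give a probability for every vocabulary word
scores = hidden @ W_out
probs  = np.exp(scores) / np.sum(np.exp(scores))
print(probs)    # after training, "driving" should get the highest probability
```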

What we keep from the previous neural network after training is the following weight matrix:

[Figure: the learned weight matrix, one row per vocabulary word]

Once we have the weight matrix, we multiply it by the One Hot Encoding vector of the target word to get that word’s representation vector.

[Figure: multiplying the weight matrix by a One Hot Encoding vector]

You may ask why we are multiplying the weight matrix by a vector filled with zeros except for a single 1. Of course, the output is just the row at that position in the matrix; consider the following example:

[Figure: a worked example of the one-hot lookup]

Well, the real purpose of this multiplication is just to look up the target word’s vector based on the position of the 1 in its One Hot Encoding vector.
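
Here is a tiny sketch, with made-up numbers, showing that this multiplication is nothing more than a row lookup.

```python
import numpy as np

# A toy 5x3 weight matrix: one 3-dimensional vector per vocabulary word
W = np.array([[0.1, 0.2, 0.3],
              [0.4, 0.5, 0.6],
              [0.7, 0.8, 0.9],
              [1.0, 1.1, 1.2],
              [1.3, 1.4, 1.5]])

# One Hot Encoding vector for the third word in the vocabulary
one_hot = np.array([0, 0, 1, 0, 0])

# The multiplication simply selects the matching row of W
print(one_hot @ W)   # [0.7 0.8 0.9]
print(W[2])          # identical: the "multiplication" is just row selection
```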

Skip-Gram

The Skip-Gram model (sometimes called the Skip-N-gram model) uses the CBOW architecture flipped on its head; everything else stays the same. Below is the Skip-Gram architecture, where we try to predict all the words within the window given a single input word:

[Figure: Skip-Gram architecture]
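
For intuition, here is a short sketch (plain Python, written for this article) that generates the (input word, context word) training pairs a Skip-Gram model learns from.

```python
def skip_gram_pairs(tokens, window=1):
    """Generate (input_word, context_word) training pairs for Skip-Gram."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

print(skip_gram_pairs(["i", "like", "driving", "fast", "cars"], window=1))
# [('i', 'like'), ('like', 'i'), ('like', 'driving'), ('driving', 'like'), ...]
```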

GloVe: Global Vectors for Word Representation

GloVe is a word embedding model that is trained on the counts in a co-occurrence matrix. It uses these corpus-wide statistics, minimizing a least-squares error, in order to obtain the word vector space.

Co-occurrence matrix

Given a corpus with a vocabulary of V words, our co-occurrence matrix X will be of size V x V, where each row i corresponds to a unique word in the corpus and each entry X_ij denotes the number of times word j occurred within the window around word i. Given the sentence “the dog ran after the man” and a window size of 1, we get the following matrix:

[Figure: the co-occurrence matrix for “the dog ran after the man” with window size 1]

Notice how the matrix is symmetric.
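
Here is a short sketch (plain Python plus NumPy, written for this article) that builds the co-occurrence matrix for this sentence and, looking ahead to the next formula, the probabilities P_ij derived from it.

```python
import numpy as np

tokens = ["the", "dog", "ran", "after", "the", "man"]
vocab = sorted(set(tokens))
idx = {w: i for i, w in enumerate(vocab)}
V, window = len(vocab), 1

# Build the co-occurrence matrix X: X[i, j] counts word j inside word i's window
X = np.zeros((V, V))
for pos, word in enumerate(tokens):
    for ctx in range(max(0, pos - window), min(len(tokens), pos + window + 1)):
        if ctx != pos:
            X[idx[word], idx[tokens[ctx]]] += 1

print(vocab)
print(X)                       # symmetric, as noted above
print(np.allclose(X, X.T))     # True

# P_ij = X_ij / X_i : the probability of word j appearing in the context of word i
P = X / X.sum(axis=1, keepdims=True)
print(P[idx["the"], idx["dog"]])   # 1/3: "dog" is one of three context words of "the"
```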

Let us start with this simple formula:

P_ij = X_ij / X_i

where P_ij is the probability of word j appearing in the context of word i, X_ij is the number of times j appeared in the context of i, and X_i (the sum of row i) is the total number of words that appeared in the context of i.

Moreover, we need a function F that takes the embedding vectors of the words i, j and a context word k (word vectors are written w and context vectors w~) and relates them to these co-occurrence statistics. The main goal of GloVe is to build meaningful embeddings through simple arithmetic operations, so the input to F is made the difference between the vectors of i and j:

F(w_i - w_j, w~_k) = P_ik / P_jk

With that being said, we still have an issue with the previous formula: the arguments on the left-hand side are vectors, whereas the right-hand side is just a scalar. To fix this mathematically, we take the dot product between the transpose of (w_i - w_j) and w~_k to get the following:

F((w_i - w_j)^T w~_k) = P_ik / P_jk

Since the distinction between a word and a context word is arbitrary in our co-occurrence matrix (the two roles are exchangeable), we require F to be a homomorphism, which lets us replace the ratio of probabilities with the corresponding ratio of F values:

F((w_i - w_j)^T w~_k) = F(w_i^T w~_k) / F(w_j^T w~_k)

Solving the equation gives:

F(w_i^T w~_k) = P_ik = X_ik / X_i

One question remains: what could the function F be? Let us say it is the exp() function. Then, taking the logarithm of both sides, we solve for:

w_i^T w~_k = log(P_ik) = log(X_ik) - log(X_i)

Move log(X_i) to the left-hand side:

w_i^T w~_k + log(X_i) = log(X_ik)

Since log(X_i) does not depend on k, we substitute it with a bias term b_i (and add a symmetric bias b~_k for the context word) to get:

w_i^T w~_k + b_i + b~_k = log(X_ik)

Finally, GloVe obtains its embeddings by minimizing this relation as a weighted least-squares regression loss:

J = Σ_{i,j} f(X_ij) (w_i^T w~_j + b_i + b~_j - log(X_ij))^2

where f is a weighting function that down-weights very rare and very frequent co-occurrences.
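
As a rough illustration rather than a reference implementation, here is a sketch of evaluating that loss in NumPy, using the weighting function f(x) = (x / x_max)^0.75 (capped at 1) proposed in the GloVe paper; the vectors and biases here are random stand-ins for trained parameters.

```python
import numpy as np

def glove_loss(X, W, W_tilde, b, b_tilde, x_max=100, alpha=0.75):
    """Weighted least-squares GloVe loss over all nonzero co-occurrence counts."""
    loss = 0.0
    for i, j in zip(*np.nonzero(X)):
        weight = min((X[i, j] / x_max) ** alpha, 1.0)   # f(X_ij)
        error = W[i] @ W_tilde[j] + b[i] + b_tilde[j] - np.log(X[i, j])
        loss += weight * error ** 2
    return loss

# Toy example: random parameters for a 5-word vocabulary and 3-dimensional vectors
V, d = 5, 3
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(V, V)).astype(float)   # a fake co-occurrence matrix
W, W_tilde = rng.normal(size=(V, d)), rng.normal(size=(V, d))
b, b_tilde = rng.normal(size=V), rng.normal(size=V)

print(glove_loss(X, W, W_tilde, b, b_tilde))
```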
