
The Transformer: A Quick Run Through



Explore the best of natural language modeling enabled by the Transformer. Understand its architecture and internal working.

This is Part 3 of the 5-part series on language modeling.

Seq2seq task of machine translation solved using a Transformer-like architecture (BERT) (translate.google.com)

Introduction

In the previous post, we looked at how ELMo and ULMFiT boosted the prominence of language model pre-training in the community. This blog assumes that you have read through the previous two parts of this series and thus builds upon that knowledge.

English input being translated to German output using the Transformer model (Mandar Deshpande)

The Transformer is widely seen as the model that finally removed the limitations that recurrence placed on sequence-model training. The idea of stacking encoders and decoders, which had gained traction in language modeling and machine translation, proved to be a valuable lesson in building this architecture. The Transformer is a simple network architecture based solely on the attention mechanism, dispensing with recurrence and convolutions entirely. It has been shown to generalize well to other language understanding and modeling tasks, with both large and limited training data. It also achieved state-of-the-art results on the English-to-German translation task and anchored itself as the go-to architecture for future advances in model pre-training in NLP.

Encoder-Decoder Architecture


The stack of 6 encoders and 6 decoders used in the Transformer (Mandar Deshpande)

In this model, multiple encoders are stacked on top of each other, and the decoders are similarly stacked together. In pre-Transformer seq2seq models, each encoder/decoder comprises recurrent (and sometimes convolutional) layers, and the hidden representation from each encoder stage is passed ahead to be used by the next layer. Most seq2seq tasks can be solved with such a stack of encoders and decoders, which processes each word in the input sequence in order.

Attention Mechanism

The attention mechanism has become an integral part of sequence modeling and transduction models for various tasks, as it allows dependencies to be modeled without regard to their distance in the input or output sequences. Put simply, the attention mechanism helps us tackle long-range dependency issues in neural networks without relying on recurrent neural networks (RNNs). It serves the same purpose as the hidden state shared across all time steps in an RNN, but does so within an encoder-decoder architecture. The attention model focuses on the relevant part of the input text sequence or image for the task being solved.

In a regular RNN-based seq2seq model, context is passed as the final hidden state produced by the encoder, and the decoder uses it to produce the next token of the translation or text.


Regular seq2seq models without an attention mechanism use only the last hidden state as the context vector (Mandar Deshpande)

Steps involved in generating the Context Vector:

  1. Initialize the context vector with a size chosen for the task (e.g. 128, 256, or 512)
  2. Process one token from the input sequence through the encoder
  3. Use the encoder's hidden state representation to update the context vector
  4. Keep repeating Steps 2 and 3 until the entire input sequence is processed

Once the context vector has been fully updated, it is passed to the decoder as an additional input alongside the word/token being translated. The context vector is a useful abstraction, except that it acts as a bottleneck: the entire meaning of the input sequence has to be squeezed into this single vector.
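As a rough illustration (not from the original article), the toy NumPy encoder below walks through these steps: a hidden state of fixed size is updated once per input token, and its final value serves as the context vector. The function name simple_rnn_encoder and the random weights are purely illustrative stand-ins for a trained model.

```python
import numpy as np

def simple_rnn_encoder(inputs, hidden_size=128, seed=0):
    """Toy RNN encoder: the final hidden state acts as the context vector.
    All weights are random here; a real model would learn them."""
    rng = np.random.default_rng(seed)
    embed_dim = inputs.shape[1]
    W_xh = rng.normal(scale=0.1, size=(embed_dim, hidden_size))
    W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
    h = np.zeros(hidden_size)                 # Step 1: initialize the context vector
    for x_t in inputs:                        # Step 2: process one token at a time
        h = np.tanh(x_t @ W_xh + h @ W_hh)    # Step 3: update it with the hidden state
    return h                                  # Step 4: final state = context vector

# Usage: 5 tokens, each a 64-dimensional embedding
tokens = np.random.default_rng(1).normal(size=(5, 64))
context = simple_rnn_encoder(tokens)
print(context.shape)   # (128,)
```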

Instead of passing a single context vector to the decoder, the attention mechanism passes all the intermediate hidden states within a stack of encoders to the decoder. This enables the decoder to focus on different parts of the input sequence according to the relevance of the current word/token being processed.

Unlike the previous seq2seq models, attention models perform 2 extra steps:

  1. More data is passed from the encoder to the decoder: all of its hidden states, not just the last one
  2. The decoder in an attention model uses this additional data to focus on particular words from the input sequence: it scores each encoder hidden state, turns the scores into softmax weights, and uses the weighted combination of hidden states as the context vector (a minimal sketch follows the figure below)


Attention Mechanism used to create the context vector passed to the decoder (Mandar Deshpande)
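The following sketch, using plain NumPy and a hypothetical attention_context helper, shows one common way this is realized: each encoder hidden state is scored against the current decoder state (dot-product scoring here), the scores become softmax weights, and the weighted sum of hidden states is the context vector for that decoding step.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(decoder_state, encoder_states):
    """Score every encoder hidden state against the current decoder state,
    turn the scores into softmax weights, and return the weighted sum as
    the context vector for this decoding step."""
    scores = encoder_states @ decoder_state          # one score per input token
    weights = softmax(scores)                        # attention distribution
    return weights @ encoder_states, weights         # weighted-sum context

# Usage: 5 encoder hidden states and a decoder state, all 128-dimensional
rng = np.random.default_rng(0)
enc = rng.normal(size=(5, 128))
dec = rng.normal(size=128)
context, weights = attention_context(dec, enc)
print(context.shape, weights.round(2))
```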

Peek Inside the Transformer

The Transformer stacks 6 encoders and 6 decoders to form the main architecture of the model. This number can vary with the use case, but 6 was used in the original paper.

Let us consider a single encoder and decoder to simplify our understanding of how they work.


Components inside the Encoder and Decoder in the Transformer (Mandar Deshpande)

Architecture

Each encoder consists of a self-attention layer followed by a feed-forward network. In conventional attention mechanisms, hidden states from previous time steps or layers are used to compute attention. Self-attention instead computes attention among the token representations within the same layer. To elucidate, self-attention can be thought of as a mechanism for coreference resolution within a sentence:

“The man was eating his meal while he was thinking about his family”

In the above sentence, the model needs to build an understanding of what "he" refers to, namely that it is a coreference to "the man". This is enabled by the self-attention mechanism in the Transformer. A detailed discussion of self-attention (using multiple heads) is beyond the scope of this blog and can be found in the original paper.
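For intuition, here is a minimal single-head scaled dot-product self-attention sketch in NumPy. The projection matrices W_q, W_k, W_v would normally be learned; here they are random placeholders, and multi-head attention, masking, and dropout are omitted.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention: every token in X
    attends to every other token in the same sentence."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # token-to-token affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
    return weights @ V                                    # contextualized token vectors

# Usage: 10 tokens with 64-dim embeddings, projected to 64-dim queries/keys/values
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 64))
W_q, W_k, W_v = (rng.normal(scale=0.1, size=(64, 64)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)   # (10, 64)
```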

The decoder has the same two layers as the encoder, except that an additional encoder-decoder attention layer is introduced in between to help the model extract relevant features from the encoder's attention outputs.


Simplified view of 2 encoders stacked with 2 decoders to explore the internal architecture (Mandar Deshpande)

Point-wise Feed-Forward Networks

It is important to notice that the words in the input sequence interact with each other in the self-attention layer, but each word then flows through the feed-forward network independently of the others (the same feed-forward weights are applied to every position). The output of the feed-forward network is passed on to the next encoder in the stack, which builds on the context learned by the previous encoders.
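A rough NumPy sketch of this position-wise feed-forward network is shown below, using the dimensions from the original paper (d_model = 512 expanded to d_ff = 2048); the random weights are stand-ins for learned parameters.

```python
import numpy as np

def position_wise_ffn(X, W1, b1, W2, b2):
    """Position-wise feed-forward network: the same two-layer MLP (with ReLU)
    is applied to every token vector independently."""
    return np.maximum(0, X @ W1 + b1) @ W2 + b2

# Usage: d_model=512 expanded to d_ff=2048 and projected back
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 512))                      # 10 token vectors
W1, b1 = rng.normal(scale=0.02, size=(512, 2048)), np.zeros(2048)
W2, b2 = rng.normal(scale=0.02, size=(2048, 512)), np.zeros(512)
print(position_wise_ffn(X, W1, b1, W2, b2).shape)   # (10, 512)
```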

Positional Encoding

To embed a sense of order in the input sequence, each word embedding is combined with a positional encoding (in the original paper the two are added element-wise). This augmented word embedding is passed as input to Encoder 1. Since the model doesn't use any recurrence or convolution, the positional encodings inject information about the relative position of tokens in the input sentence.
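Below is a small NumPy sketch of the sinusoidal positional encoding used in the original paper, added element-wise to the word embeddings before they enter Encoder 1.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings from 'Attention Is All You Need':
    PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(...)."""
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]           # even feature indices
    angles = positions / np.power(10000, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Usage: add the encoding to the word embeddings before Encoder 1
embeddings = np.random.default_rng(0).normal(size=(10, 512))
encoder_input = embeddings + sinusoidal_positional_encoding(10, 512)
print(encoder_input.shape)   # (10, 512)
```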

Residual Connections with Normalization

The output of the self-attention layer is added to its original input (the word embedding) through a residual connection and then layer-normalized. A similar scheme is followed around the feed-forward layer.
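A compact sketch of this add-and-normalize step, assuming a simplified layer normalization without the learned gain and bias parameters:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Layer normalization over the feature dimension (gain/bias omitted)."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(sublayer_input, sublayer_output):
    """Residual connection followed by layer normalization,
    i.e. LayerNorm(x + Sublayer(x)), wrapped around each sub-layer."""
    return layer_norm(sublayer_input + sublayer_output)

# Usage: wrap the self-attention output around its input
x = np.random.default_rng(0).normal(size=(10, 512))
attn_out = np.random.default_rng(1).normal(size=(10, 512))
print(add_and_norm(x, attn_out).shape)   # (10, 512)
```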

Fully Connected Linear with Softmax

Once the final decoder in the stack emits its output vector, it needs to be converted into the translated word. Since all the required information is already embedded as floats in this output vector, we just need to convert it into a probability distribution over the possible next words in the translation.

The fully connected linear layer converts the float vector into vocabulary-sized scores, which are transformed into probability values using the softmax function. The index with the highest softmax value is chosen, and the corresponding word is retrieved from the output vocabulary learned from the training set.
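A minimal sketch of this final projection, using a tiny hypothetical vocabulary and random weights in place of the learned projection matrix:

```python
import numpy as np

def project_to_vocab(decoder_output, W_vocab, id_to_word):
    """Final linear projection to vocabulary-sized logits, softmax to get a
    probability distribution, then greedy pick of the most likely word."""
    logits = decoder_output @ W_vocab                 # one score per vocabulary word
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                              # softmax probabilities
    return id_to_word[int(np.argmax(probs))], probs

# Usage with a tiny made-up vocabulary
rng = np.random.default_rng(0)
vocab = ["ich", "bin", "ein", "Student", "<eos>"]
W_vocab = rng.normal(scale=0.1, size=(512, len(vocab)))
decoder_vec = rng.normal(size=512)
word, probs = project_to_vocab(decoder_vec, W_vocab, vocab)
print(word, probs.round(3))
```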

Transformer Training Explained

The training is supervised, i.e. it uses a labeled training dataset that serves as a benchmark for comparing and correcting the output word probabilities.

Essentially, each word in the translated output vocabulary is converted into a one-hot vector that is 1 only at the index of that word and 0 everywhere else. Once we receive the softmax output vector of normalized probability values, we can compare it with the one-hot vector to improve the model parameters/weights.

These two vectors can be compared using measures such as cosine similarity, cross-entropy, and/or Kullback-Leibler divergence. At the beginning of the training process, the output probability distribution is far from the ground-truth one-hot vector. As training proceeds and the weights are optimized, the output word probabilities come to track the ground-truth vectors closely.
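As an illustration, the snippet below computes the cross-entropy between a one-hot target and two hypothetical softmax outputs, one early in training and one after the weights have been optimized; the numbers are made up purely for demonstration.

```python
import numpy as np

def cross_entropy(one_hot_target, predicted_probs, eps=1e-12):
    """Cross-entropy between the ground-truth one-hot vector and the model's
    softmax output; lower is better, zero means a perfect prediction."""
    return -np.sum(one_hot_target * np.log(predicted_probs + eps))

# Usage: target word is index 3 in a 5-word vocabulary
target = np.zeros(5)
target[3] = 1.0
early_prediction = np.array([0.20, 0.25, 0.25, 0.10, 0.20])   # early in training
late_prediction  = np.array([0.02, 0.03, 0.05, 0.88, 0.02])   # after optimization
print(cross_entropy(target, early_prediction))   # ~2.30 (high loss)
print(cross_entropy(target, late_prediction))    # ~0.13 (low loss)
```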

