
Language Modeling II: ULMFiT and ELMo

This is Part 2 of a 4-part series on language modeling.

[Figure: Language models actively used in search engines]

Introduction

In the previous post, we covered the concept of language modeling and how it differs from regular pre-trained embeddings like word2vec and GloVe.

On our journey towards REALM (Retrieval-Augmented Language Model Pre-Training), we will briefly walk through these seminal works on language models:

  1. ELMo: Embeddings from Language Models
  2. ULMFiT: Universal Language Model Fine-Tuning

ELMo: Embeddings from Language Models (2018)

Pre-trained word embeddings like word2vec and GloVe are a crucial element in many neural language understanding models. If we stick to GloVe embeddings for our language modeling task, the word ‘major’ would have the same representation regardless of the context in which it appears. Yet context plays a major role in how humans perceive what a given word means.

E.g. ‘major: an army officer of high rank’ and ‘major: important, serious, or significant’ would be assigned the same GloVe vector for the word ‘major’, even though the two senses differ.
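To make the limitation concrete, here is a minimal sketch (with made-up vectors, not real GloVe weights) showing that a static lookup table returns the identical vector for ‘major’ in both senses:

```python
import numpy as np

# Toy static embedding table; the vectors are made up, not real GloVe weights.
static_embeddings = {
    "major": np.array([0.12, -0.48, 0.33]),
    "army":  np.array([0.91,  0.05, -0.27]),
    "issue": np.array([-0.14, 0.62,  0.08]),
}

def embed(sentence):
    # A static lookup ignores context entirely.
    return [static_embeddings[w] for w in sentence if w in static_embeddings]

rank_vec = embed(["the", "army", "major", "saluted"])[-1]       # 'major' as a rank
importance_vec = embed(["a", "major", "issue", "remains"])[0]   # 'major' as 'important'

print(np.allclose(rank_vec, importance_vec))  # True: one vector for both senses
```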

The task of creating such high-quality representations is hard. To make it concrete, any word representation should model:

  1. Syntax and Semantics: complex characteristics of word use
  2. Polysemy: the coexistence of many possible meanings for a word or phrase across linguistic contexts

ELMo introduces deep contextualized word representations that tackle the requirements defined above while still being easy to integrate into existing models. This achieved state-of-the-art results on a range of demanding language understanding problems such as question answering, named entity recognition, coreference resolution, and natural language inference (SNLI).

Contextualized word embeddings

Representations that capture both a word’s meaning and the information available in its context are referred to as contextual embeddings. Unlike word2vec or GloVe, which use a static word representation, ELMo uses a bi-directional LSTM, trained with a language modeling objective, to look at the whole sentence before encoding a word.

Much like we observed in the [previous article](insert link), ELMo’s LSTM is trained on an enormous text dataset (in the same language as our downstream task). Once this pre-training is done, we can reuse the distilled word representations as a building block for other NLP tasks.
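As an illustration of the idea (not ELMo’s actual architecture or sizes), the toy bidirectional LSTM below produces a different vector for ‘major’ in each sentence, because every output depends on the whole sequence:

```python
import torch
import torch.nn as nn

# Toy bidirectional LSTM encoder; the vocabulary and sizes are illustrative, not ELMo's.
torch.manual_seed(0)
vocab = {"the": 0, "army": 1, "major": 2, "saluted": 3, "a": 4, "issue": 5, "remains": 6}
embed = nn.Embedding(len(vocab), 16)
bilstm = nn.LSTM(16, 32, bidirectional=True, batch_first=True)

def contextual(sentence):
    ids = torch.tensor([[vocab[w] for w in sentence]])
    out, _ = bilstm(embed(ids))      # (1, seq_len, 64): forward and backward states
    return out[0]

rank_vec = contextual(["the", "army", "major", "saluted"])[2]     # 'major' as a rank
importance_vec = contextual(["a", "major", "issue", "remains"])[1]  # 'major' as 'important'

print(torch.allclose(rank_vec, importance_vec))  # False: the vector now depends on context
```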

[Figure: Unrolled Forward Language Model used in ELMo (Mandar Deshpande)]

How do we train the model on this huge dataset?

We simply train the model to predict the next word given a sequence of words, i.e. language modeling itself. We can do this easily because the raw text itself provides the training signal; no explicit labels are needed, unlike in other supervised learning tasks.
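A minimal sketch of this self-supervised objective, using toy sizes and random token ids rather than ELMo’s actual configuration, might look like this:

```python
import torch
import torch.nn as nn

# Minimal forward language model; sizes and data are placeholders, not ELMo's.
vocab_size, embed_dim, hidden_dim = 10_000, 128, 256

embedding = nn.Embedding(vocab_size, embed_dim)
lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
to_vocab = nn.Linear(hidden_dim, vocab_size)
loss_fn = nn.CrossEntropyLoss()

# A batch of token ids; the "labels" are just the same sequence shifted by one.
tokens = torch.randint(0, vocab_size, (4, 20))      # (batch, seq_len)
inputs, targets = tokens[:, :-1], tokens[:, 1:]

hidden_states, _ = lstm(embedding(inputs))          # (batch, seq_len - 1, hidden_dim)
logits = to_vocab(hidden_states)                    # next-token prediction at every position
loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()                                     # no hand-written labels were ever needed
```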

ELMo Architecture

ELMo consists of one forward and one backward language model, so its hidden states have access to both the next and the previous words. Each hidden layer is a bidirectional LSTM, which lets the language model view hidden states from either direction. The figure above shows how this LSTM has access to the other hidden states.

[Figure: Hidden Layer Concatenation and Summation for the kth token-specific embedding in ELMo (Mandar Deshpande)]

Once the forward and backward language models have been trained, ELMo concatenates their hidden states, layer by layer, into a single embedding per layer. Each such concatenation is then multiplied by a weight that depends on the task being solved.

As shown above, ELMo then sums these weighted, concatenated embeddings and assigns the result to the particular token being processed from the input text. In other words, ELMo represents a token t_k as a linear combination of the corresponding hidden layers (including its token embedding). This means that each token in the input text gets its own contextual embedding from ELMo.
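The combination step can be sketched as follows; the shapes, the random weights, and the use of torch.einsum are illustrative assumptions rather than ELMo’s exact implementation:

```python
import torch

# Sketch of ELMo's per-token combination; the shapes below are illustrative.
num_layers, seq_len, dim = 3, 12, 512   # layer 0 = token embedding, layers 1-2 = biLSTM

# h[j, k] holds the concatenated forward/backward state of layer j for token k.
h = torch.randn(num_layers, seq_len, 2 * dim)

# Task-specific parameters: softmax-normalised layer weights s_j and a scalar gamma.
s = torch.softmax(torch.randn(num_layers), dim=0)
gamma = torch.tensor(1.0)

# ELMo_k = gamma * sum_j s_j * h_{k,j}  -> one contextual vector per token.
elmo = gamma * torch.einsum("j,jkd->kd", s, h)   # (seq_len, 2 * dim)
```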

[Figure: Integrating ELMo into other NLP tasks by concatenation to embeddings (Mandar Deshpande)]

Once ELMo’s biLM (bi-directional language model) has been trained on a huge text corpus, it can be integrated into almost any neural NLP task by simply concatenating its output to the embedding layer.
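A minimal sketch of this integration, with made-up dimensions and random tensors standing in for the real embeddings:

```python
import torch

# Sketch of plugging ELMo into a downstream model: the (frozen) ELMo vector is
# concatenated with the task's own token embedding. Sizes are made up.
seq_len, task_dim, elmo_dim = 12, 100, 1024

task_embeddings = torch.randn(seq_len, task_dim)   # e.g. GloVe or trainable embeddings
elmo_embeddings = torch.randn(seq_len, elmo_dim)   # output of the weighted combination above

enhanced = torch.cat([task_embeddings, elmo_embeddings], dim=-1)  # (seq_len, 1124)
# 'enhanced' then feeds the task model (tagger, QA reader, etc.) unchanged.
```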

The higher layers seem to learn semantics, while the lower layers capture more syntactic features. Additionally, ELMo-enhanced models can make use of small datasets more efficiently.

You can read more about ELMo here.

ULMFiT (2018)

Before ULMFiT, inductive transfer learning was widely used in computer vision, but existing approaches in NLP still required task-specific modifications and training from scratch. ULMFiT proposed an effective transfer learning method that can be applied to any NLP task and demonstrated techniques that are key to fine-tuning a language model.

Instead of random initialization of model parameters, we can reap the benefits of pre-training and speed up the learning process.

ULMFiT uses regular LSTM units in a 3-layer architecture, taking a cue from AWD-LSTM.

The three stages of ULMFiT are:

  1. General-Domain LM Pre-Training: the language model is trained on a general-domain corpus to capture general features of language in its different layers
  2. Target Task Discriminative Fine-Tuning: the trained language model is fine-tuned on the target task dataset using discriminative fine-tuning and a slanted triangular learning rate schedule to learn task-specific features
  3. Target Task Classifier Fine-Tuning: the classifier is fine-tuned on the target task using gradual unfreezing while repeating Stage 2. This helps the network preserve low-level representations while adapting the high-level ones.

[Figure: Three Stages of ULMFiT (Mandar Deshpande)]

As we can see above, Stage 1 uses the same learning rate across all layers, whereas Stages 2 and 3 use layer-wise slanted triangular learning rate schedules. Also note how the layer weights gradually approach their optimal values across the three-stage process (the darker color represents values closer to optimal, for illustration purposes).

Discriminative fine-tuning (the learning schedule used in Stages 2 and 3, together with the slanted triangular learning rate) is a major contribution of this paper. It draws on the intuition that different layers in a model capture different types of features, so it makes sense to give each of them its own learning rate. As in computer vision, the initial layers of a language model capture the most general information about the language and hence, once pre-trained, require the least fine-tuning.
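A rough sketch of both ideas, using a toy three-layer stack rather than the actual AWD-LSTM, might look like this (the 2.6 decay factor and the STLR constants follow the paper; everything else is illustrative):

```python
import torch
import torch.nn as nn

# Toy three-layer stand-in for the ULMFiT LSTM stack (purely illustrative).
model = nn.ModuleList([nn.Linear(32, 32) for _ in range(3)])  # layer1, layer2, layer3

# Discriminative fine-tuning: the layer closest to the output keeps the base
# learning rate; each earlier layer is decayed by a factor (the paper uses 2.6).
base_lr, decay = 1e-3, 2.6
param_groups = [
    {"params": layer.parameters(), "lr": base_lr / (decay ** depth)}
    for depth, layer in enumerate(reversed(list(model)))
]
optimizer = torch.optim.SGD(param_groups, lr=base_lr)

def slanted_triangular_lr(step, total_steps, cut_frac=0.1, ratio=32, lr_max=1e-3):
    """ULMFiT's STLR: a short linear warm-up followed by a long linear decay."""
    cut = int(total_steps * cut_frac)
    p = step / cut if step < cut else 1 - (step - cut) / (total_steps - cut)
    return lr_max * (1 + p * (ratio - 1)) / ratio
```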

After Stage 2, the model is already very close to the optimal weights for the specified task, which makes target task classifier fine-tuning very sensitive: if fine-tuning changes the weights significantly at this stage, all the benefits of pre-training would be lost. To address this issue, the paper proposes gradual unfreezing (sketched after the list below):

  • To start, the last LSTM layer is unfrozen and the model is fine-tuned for just one epoch
  • Next, the layer before the last is unfrozen and fine-tuned
  • A similar process is repeated for each layer until convergence
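A minimal sketch of gradual unfreezing, again with a toy stack standing in for the pre-trained layers:

```python
import torch.nn as nn

# Gradual unfreezing sketch: everything starts frozen, then one layer per epoch
# is unfrozen from the top down, fine-tuning only the unfrozen layers each time.
layers = nn.ModuleList([nn.Linear(32, 32) for _ in range(3)])  # stand-in for the LSTM stack

for p in layers.parameters():
    p.requires_grad = False          # freeze the whole pre-trained stack first

for epoch, layer in enumerate(reversed(list(layers)), start=1):
    for p in layer.parameters():
        p.requires_grad = True       # unfreeze the next layer, starting from the last one
    # ... run one epoch of fine-tuning here on the unfrozen parameters ...
    trainable = sum(p.requires_grad for p in layers.parameters())
    print(f"epoch {epoch}: {trainable} parameter tensors trainable")
```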

You can read the paper here.

Hopefully, this blog helped you build a basic understanding of the exciting field of pre-trained language models!

In the next blog, we will discuss Transformers and BERT for learning fine-tunable pre-trained models.

