
Battle of the Transformers: ELECTRA, BERT, RoBERTa, or XLNet


One of the “secrets” behind the success of Transformer models is the technique of Transfer Learning. In Transfer Learning, a model (in our case, a Transformer model) is pre-trained on a gigantic dataset using an unsupervised pre-training objective. The same model is then fine-tuned (typically with supervised training) on the actual task at hand. The beauty of this approach is that the fine-tuning dataset can be as small as 500–1000 training samples! A number small enough to be potentially scoffed out of the room if one were to call it Deep Learning. This also means that the expensive and time-consuming part of the pipeline, pre-training, only needs to be done once, and the pre-trained model can be reused for any number of tasks thereafter. Since pre-trained models are typically made publicly available, we can grab the relevant model, fine-tune it on a custom dataset, and have a state-of-the-art model ready to go in a few hours!
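
To make this concrete, here is a minimal sketch of that fine-tuning step with Simple Transformers, using a placeholder two-row dataset and the Hugging Face ELECTRA-small discriminator checkpoint standing in for “the relevant model” (the exact identifier is my assumption):

    import pandas as pd
    from simpletransformers.classification import ClassificationModel

    # A toy fine-tuning dataset: a few labelled examples stand in for the
    # 500-1000 samples mentioned above.
    train_df = pd.DataFrame(
        [["What a fantastic experience", 1], ["Never going back there", 0]],
        columns=["text", "labels"],
    )

    # Grab a publicly available pre-trained model and fine-tune it.
    model = ClassificationModel("electra", "google/electra-small-discriminator")
    model.train_model(train_df)

    # The fine-tuned model can now make predictions on new text.
    predictions, raw_outputs = model.predict(["This was well worth the money"])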

If you are interested in learning how pre-training works and how you can train a brand new language model on a single GPU, check out my article linked below!

ELECTRA is one of the latest pre-trained Transformer models released by Google, and it switches things up a bit compared to most other releases. For the most part, Transformer models have followed the well-trodden path of Deep Learning, with larger models, more training, and bigger datasets equating to better performance. ELECTRA, however, bucks this trend by outperforming earlier models like BERT while using less computational power, smaller datasets, and less training time. (In case you are wondering, ELECTRA is the same “size” as BERT.)

In this article, we’ll look at how to use a pre-trained ELECTRA model for text classification and we’ll compare it to other standard models along the way. Specifically, we’ll be comparing the final performance (Matthews correlation coefficient, MCC) and the training times for each of the models listed below.

  • electra-small
  • electra-base
  • bert-base-cased
  • distilbert-base-cased
  • distilroberta-base
  • roberta-base
  • xlnet-base-cased
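
As a quick refresher, MCC (the metric Simple Transformers reports by default for binary classification) can be computed with scikit-learn; a value of +1 is perfect prediction, 0 is no better than chance, and -1 is total disagreement:

    from sklearn.metrics import matthews_corrcoef

    # Toy example: two mistakes out of six predictions.
    y_true = [0, 0, 1, 1, 1, 0]
    y_pred = [0, 1, 1, 1, 0, 0]
    print(matthews_corrcoef(y_true, y_pred))  # ~0.33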

As always, we’ll be doing this with the Simple Transformers library (based on the Hugging Face Transformers library) and we’ll be using Weights & Biases for visualizations.

You can find all the code used here in the examples directory of the library.

Installation

  1. Install Anaconda or Miniconda Package Manager from here.
  2. Create a new virtual environment and install packages.
    conda create -n simpletransformers python pandas tqdm
    conda activate simpletransformers
    conda install pytorch cudatoolkit=10.1 -c pytorch
  3. Install Apex if you are using fp16 training. Please follow the instructions here.
  4. Install simpletransformers.
    pip install simpletransformers

Data Preparation

We’ll be using the Yelp Review Polarity dataset, which is a binary classification dataset. The script below will download it and store it in the data directory. Alternatively, you can manually download the data from FastAI.
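
A minimal sketch of such a download script (the FastAI S3 URL and the data/ layout are assumptions and may need adjusting):

    import tarfile
    import urllib.request
    from pathlib import Path

    # Assumed location of the FastAI-hosted copy of the dataset.
    URL = "https://s3.amazonaws.com/fast-ai-nlp/yelp_review_polarity_csv.tgz"

    data_dir = Path("data")
    data_dir.mkdir(exist_ok=True)
    archive_path = data_dir / "yelp_review_polarity_csv.tgz"

    # Download the archive and extract train.csv / test.csv into data/.
    urllib.request.urlretrieve(URL, archive_path)
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(data_dir)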

Hyperparameters

Once the data is in the data directory, we can start training our models.

Simple Transformers models can be configured extensively (see the docs), but we’ll just be going with some basic, “good enough” hyperparameter settings. This is because we are more interested in comparing the models to each other on an equal footing, rather than trying to optimize for the absolute best hyperparameters for each model.

With that in mind, we’ll increase the train_batch_size to 128 and the num_train_epochs to 3 so that all models have enough training to converge.

One caveat here is that the train_batch_size is reduced to 64 for XLNet as it cannot be trained on an RTX Titan GPU with train_batch_size=128. However, any effect of this discrepancy is minimized by setting gradient_accumulation_steps to 2, which keeps the effective batch size at 128 (gradients are accumulated over two steps, and the model weights are updated only once for every two steps).

All other settings which affect training are unchanged from their defaults.
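
Concretely, in Simple Transformers these settings amount to an args dictionary along these lines (a sketch, using the dict-style args the library accepts):

    # Hyperparameters shared by all models except XLNet.
    model_args = {
        "train_batch_size": 128,
        "gradient_accumulation_steps": 1,
        "num_train_epochs": 3,
    }

    # XLNet override: a smaller per-step batch, with gradient accumulation
    # keeping the effective batch size at 128.
    xlnet_args = {**model_args, "train_batch_size": 64, "gradient_accumulation_steps": 2}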

Training the Models

Setting up the training process is quite simple. We just need the data loaded into DataFrames and the hyperparameters defined, and we are off to the races!

For convenience, I’m using the same script to train all models as we only need to change the model names between each run. The model names are supplied by a shell script which also automatically runs the training script for each model.

The training script is given below:
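
The original script is in the examples directory of the library (mentioned above); here is a minimal sketch of what it does, assuming the CSVs were extracted to data/yelp_review_polarity_csv/ and that the model type and name are passed in as command-line arguments:

    import sys

    import pandas as pd
    from simpletransformers.classification import ClassificationModel

    # Model type and name come from the bash script shown further below,
    # e.g. python train.py electra google/electra-base-discriminator
    model_type, model_name = sys.argv[1], sys.argv[2]

    # The extracted CSVs have no header row; column 0 is the label, column 1 is the text.
    train_df = pd.read_csv("data/yelp_review_polarity_csv/train.csv", header=None)
    eval_df = pd.read_csv("data/yelp_review_polarity_csv/test.csv", header=None)

    # Remap the original labels [1, 2] to [0, 1] (0 = negative, 1 = positive)
    # and use the column names Simple Transformers expects.
    train_df = pd.DataFrame({"text": train_df[1], "labels": train_df[0] - 1})
    eval_df = pd.DataFrame({"text": eval_df[1], "labels": eval_df[0] - 1})

    # Hyperparameters from the previous section; for xlnet-base-cased,
    # use train_batch_size 64 with gradient_accumulation_steps 2 instead.
    model_args = {
        "train_batch_size": 128,
        "num_train_epochs": 3,
        "evaluate_during_training": True,  # assumption: log eval scores to W&B while training
        "wandb_project": "transformer-comparison",  # placeholder project name
    }

    model = ClassificationModel(model_type, model_name, args=model_args)
    model.train_model(train_df, eval_df=eval_df)

    # Final evaluation; for binary classification the result dict includes the MCC.
    result, model_outputs, wrong_predictions = model.eval_model(eval_df)
    print(result)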

Note that the Yelp Reviews Polarity dataset uses the labels [1, 2] for negative and positive, respectively. I’m changing this to [0, 1] for negative and positive, respectively. Simple Transformers requires the labels to start from 0 (duh!), and a label of 0 for negative sentiment is a lot more intuitive (in my opinion).

The bash script which can automate the entire process:
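
A sketch of such a script, assuming the training script above is saved as train.py (the exact Hugging Face identifiers for the ELECTRA checkpoints are my assumption):

    #!/bin/bash
    # Run the training script once for each model. Each entry is "model_type model_name".
    declare -a models=(
        "electra google/electra-small-discriminator"
        "electra google/electra-base-discriminator"
        "bert bert-base-cased"
        "distilbert distilbert-base-cased"
        "roberta distilroberta-base"
        "roberta roberta-base"
        "xlnet xlnet-base-cased"
    )

    for pair in "${models[@]}"; do
        python train.py $pair   # word-splits into model_type and model_name
        # rm -r outputs         # optional: free disk space between runs (see note below)
    done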

Note that you can remove the saved models at each stage by adding rm -r outputs to the bash script. This might be a good idea if you don’t have much disk space to spare.

The training script will also log the evaluation scores to Weights & Biases, letting us compare models easily.

For more information on training classification models, check out the Simple Transformers docs.

Results

You can find all my results here. Try playing around with the different graphs and information available!

Let’s go through the important results.

Final Scores

These are the final MCC scores obtained by each model. As you can see, the scores are quite close to each other for all the models.

To get a better view of the differences, the chart below zooms into the X-axis and shows only the range 0.88–0.94.

Note that a zoomed-in view, while helpful for spotting differences, can distort the perception of the results. Therefore, the chart below is for illustrative purposes only. Beware the graph that hides its zeros!

The roberta-base model leads the pack with xlnet-base close behind. The distilroberta-base and electra-base models follow next, with barely anything between them. Honestly, the difference between those two is probably due more to random chance than anything else in this case. Bringing up the rear, we have bert-base-cased, distilbert-base-cased, and electra-small, in that order.

Looking at the actual values shows how close they are.

In this experiment, RoBERTa seems to outperform the other models. However, I’m willing to bet that with some tricks like hyperparameter tuning and ensembling, the ELECTRA model is capable of making up the difference. This is supported by the current GLUE benchmark leaderboard, where ELECTRA sits above RoBERTa.

It is important to keep in mind that the ELECTRA model required substantially less pre-training compute (about a quarter) compared to RoBERTa. This is true for distilroberta-base as well; even though the distilroberta-base model is comparatively smaller, you need the original roberta-base model before you can distil it into distilroberta-base.

The XLNet model nearly keeps pace with the RoBERTa model, but it requires far more computational resources than all the other models shown here (see the training time graph).

The venerable (although less than two years old) BERT model is starting to show its age and is outperformed by all but the electra-small model.

The electra-small model, although not quite matching the standards of the other models, still performs admirably. As might be expected, it trains the fastest, has the smallest memory requirements and is the fastest at inference.

Speaking of training times…

The speed of training is determined mostly by the size (number of parameters) of the model, except in the case of XLNet. The training algorithm used with XLNet makes it significantly slower than the comparable BERT, RoBERTa, and ELECTRA models, despite having roughly the same number of parameters. The GPU memory requirement for XLNet is also higher than for the other models tested here, necessitating the smaller training batch size noted earlier (64 compared to 128 for the other models).

The inference times (not tested here) should also follow this general trend.

Finally, another important consideration is how quickly each of the models converges. All these models were trained for 3 full epochs without using early stopping.

Evidently, there is no discernible difference between the models with regard to how many training steps are required for convergence. All the models seem to be converging around 9000 training steps. Of course, the time taken to converge would vary due to the difference in training speed.

Conclusion

It’s a tough call to choose between different Transformer models. However, we can still gain some valuable insights from the experiment we’ve seen.

  • ELECTRA models can be outperformed by older models depending on the situation, but their strength lies in reaching competitive performance levels with significantly less computational resources used for pre-training.
  • The ELECTRA paper indicates that the electra-small model significantly outperforms a similar-sized BERT model.
  • Distilled versions of Transformer models sacrifice a few accuracy points for the sake of quicker training and inference. This may be a desirable exchange in some situations.
  • XLNet sacrifices speed of training and inference in exchange for potentially better performance on complex tasks.

Based on these insights, I can offer a recommendation (although it should be taken with a grain of salt as results may vary between different datasets): for a good balance between final performance, training speed, and resource requirements, distilroberta-base is a strong default choice.

It would be interesting to see if the large models also follow this trend. I hope to test this out in a future article (where T5 might also be thrown into the mix)!

If you would like to see some more in-depth analysis regarding the training and inference speeds of different models, check out my earlier article (sadly, no ELECTRA) linked below.

