Pretrained models — transformers 4.12.5 documentation

Pretrained models

Here is a partial list of the available pretrained models, together with a short presentation of each model.

For the full list, refer to https://huggingface.co/models.
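
Any model id in this list (or on the model hub) can be passed directly to from_pretrained(), which downloads and caches the weights and vocabulary on first use. A minimal sketch, assuming PyTorch is installed, using the first checkpoint below:

    from transformers import AutoModel, AutoTokenizer

    model_id = "bert-base-uncased"  # any model id from the list below works here

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)

    inputs = tokenizer("Hello, world!", return_tensors="pt")
    outputs = model(**inputs)
    print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)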

Each entry below gives the model id and the details of the model, grouped by architecture.

BERT

bert-base-uncased

12-layer, 768-hidden, 12-heads, 110M parameters.
Trained on lower-cased English text.

bert-large-uncased

24-layer, 1024-hidden, 16-heads, 336M parameters.
Trained on lower-cased English text.

bert-base-cased

12-layer, 768-hidden, 12-heads, 109M parameters.
Trained on cased English text.

bert-large-cased

24-layer, 1024-hidden, 16-heads, 335M parameters.
Trained on cased English text.

bert-base-multilingual-uncased

(Original, not recommended) 12-layer, 768-hidden, 12-heads, 168M parameters.
Trained on lower-cased text in the top 102 languages with the largest Wikipedias

(see details).

bert-base-multilingual-cased

(New, recommended) 12-layer, 768-hidden, 12-heads, 179M parameters.
Trained on cased text in the top 104 languages with the largest Wikipedias

(see details).

bert-base-chinese

12-layer, 768-hidden, 12-heads, 103M parameters.
Trained on cased Simplified and Traditional Chinese text.

bert-base-german-cased

12-layer, 768-hidden, 12-heads, 110M parameters.
Trained on cased German text by Deepset.ai

(see details on deepset.ai website).

bert-large-uncased-whole-word-masking

24-layer, 1024-hidden, 16-heads, 336M parameters.
Trained on lower-cased English text using Whole-Word-Masking

(see details).

bert-large-cased-whole-word-masking

24-layer, 1024-hidden, 16-heads, 335M parameters.
Trained on cased English text using Whole-Word-Masking

(see details).

bert-large-uncased-whole-word-masking-finetuned-squad

24-layer, 1024-hidden, 16-heads, 336M parameters.
The bert-large-uncased-whole-word-masking model fine-tuned on SQuAD

(see details of fine-tuning in the example section).

bert-large-cased-whole-word-masking-finetuned-squad

24-layer, 1024-hidden, 16-heads, 335M parameters
The bert-large-cased-whole-word-masking model fine-tuned on SQuAD

(see details of fine-tuning in the example section)

bert-base-cased-finetuned-mrpc

12-layer, 768-hidden, 12-heads, 110M parameters.
The bert-base-cased model fine-tuned on MRPC

(see details of fine-tuning in the example section)

bert-base-german-dbmdz-cased

12-layer, 768-hidden, 12-heads, 110M parameters.
Trained on cased German text by DBMDZ

(see details on dbmdz repository).

bert-base-german-dbmdz-uncased

12-layer, 768-hidden, 12-heads, 110M parameters.
Trained on uncased German text by DBMDZ

(see details on dbmdz repository).

cl-tohoku/bert-base-japanese

12-layer, 768-hidden, 12-heads, 111M parameters.
Trained on Japanese text. Text is tokenized with MeCab and WordPiece, which requires an extra dependency: fugashi, a wrapper around MeCab.
Use pip install transformers["ja"] (or pip install -e .["ja"] if you install from source) to install it.

(see details on cl-tohoku repository).
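
A minimal sketch of the install-and-load steps described above, assuming PyTorch is available; AutoTokenizer resolves to the MeCab-based Japanese tokenizer for this checkpoint:

    # pip install "transformers[ja]"   # pulls in fugashi (MeCab wrapper) and dictionary packages
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese")  # MeCab + WordPiece
    model = AutoModel.from_pretrained("cl-tohoku/bert-base-japanese")

    inputs = tokenizer("吾輩は猫である。", return_tensors="pt")
    outputs = model(**inputs)
    print(outputs.last_hidden_state.shape)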

cl-tohoku/bert-base-japanese-whole-word-masking

12-layer, 768-hidden, 12-heads, 111M parameters.
Trained on Japanese text. Text is tokenized with MeCab and WordPiece, which requires an extra dependency: fugashi, a wrapper around MeCab.
Use pip install transformers["ja"] (or pip install -e .["ja"] if you install from source) to install it.

(see details on cl-tohoku repository).

cl-tohoku/bert-base-japanese-char

12-layer, 768-hidden, 12-heads, 90M parameters.
Trained on Japanese text. Text is tokenized into characters.

(see details on cl-tohoku repository).

cl-tohoku/bert-base-japanese-char-whole-word-masking

12-layer, 768-hidden, 12-heads, 90M parameters.
Trained on Japanese text using Whole-Word-Masking. Text is tokenized into characters.

(see details on cl-tohoku repository).

TurkuNLP/bert-base-finnish-cased-v1

12-layer, 768-hidden, 12-heads, 125M parameters.
Trained on cased Finnish text.

(see details on turkunlp.org).

TurkuNLP/bert-base-finnish-uncased-v1

12-layer, 768-hidden, 12-heads, 110M parameters.
Trained on uncased Finnish text.

(see details on turkunlp.org).

wietsedv/bert-base-dutch-cased

12-layer, 768-hidden, 12-heads, 110M parameters.
Trained on cased Dutch text.

(see details on wietsedv repository).

GPT

openai-gpt

12-layer, 768-hidden, 12-heads, 110M parameters.
OpenAI GPT English model

GPT-2

gpt2

12-layer, 768-hidden, 12-heads, 117M parameters.
OpenAI GPT-2 English model

gpt2-medium

24-layer, 1024-hidden, 16-heads, 345M parameters.
OpenAI’s Medium-sized GPT-2 English model

gpt2-large

36-layer, 1280-hidden, 20-heads, 774M parameters.
OpenAI’s Large-sized GPT-2 English model

gpt2-xl

48-layer, 1600-hidden, 25-heads, 1558M parameters.
OpenAI’s XL-sized GPT-2 English model

GPTNeo

EleutherAI/gpt-neo-1.3B

24-layer, 2048-hidden, 16-heads, 1.3B parameters.
EleutherAI’s GPT-3-like language model.

EleutherAI/gpt-neo-2.7B

32-layer, 2560-hidden, 20-heads, 2.7B parameters.
EleutherAI’s GPT-3-like language model.

Transformer-XL

transfo-xl-wt103

18-layer, 1024-hidden, 16-heads, 257M parameters.
English model trained on WikiText-103

XLNet

xlnet-base-cased

12-layer, 768-hidden, 12-heads, 110M parameters.
XLNet English model

xlnet-large-cased

24-layer, 1024-hidden, 16-heads, 340M parameters.
XLNet Large English model

XLM

xlm-mlm-en-2048

12-layer, 2048-hidden, 16-heads
XLM English model

xlm-mlm-ende-1024

6-layer, 1024-hidden, 8-heads
XLM English-German model trained on the concatenation of English and German wikipedia

xlm-mlm-enfr-1024

6-layer, 1024-hidden, 8-heads
XLM English-French model trained on the concatenation of English and French wikipedia

xlm-mlm-enro-1024

6-layer, 1024-hidden, 8-heads
XLM English-Romanian Multi-language model

xlm-mlm-xnli15-1024

12-layer, 1024-hidden, 8-heads
XLM Model pre-trained with MLM on the 15 XNLI languages.

xlm-mlm-tlm-xnli15-1024

12-layer, 1024-hidden, 8-heads
XLM Model pre-trained with MLM + TLM on the 15 XNLI languages.

xlm-clm-enfr-1024

6-layer, 1024-hidden, 8-heads
XLM English-French model trained with CLM (Causal Language Modeling) on the concatenation of English and French wikipedia

xlm-clm-ende-1024

6-layer, 1024-hidden, 8-heads
XLM English-German model trained with CLM (Causal Language Modeling) on the concatenation of English and German wikipedia

xlm-mlm-17-1280

16-layer, 1280-hidden, 16-heads
XLM model trained with MLM (Masked Language Modeling) on 17 languages.

xlm-mlm-100-1280

16-layer, 1280-hidden, 16-heads
XLM model trained with MLM (Masked Language Modeling) on 100 languages.

RoBERTa

roberta-base

12-layer, 768-hidden, 12-heads, 125M parameters
RoBERTa using the BERT-base architecture

(see details)

roberta-large

24-layer, 1024-hidden, 16-heads, 355M parameters
RoBERTa using the BERT-large architecture

(see details)

roberta-large-mnli

24-layer, 1024-hidden, 16-heads, 355M parameters
roberta-large fine-tuned on MNLI.

(see details)

distilroberta-base

6-layer, 768-hidden, 12-heads, 82M parameters
The DistilRoBERTa model distilled from the RoBERTa model roberta-base checkpoint.

(see details)

roberta-base-openai-detector

12-layer, 768-hidden, 12-heads, 125M parameters
roberta-base fine-tuned by OpenAI on the outputs of the 1.5B-parameter GPT-2 model.

(see details)

roberta-large-openai-detector

24-layer, 1024-hidden, 16-heads, 355M parameters
roberta-large fine-tuned by OpenAI on the outputs of the 1.5B-parameter GPT-2 model.

(see details)

DistilBERT

distilbert-base-uncased

6-layer, 768-hidden, 12-heads, 66M parameters
The DistilBERT model distilled from the BERT model bert-base-uncased checkpoint

(see details)

distilbert-base-uncased-distilled-squad

6-layer, 768-hidden, 12-heads, 66M parameters
The DistilBERT model distilled from the BERT model bert-base-uncased checkpoint, with an additional question answering (linear) layer.

(see details)

distilbert-base-cased

6-layer, 768-hidden, 12-heads, 65M parameters
The DistilBERT model distilled from the BERT model bert-base-cased checkpoint

(see details)

distilbert-base-cased-distilled-squad

6-layer, 768-hidden, 12-heads, 65M parameters
The DistilBERT model distilled from the BERT model bert-base-cased checkpoint, with an additional question answering layer.

(see details)
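
As an illustration beyond the table itself, this SQuAD-distilled checkpoint can be used for extractive question answering through the pipeline API; a minimal sketch:

    from transformers import pipeline

    # Extractive QA: the model selects an answer span from the given context.
    qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

    result = qa(
        question="How many layers does DistilBERT have?",
        context="DistilBERT is a 6-layer, 768-hidden, 12-head model distilled from BERT.",
    )
    print(result["answer"], result["score"])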

distilgpt2

6-layer, 768-hidden, 12-heads, 82M parameters
The DistilGPT2 model distilled from the GPT2 model gpt2 checkpoint.

(see details)

distilbert-base-german-cased

6-layer, 768-hidden, 12-heads, 66M parameters
The German DistilBERT model distilled from the German DBMDZ BERT model bert-base-german-dbmdz-cased checkpoint.

(see details)

distilbert-base-multilingual-cased

6-layer, 768-hidden, 12-heads, 134M parameters
The multilingual DistilBERT model distilled from the Multilingual BERT model bert-base-multilingual-cased checkpoint.

(see details)

CTRL

ctrl

48-layer, 1280-hidden, 16-heads, 1.6B parameters
Salesforce’s Large-sized CTRL English model

CamemBERT

camembert-base

12-layer, 768-hidden, 12-heads, 110M parameters
CamemBERT using the BERT-base architecture

(see details)

ALBERT

albert-base-v1

12 repeating layers, 128 embedding, 768-hidden, 12-heads, 11M parameters
ALBERT base model

(see details)

albert-large-v1

24 repeating layers, 128 embedding, 1024-hidden, 16-heads, 17M parameters
ALBERT large model

(see details)

albert-xlarge-v1

24 repeating layers, 128 embedding, 2048-hidden, 16-heads, 58M parameters
ALBERT xlarge model

(see details)

albert-xxlarge-v1

12 repeating layers, 128 embedding, 4096-hidden, 64-heads, 223M parameters
ALBERT xxlarge model

(see details)

albert-base-v2

12 repeating layers, 128 embedding, 768-hidden, 12-heads, 11M parameters
ALBERT base model with no dropout, additional training data and longer training

(see details)

albert-large-v2

24 repeating layers, 128 embedding, 1024-hidden, 16-heads, 17M parameters
ALBERT large model with no dropout, additional training data and longer training

(see details)

albert-xlarge-v2

24 repeating layers, 128 embedding, 2048-hidden, 16-heads, 58M parameters
ALBERT xlarge model with no dropout, additional training data and longer training

(see details)

albert-xxlarge-v2

12 repeating layers, 128 embedding, 4096-hidden, 64-heads, 223M parameters
ALBERT xxlarge model with no dropout, additional training data and longer training

(see details)

T5

t5-small

~60M parameters with 6-layers, 512-hidden-state, 2048 feed-forward hidden-state, 8-heads,
Trained on English text: the Colossal Clean Crawled Corpus (C4)

t5-base

~220M parameters with 12-layers, 768-hidden-state, 3072 feed-forward hidden-state, 12-heads,
Trained on English text: the Colossal Clean Crawled Corpus (C4)

t5-large

~770M parameters with 24-layers, 1024-hidden-state, 4096 feed-forward hidden-state, 16-heads,
Trained on English text: the Colossal Clean Crawled Corpus (C4)

t5-3B

~2.8B parameters with 24-layers, 1024-hidden-state, 16384 feed-forward hidden-state, 32-heads,
Trained on English text: the Colossal Clean Crawled Corpus (C4)

t5-11B

~11B parameters with 24-layers, 1024-hidden-state, 65536 feed-forward hidden-state, 128-heads,
Trained on English text: the Colossal Clean Crawled Corpus (C4)

XLM-RoBERTa

xlm-roberta-base

~270M parameters with 12-layers, 768-hidden-state, 3072 feed-forward hidden-state, 8-heads,
Trained on 2.5 TB of newly created clean CommonCrawl data in 100 languages

xlm-roberta-large

~550M parameters with 24-layers, 1024-hidden-state, 4096 feed-forward hidden-state, 16-heads,
Trained on 2.5 TB of newly created clean CommonCrawl data in 100 languages

FlauBERT

flaubert/flaubert_small_cased

6-layer, 512-hidden, 8-heads, 54M parameters
FlauBERT small architecture

(see details)

flaubert/flaubert_base_uncased

12-layer, 768-hidden, 12-heads, 137M parameters
FlauBERT base architecture with uncased vocabulary

(see details)

flaubert/flaubert_base_cased

12-layer, 768-hidden, 12-heads, 138M parameters
FlauBERT base architecture with cased vocabulary

(see details)

flaubert/flaubert_large_cased

24-layer, 1024-hidden, 16-heads, 373M parameters
FlauBERT large architecture

(see details)

BART

facebook/bart-large

24-layer, 1024-hidden, 16-heads, 406M parameters

(see details)

facebook/bart-base

12-layer, 768-hidden, 16-heads, 139M parameters

facebook/bart-large-mnli

Adds a 2-layer classification head with 1 million parameters.
The bart-large architecture with a classification head, fine-tuned on MNLI.
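
Although not mentioned in the table, this MNLI checkpoint is commonly used for zero-shot classification, where each candidate label is scored as an entailment hypothesis; a minimal sketch with freely chosen example labels:

    from transformers import pipeline

    classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

    result = classifier(
        "The new GPU delivers twice the throughput of its predecessor.",
        candidate_labels=["hardware", "politics", "cooking"],  # example labels, not from the docs
    )
    print(result["labels"][0], result["scores"][0])  # highest-scoring label and its score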

facebook/bart-large-cnn

24-layer, 1024-hidden, 16-heads, 406M parameters (same as bart-large).
The bart-large architecture fine-tuned on the CNN/DailyMail summarization task.

BARThez

moussaKam/barthez

12-layer, 768-hidden, 12-heads, 216M parameters

(see details)

moussaKam/mbarthez

24-layer, 1024-hidden, 16-heads, 561M parameters

DialoGPT

DialoGPT-small

12-layer, 768-hidden, 12-heads, 124M parameters
Trained on English text: 147M conversation-like exchanges extracted from Reddit.

DialoGPT-medium

24-layer, 1024-hidden, 16-heads, 355M parameters
Trained on English text: 147M conversation-like exchanges extracted from Reddit.

DialoGPT-large

36-layer, 1280-hidden, 20-heads, 774M parameters
Trained on English text: 147M conversation-like exchanges extracted from Reddit.

Reformer

reformer-enwik8

12-layer, 1024-hidden, 8-heads, 149M parameters
Trained on English Wikipedia data - enwik8.

reformer-crime-and-punishment

6-layer, 256-hidden, 2-heads, 3M parameters
Trained on English text: Crime and Punishment novel by Fyodor Dostoyevsky.

M2M100

facebook/m2m100_418M

24-layer, 1024-hidden, 16-heads, 418M parameters
multilingual machine translation model for 100 languages

facebook/m2m100_1.2B

48-layer, 1024-hidden, 16-heads, 1.2B parameters
multilingual machine translation model for 100 languages
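
A minimal sketch of translating with the smaller M2M100 checkpoint: the source language is set on the tokenizer, and the target language is selected by forcing its language token as the first generated token (the French example sentence is an arbitrary choice):

    from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

    tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
    model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")

    tokenizer.src_lang = "fr"  # source language code
    encoded = tokenizer("La vie est belle.", return_tensors="pt")
    generated = model.generate(**encoded, forced_bos_token_id=tokenizer.get_lang_id("en"))
    print(tokenizer.batch_decode(generated, skip_special_tokens=True))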

MarianMT

Helsinki-NLP/opus-mt-{src}-{tgt}

12-layer, 512-hidden, 8-heads, ~74M parameters.
Machine translation models; parameter counts vary depending on vocabulary size. See the sketch below for filling in the {src}-{tgt} template.
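
A minimal sketch, taking English-to-German as one concrete instance of the {src}-{tgt} template (the assumed checkpoint is Helsinki-NLP/opus-mt-en-de):

    from transformers import MarianMTModel, MarianTokenizer

    model_name = "Helsinki-NLP/opus-mt-en-de"  # {src}="en", {tgt}="de"
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    batch = tokenizer(["The weather is nice today."], return_tensors="pt")
    generated = model.generate(**batch)
    print(tokenizer.batch_decode(generated, skip_special_tokens=True))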

Pegasus

google/pegasus-{dataset}

16-layer, 1024-hidden, 16-heads, ~568M parameters, 2.2 GB per checkpoint.
Summarization models fine-tuned on various datasets (see the model list).
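
A minimal sketch, assuming google/pegasus-xsum as one concrete instance of the {dataset} template, run through the summarization pipeline:

    from transformers import pipeline

    summarizer = pipeline("summarization", model="google/pegasus-xsum")

    article = (
        "Hugging Face maintains a hub of pretrained checkpoints for many architectures. "
        "Each checkpoint can be downloaded and fine-tuned on downstream tasks."
    )
    print(summarizer(article)[0]["summary_text"])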

Longformer

allenai/longformer-base-4096

12-layer, 768-hidden, 12-heads, ~149M parameters
Starting from RoBERTa-base checkpoint, trained on documents of max length 4,096

allenai/longformer-large-4096

24-layer, 1024-hidden, 16-heads, ~435M parameters
Starting from RoBERTa-large checkpoint, trained on documents of max length 4,096

MBart

facebook/mbart-large-cc25

24-layer, 1024-hidden, 16-heads, 610M parameters
mBART (bart-large architecture) model trained on the monolingual corpora of 25 languages.

facebook/mbart-large-en-ro

24-layer, 1024-hidden, 16-heads, 610M parameters
mbart-large-cc25 model fine-tuned on WMT English-Romanian translation.

facebook/mbart-large-50

24-layer, 1024-hidden, 16-heads,
mBART model trained on the monolingual corpora of 50 languages.

facebook/mbart-large-50-one-to-many-mmt

24-layer, 1024-hidden, 16-heads,
mbart-large-50 model fine-tuned for one-to-many multilingual machine translation (from English into the other languages), covering 50 languages.

facebook/mbart-large-50-many-to-many-mmt

24-layer, 1024-hidden, 16-heads,
mbart-large-50 model fine-tuned for many-to-many multilingual machine translation, covering 50 languages.

Lxmert

lxmert-base-uncased

9 language layers, 9 relationship layers, and 12 cross-modality layers;
768-hidden, 12-heads (for each layer), ~228M parameters.
Starting from the lxmert-base checkpoint, trained on over 9 million image-text pairs from COCO, VisualGenome, GQA, and VQA.

Funnel Transformer

funnel-transformer/small

14 layers: 3 blocks of 4 layers, then a 2-layer decoder; 768-hidden, 12-heads, 130M parameters

(see details)

funnel-transformer/small-base

12 layers: 3 blocks of 4 layers (no decoder), 768-hidden, 12-heads, 115M parameters

(see details)

funnel-transformer/medium

14 layers: 3 blocks of 6, 3x2, and 3x2 layers, then a 2-layer decoder; 768-hidden, 12-heads, 130M parameters

(see details)

funnel-transformer/medium-base

12 layers: 3 blocks of 6, 3x2, and 3x2 layers (no decoder); 768-hidden, 12-heads, 115M parameters

(see details)

funnel-transformer/intermediate

20 layers: 3 blocks of 6 layers, then a 2-layer decoder; 768-hidden, 12-heads, 177M parameters

(see details)

funnel-transformer/intermediate-base

18 layers: 3 blocks of 6 layers (no decoder), 768-hidden, 12-heads, 161M parameters

(see details)

funnel-transformer/large

26 layers: 3 blocks of 8 layers, then a 2-layer decoder; 1024-hidden, 12-heads, 386M parameters

(see details)

funnel-transformer/large-base

24 layers: 3 blocks of 8 layers (no decoder), 1024-hidden, 12-heads, 358M parameters

(see details)

funnel-transformer/xlarge

32 layers: 3 blocks of 10 layers, then a 2-layer decoder; 1024-hidden, 12-heads, 468M parameters

(see details)

funnel-transformer/xlarge-base

30 layers: 3 blocks of 10 layers (no decoder), 1024-hidden, 12-heads, 440M parameters

(see details)

LayoutLM

microsoft/layoutlm-base-uncased

12 layers, 768-hidden, 12-heads, 113M parameters

(see details)

microsoft/layoutlm-large-uncased

24 layers, 1024-hidden, 16-heads, 343M parameters

(see details)

DeBERTa

microsoft/deberta-base

12-layer, 768-hidden, 12-heads, ~140M parameters
DeBERTa using the BERT-base architecture

(see details)

microsoft/deberta-large

24-layer, 1024-hidden, 16-heads, ~400M parameters
DeBERTa using the BERT-large architecture

(see details)

microsoft/deberta-xlarge

48-layer, 1024-hidden, 16-heads, ~750M parameters
DeBERTa XLarge model with an architecture similar to BERT

(see details)

microsoft/deberta-xlarge-v2

24-layer, 1536-hidden, 24-heads, ~900M parameters
DeBERTa XLarge V2 model with an architecture similar to BERT

(see details)

microsoft/deberta-xxlarge-v2

48-layer, 1536-hidden, 24-heads, ~1.5B parameters
DeBERTa XXLarge V2 model with an architecture similar to BERT

(see details)

SqueezeBERT

squeezebert/squeezebert-uncased

12-layer, 768-hidden, 12-heads, 51M parameters, 4.3x faster than bert-base-uncased on a smartphone.
SqueezeBERT architecture pretrained from scratch on the masked language modeling (MLM) and sentence order prediction (SOP) tasks.

squeezebert/squeezebert-mnli

12-layer, 768-hidden, 12-heads, 51M parameters, 4.3x faster than bert-base-uncased on a smartphone.
This is the squeezebert-uncased model fine-tuned on the MNLI sentence-pair classification task with distillation from electra-base.

squeezebert/squeezebert-mnli-headless

12-layer, 768-hidden, 12-heads, 51M parameters, 4.3x faster than bert-base-uncased on a smartphone.
This is the squeezebert-uncased model fine-tuned on the MNLI sentence-pair classification task with distillation from electra-base.
The final classification layer is removed, so when you fine-tune, the final layer will be reinitialized.
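
A minimal sketch of reusing the headless checkpoint on a new task: because the classification head was removed, a freshly initialized head is created at load time (num_labels=2 here is an assumed binary task, not part of the checkpoint):

    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("squeezebert/squeezebert-mnli-headless")
    model = AutoModelForSequenceClassification.from_pretrained(
        "squeezebert/squeezebert-mnli-headless",
        num_labels=2,  # new, randomly initialized classification head
    )
    # ... fine-tune on the downstream sentence(-pair) task as usual, e.g. with Trainer.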
