ALBERT for Vietnamese
Introduction
ALBERT is "A Lite" version of BERT, a popular unsupervised language representation learning algorithm. ALBERT uses parameter-reduction techniques that allow for large-scale configurations, overcome previous memory limitations, and achieve better behavior with respect to model degradation.
For a detailed technical description of the algorithm, see the paper:
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations (Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut) and the official repository Google ALBERT
Google researchers introduced three standout innovations with ALBERT. [1]
- Factorized embedding parameterization: Researchers isolated the size of the hidden layers from the size of vocabulary embeddings by projecting one-hot vectors into a lower-dimensional embedding space and then to the hidden space, which made it easier to increase the hidden layer size without significantly increasing the parameter size of the vocabulary embeddings (see the sketch after this list).
- Cross-layer parameter sharing: Researchers chose to share all parameters across layers to prevent the parameters from growing along with the depth of the network. As a result, the large ALBERT model has about 18x fewer parameters compared to BERT-large.
- Inter-sentence coherence loss: In the BERT paper, Google proposed a next-sentence prediction technique to improve the model's performance in downstream tasks, but subsequent studies found this to be unreliable. Researchers used a sentence-order prediction (SOP) loss to model inter-sentence coherence in ALBERT, which enabled the new model to perform more robustly in multi-sentence encoding tasks.
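As a rough illustration of the first two points, here is a minimal sketch (not part of this repo) that plugs the base-config numbers used below (vocab_size=30000, embedding_size=128, hidden_size=768) into the embedding parameter counts with and without factorization:

```python
# Illustrative parameter-count comparison, using values from the base config below.
vocab_size = 30000
embedding_size = 128
hidden_size = 768
num_layers = 12

# BERT-style embedding: project one-hot vocab vectors straight to hidden_size.
bert_style_embedding = vocab_size * hidden_size            # 23,040,000

# ALBERT-style factorized embedding: vocab -> embedding_size -> hidden_size.
albert_style_embedding = (vocab_size * embedding_size      # 3,840,000
                          + embedding_size * hidden_size)  # + 98,304

print(f"unfactorized: {bert_style_embedding:,}")   # 23,040,000
print(f"factorized:   {albert_style_embedding:,}") # 3,938,304

# Cross-layer sharing: one set of transformer-layer weights is reused by all
# num_layers layers, so the layer stack contributes roughly 1/num_layers of the
# parameters a BERT-style model of the same depth would spend on its layers.
```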
We reproduced ALBERT pre-training on a Vietnamese dataset and provide the pre-trained models below.
Data preparation
The training data is the Vietnamese Wikipedia corpus from Wikipedia.
The data is preprocessed and extracted using WikiExtractor.
We trained a SentencePiece model on the Vietnamese Wikipedia corpus to produce the vocab file, with a vocabulary of 30,000 tokens.
The SentencePiece model and vocab file are in the assets folder.
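For reference, a SentencePiece model like the one in assets can be produced with the sentencepiece Python package. This is a minimal sketch; the input/output file names are illustrative, not necessarily the exact ones used for this repo:

```python
# Minimal sketch: train a SentencePiece model on the extracted Vietnamese
# Wikipedia text. File names here are illustrative.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="viwiki.txt",               # plain text produced by WikiExtractor
    model_prefix="albertvi_30k-clean",
    vocab_size=30000,
)

# Load the resulting model and tokenize a sample sentence.
sp = spm.SentencePieceProcessor(model_file="albertvi_30k-clean.model")
print(sp.encode("Hà Nội là thủ đô của Việt Nam", out_type=str))
```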
Pretraining
Creating data for pretraining
We trained ALBERT models using the version-2 base and large configs.
Base Config
{ "attention_probs_dropout_prob": 0, "hidden_act": "gelu", "hidden_dropout_prob": 0, "embedding_size": 128, "hidden_size": 768, "initializer_range": 0.02, "intermediate_size": 3072, "max_position_embeddings": 512, "num_attention_heads": 12, "num_hidden_layers": 12, "num_hidden_groups": 1, "net_structure_type": 0, "gap_size": 0, "num_memory_blocks": 0, "inner_group_num": 1, "down_scale_factor": 1, "type_vocab_size": 2, "vocab_size": 30000 }
Large Config
{ "attention_probs_dropout_prob": 0, "hidden_act": "gelu", "hidden_dropout_prob": 0, "embedding_size": 128, "hidden_size": 1024, "initializer_range": 0.02, "intermediate_size": 4096, "max_position_embeddings": 512, "num_attention_heads": 16, "num_hidden_layers": 24, "num_hidden_groups": 1, "net_structure_type": 0, "gap_size": 0, "num_memory_blocks": 0, "inner_group_num": 1, "down_scale_factor": 1, "type_vocab_size": 2, "vocab_size": 30000 }
Create tfrecords for the training data:
```bash
python create_pretraining_data.py \
  --input_file={path to wiki data} \
  --dupe_factor=10 \
  --output_file={path to save tfrecord} \
  --vocab_file assets/albertvi_30k-clean.vocab \
  --spm_model_file assets/albertvi_30k-clean.model
```
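After the tfrecord file is written, a quick sanity check is to count the records it contains. A minimal sketch using tf.data (assumes TensorFlow 2 or eager execution; the path is whatever you passed to --output_file):

```python
# Minimal sanity check: count the examples written by create_pretraining_data.py.
# Replace the path with the value you passed to --output_file.
import tensorflow as tf

dataset = tf.data.TFRecordDataset("pretrain_data.tfrecord")
num_examples = sum(1 for _ in dataset)
print(f"{num_examples} training examples")
```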
Pre-training base config
```bash
python run_pretraining.py \
  --albert_config_file=assets/base/albert_config.json \
  --input_file={tfrecord path} \
  --output_dir={} \
  --export_dir={} \
  --train_batch_size=4096 \
  --do_eval=True \
  --use_tpu=True
```
Pre-training large config
```bash
python run_pretraining.py \
  --albert_config_file=assets/large/albert_config.json \
  --input_file={tfrecord path} \
  --output_dir={} \
  --export_dir={} \
  --train_batch_size=512 \
  --do_eval=True \
  --use_tpu=True
```
Pretrained model
We ran ~1M steps for the base config and ~250k steps for the large config.
Eval result at step 1001000
```
***** Eval results *****
global_step = 1001000
loss = 1.6706645
masked_lm_accuracy = 0.66281766
masked_lm_loss = 1.6631233
sentence_order_accuracy = 0.9998438
sentence_order_loss = 0.00065024174
```
You can download the pretrained model for the base config here.
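After downloading, one quick way to verify the checkpoint is to list the variables it stores. A minimal sketch (the checkpoint prefix below is illustrative; use the actual file prefix from the download):

```python
# Minimal sketch: list the variables stored in the downloaded checkpoint.
# The checkpoint prefix is illustrative.
import tensorflow as tf

for name, shape in tf.train.list_variables("model.ckpt-1001000"):
    print(name, shape)
```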
Experimental Results
Coming soon
Acknowledgement
Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC).
Thanks so much to @lampts and the @dal team for supporting me in finishing this project.
Conclusion
I hope to receive contributions and feedback from everyone. Email me or create an issue with any questions.