Allocating Large Vocabulary Capacity for Cross-lingual Language Model Pre-training
Bo Zheng, Li Dong, Shaohan Huang, Saksham Singhal, Wanxiang Che, Ting Liu, Xia Song, Furu Wei

TL;DR
This paper introduces VoCap, an algorithm that determines how much vocabulary capacity each language should receive in a cross-lingual model, and proposes k-NN-based target sampling to speed up pre-training with the larger vocabulary while keeping performance comparable.
Contribution
The paper presents VoCap, which determines the desired vocabulary capacity of each language, and k-NN-based target sampling, which accelerates the expensive softmax over the enlarged vocabulary during cross-lingual pre-training.
Findings
VoCap improves vocabulary representation for under-represented languages.
k-NN-based target sampling speeds up pre-training while achieving comparable performance.
Multilingual vocabularies learned with VoCap enhance cross-lingual model performance.
Abstract
Compared to monolingual models, cross-lingual models usually require a more expressive vocabulary to represent all languages adequately. We find that many languages are under-represented in recent cross-lingual language models due to the limited vocabulary capacity. To this end, we propose an algorithm VoCap to determine the desired vocabulary capacity of each language. However, increasing the vocabulary size significantly slows down the pre-training speed. In order to address the issues, we propose k-NN-based target sampling to accelerate the expensive softmax. Our experiments show that the multilingual vocabulary learned with VoCap benefits cross-lingual language model pre-training. Moreover, k-NN-based target sampling mitigates the side-effects of increasing the vocabulary size while achieving comparable performance and faster pre-training speed. The code and the pretrained…
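The abstract does not spell out how k-NN-based target sampling works, so the following is only a minimal sketch of the general idea under stated assumptions: the softmax is restricted to the tokens that appear as targets in a batch plus each target's k nearest neighbours in the output-embedding space. The function names, the dot-product similarity, and the toy sizes are all hypothetical and are not taken from the authors' code.

```python
import math
import random


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def knn_candidates(embeddings, target_ids, k):
    """Return the batch targets plus each target's k nearest neighbours.

    Assumption: similarity is the dot product between output embeddings.
    """
    candidates = set(target_ids)
    for t in set(target_ids):
        scored = sorted(
            ((dot(embeddings[t], e), j) for j, e in enumerate(embeddings) if j != t),
            reverse=True,
        )
        candidates.update(j for _, j in scored[:k])
    return sorted(candidates)


def softmax_loss(hidden, target_id, embeddings, candidate_ids):
    """Negative log-likelihood of target_id, normalised over candidate_ids only."""
    logits = [dot(hidden, embeddings[j]) for j in candidate_ids]
    top = max(logits)  # subtract the max for numerical stability
    log_z = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_z - dot(hidden, embeddings[target_id])


# Toy setup (hypothetical sizes): 50-token vocabulary, 8-dimensional embeddings.
random.seed(0)
vocab_size, dim, k = 50, 8, 4
embeddings = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(vocab_size)]
hidden = [random.gauss(0, 1) for _ in range(dim)]
batch_targets = [3, 17, 3]

candidates = knn_candidates(embeddings, batch_targets, k)
sampled_loss = softmax_loss(hidden, 3, embeddings, candidates)
full_loss = softmax_loss(hidden, 3, embeddings, range(vocab_size))
print(len(candidates), "of", vocab_size, "logits computed")
```

Because the restricted normaliser sums over a subset of the vocabulary, the sampled loss is never larger than the full-softmax loss. The saving comes from computing only `len(candidates)` logits per position instead of `vocab_size`.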
Code & Models
- 🤗 HIT-SCIR/Chinese-Mixtral-8x7B-adapter (model) · 1 like
- 🤗 HIT-SCIR/Chinese-Mixtral-8x7B (model) · 8.2k downloads · 45 likes
- 🤗 LoneStriker/Chinese-Mixtral-8x7B-2.4bpw-h6-exl2 (model)
- 🤗 LoneStriker/Chinese-Mixtral-8x7B-3.0bpw-h6-exl2 (model) · 5 downloads
- 🤗 LoneStriker/Chinese-Mixtral-8x7B-3.5bpw-h6-exl2 (model) · 1 download
- 🤗 LoneStriker/Chinese-Mixtral-8x7B-3.75bpw-h6-exl2 (model) · 1 download
- 🤗 LoneStriker/Chinese-Mixtral-8x7B-5.0bpw-h6-exl2 (model)
- 🤗 LoneStriker/Chinese-Mixtral-8x7B-6.0bpw-h6-exl2 (model) · 3 downloads
- 🤗 RichardErkhov/HIT-SCIR_-_Chinese-Mixtral-8x7B-gguf (model) · 310 downloads
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
