Allocating Large Vocabulary Capacity for Cross-lingual Language Model Pre-training

Bo Zheng; Li Dong; Shaohan Huang; Saksham Singhal; Wanxiang Che; Ting Liu; Xia Song; Furu Wei

arXiv:2109.07306·cs.CL·September 16, 2021

TL;DR

This paper introduces VoCap, an algorithm for allocating vocabulary capacity across languages in cross-lingual models, and proposes k-NN-based sampling to speed up training while maintaining performance.

Contribution

The paper presents VoCap, an algorithm that determines how much vocabulary capacity each language should receive, and k-NN-based target sampling, which accelerates the expensive softmax during cross-lingual pre-training.

Findings

01

VoCap improves vocabulary representation for under-represented languages.

02

k-NN-based target sampling speeds up pre-training with a large vocabulary while achieving comparable performance.

03

Multilingual vocabularies learned with VoCap enhance cross-lingual model performance.

Abstract

Compared to monolingual models, cross-lingual models usually require a more expressive vocabulary to represent all languages adequately. We find that many languages are under-represented in recent cross-lingual language models due to the limited vocabulary capacity. To this end, we propose an algorithm VoCap to determine the desired vocabulary capacity of each language. However, increasing the vocabulary size significantly slows down the pre-training speed. In order to address the issues, we propose k-NN-based target sampling to accelerate the expensive softmax. Our experiments show that the multilingual vocabulary learned with VoCap benefits cross-lingual language model pre-training. Moreover, k-NN-based target sampling mitigates the side-effects of increasing the vocabulary size while achieving comparable performance and faster pre-training speed. The code and the pretrained…
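The abstract names k-NN-based target sampling but does not spell out the procedure. The following is a minimal sketch of one plausible reading, not the authors' implementation: the softmax is computed over a reduced candidate set made of the batch's target tokens plus each target's k nearest neighbours in embedding space. The function names, the dot-product similarity, and the brute-force neighbour search are all assumptions chosen for illustration.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def knn_candidates(embeddings, target_ids, k):
    """Candidate token ids: the batch targets plus each target's k nearest
    neighbours by dot product (assumed similarity; brute-force search)."""
    candidates = set(target_ids)
    for t in set(target_ids):
        scored = [
            (dot(embeddings[t], emb), j)
            for j, emb in enumerate(embeddings)
            if j != t
        ]
        scored.sort(reverse=True)
        candidates.update(j for _, j in scored[:k])
    return sorted(candidates)


def sampled_softmax_loss(hidden, target_id, embeddings, candidates):
    """Cross-entropy for one position, normalising over `candidates` only
    instead of the full vocabulary. `target_id` must be in `candidates`."""
    logits = [dot(hidden, embeddings[j]) for j in candidates]
    peak = max(logits)  # subtract the max for numerical stability
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_z - dot(hidden, embeddings[target_id])
```

In this sketch the cost of the normalisation scales with the size of the candidate set rather than the vocabulary size, which is the source of the speed-up; because the normaliser omits some tokens, the loss is an approximation of the full softmax loss.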

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications