Extending the Subwording Model of Multilingual Pretrained Models for New Languages

Kenji Imamura; Eiichiro Sumita

arXiv:2211.15965·cs.CL·November 30, 2022

TL;DR

This paper proposes a method to extend multilingual pretrained models to new languages by adding subwords to the tokenizer without retraining the entire model, demonstrated on English-Inuktitut translation.

Contribution

It introduces a technique to incorporate new languages into existing multilingual models by expanding the tokenizer, avoiding full retraining of the model.

Findings

1. Inuktitut subwords were added to the SentencePiece tokenizer used by mBART-50.

2. The extended mBART-50 model was applied to English-Inuktitut translation.

3. The segmentation of already pretrained languages was left unchanged after the new subwords were added.

Abstract

Multilingual pretrained models are effective for machine translation and cross-lingual processing because they contain multiple languages in one model. However, they are pretrained after their tokenizers are fixed; therefore it is difficult to change the vocabulary after pretraining. When we extend the pretrained models to new languages, we must modify the tokenizers simultaneously. In this paper, we add new subwords to the SentencePiece tokenizer to apply a multilingual pretrained model to new languages (Inuktitut in this paper). In our experiments, we segmented Inuktitut sentences into subwords without changing the segmentation of already pretrained languages, and applied the mBART-50 pretrained model to English-Inuktitut translation.
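The core idea in the abstract, adding subwords for a new language while leaving existing segmentations intact, can be illustrated with a toy unigram-style tokenizer. The sketch below is not the authors' implementation and does not use the SentencePiece library; `viterbi_segment`, `extend_vocab`, the vocabularies, and all scores are hypothetical names and values chosen for illustration. It assumes the new pieces use only characters that the original vocabulary lacks, which is why the old segmentation cannot change in this toy setting.

```python
import math


def viterbi_segment(text, scores, unk_score=-20.0):
    """Return the highest-scoring split of text into pieces (unigram-style)."""
    n = len(text)
    best = [-math.inf] * (n + 1)   # best[i]: best score of text[:i]
    back = [0] * (n + 1)           # back[i]: start index of the last piece
    best[0] = 0.0
    max_len = max(len(p) for p in scores)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in scores:
                s = scores[piece]
            elif end - start == 1:
                s = unk_score      # fallback for an unknown single character
            else:
                continue
            if best[start] + s > best[end]:
                best[end] = best[start] + s
                back[end] = start
    pieces, pos = [], n
    while pos > 0:
        pieces.append(text[back[pos]:pos])
        pos = back[pos]
    return pieces[::-1]


def extend_vocab(scores, new_pieces):
    """Copy the vocabulary and add new pieces; existing entries are kept as-is."""
    extended = dict(scores)
    for piece, score in new_pieces.items():
        extended.setdefault(piece, score)
    return extended


# Hypothetical "pretrained" vocabulary (log-probability-like scores).
base_scores = {
    "▁the": -3.0, "▁cat": -4.0, "s": -3.5, "▁": -2.5,
    "t": -5.0, "h": -5.0, "e": -5.0, "c": -5.0, "a": -5.0,
}

# Hypothetical pieces for the new language, written in Inuktitut syllabics.
new_pieces = {"▁ᐃᓄᒃ": -6.0, "ᑎᑐᑦ": -6.0}

extended_scores = extend_vocab(base_scores, new_pieces)

english = "▁the▁cats"
inuktitut = "▁ᐃᓄᒃᑎᑐᑦ"

# The pretrained-language text is segmented identically before and after.
print(viterbi_segment(english, base_scores))
print(viterbi_segment(english, extended_scores))

# The new-language text falls back to single characters before the extension.
print(viterbi_segment(inuktitut, base_scores))
print(viterbi_segment(inuktitut, extended_scores))
```

In a real model, each added piece also needs a new row in the embedding table, since its token ID did not exist during pretraining; how those rows are initialized is outside what this sketch covers.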

Code & Models

Repositories

kenji-imamura/sentpiece_mimic (official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Byte Pair Encoding · SentencePiece