TransMI: A Framework to Create Strong Baselines from Multilingual Pretrained Language Models for Transliterated Data

Yihong Liu; Chunlan Ma; Haotian Ye; Hinrich Schütze

arXiv:2405.09913 · cs.CL · December 17, 2024

TL;DR

TransMI is a framework that leverages existing multilingual pretrained language models to effectively handle transliterated data across scripts without retraining, significantly improving crosslingual transfer performance.

Contribution

It introduces a simple, training-free method to adapt mPLMs for transliterated data by transliterating vocabularies and merging embeddings, enabling effective cross-script transfer.
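The idea of transliterating a vocabulary and merging embeddings can be illustrated with a toy sketch. This is not the authors' implementation: the character table `TOY_MAP`, the function names, and the data layout are all hypothetical, and initializing each new subword with the mean of its source embeddings is an assumption made here for illustration only.

```python
from collections import defaultdict

# Hypothetical toy transliteration table (Cyrillic -> Latin). A real setup
# would use a full transliteration tool; this map is an assumption.
TOY_MAP = {"д": "d", "а": "a", "н": "n", "е": "e", "э": "e", "т": "t"}


def transliterate(token):
    """Map each character through TOY_MAP, leaving unknown characters as-is."""
    return "".join(TOY_MAP.get(ch, ch) for ch in token)


def transmi_sketch(vocab, embeddings):
    """vocab: token -> id; embeddings: list of vectors indexed by id."""
    # Transliterate: group original token ids by their common-script form.
    sources = defaultdict(list)
    for token, idx in vocab.items():
        sources[transliterate(token)].append(idx)

    # Merge: keep every original token, append only unseen transliterations.
    new_vocab = dict(vocab)
    new_embeddings = [list(vec) for vec in embeddings]
    dim = len(embeddings[0])
    for latin, ids in sources.items():
        if latin in new_vocab:
            continue  # already covered by the original vocabulary
        # Initialize (assumed strategy): mean of the source embeddings.
        mean = [sum(embeddings[i][d] for i in ids) / len(ids) for d in range(dim)]
        new_vocab[latin] = len(new_embeddings)
        new_embeddings.append(mean)
    return new_vocab, new_embeddings


vocab = {"да": 0, "нет": 1, "нэт": 2, "da": 3}
embeddings = [[1.0, 0.0], [0.0, 2.0], [2.0, 4.0], [0.5, 0.5]]
new_vocab, new_embeddings = transmi_sketch(vocab, embeddings)
```

In this toy run, "да" transliterates to "da", which already exists, so nothing is added for it. "нет" and "нэт" both transliterate to "net", which is new, so it is appended with an embedding derived from both sources. The original entries are left untouched, which is what lets the extended model keep handling non-transliterated input.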

Findings

1. TransMI improves crosslingual transfer by 3% to 34%.
2. It preserves the ability to handle non-transliterated data.
3. The framework is applicable to multiple strong mPLMs.

Abstract

Transliterating related languages that use different scripts into a common script is effective for improving crosslingual transfer in downstream tasks. However, this methodology often makes pretraining a model from scratch unavoidable, as transliteration brings about new subwords not covered in existing multilingual pretrained language models (mPLMs). This is undesirable because it requires a large computation budget. A more promising way is to make full use of available mPLMs. To this end, this paper proposes a simple but effective framework: Transliterate-Merge-Initialize (TransMI). TransMI can create strong baselines for data that is transliterated into a common script by exploiting an existing mPLM and its tokenizer without any training. TransMI has three stages: (a) transliterate the vocabulary of an mPLM into a common script; (b) merge the new vocabulary with the original…

Code & Models

Repositories

cisnlp/transmi (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques