Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment

Orgest Xhelili; Yihong Liu; Hinrich Schütze

arXiv:2406.19759·cs.CL·October 10, 2024

TL;DR

This paper introduces a transliteration-based post-training alignment method to enhance cross-lingual transfer in multilingual models, especially for languages with different scripts, resulting in significant performance improvements.

Contribution

The paper proposes a novel transliteration-based post-pretraining alignment technique to improve cross-lingual transfer for languages with different scripts, demonstrating substantial gains in various tasks.

Findings

01

After post-pretraining alignment (PPA), the models outperform the original models by up to 50%.

02

Improvements are also significant when a non-English language is used as the transfer source.

03

Effective across diverse language groups and downstream tasks.

Abstract

Multilingual pre-trained models (mPLMs) have shown impressive performance on cross-lingual transfer tasks. However, the transfer performance is often hindered when a low-resource target language is written in a different script than the high-resource source language, even though the two languages may be related or share parts of their vocabularies. Inspired by recent work that uses transliteration to address this problem, our paper proposes a transliteration-based post-pretraining alignment (PPA) method aiming to improve the cross-lingual alignment between languages using diverse scripts. We select two areal language groups, Mediterranean-Amharic-Farsi and South+East Asian Languages, wherein the languages are mutually influenced but use different scripts. We apply our method to these language groups and conduct extensive experiments on a spectrum of downstream…
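
The script barrier described in the abstract can be illustrated with a toy example. The sketch below is not the paper's PPA method or its transliteration tool. It only shows, under stated assumptions, why two related words share no surface characters until one of them is transliterated into the other's script. The mapping table `GREEK_TO_LATIN`, the helper names, and the choice of Greek as the example language are all hypothetical and chosen for illustration.

```python
import unicodedata

# Hypothetical toy table: covers only the letters needed for the example word.
# A real transliteration system would cover full scripts and many languages.
GREEK_TO_LATIN = {"δ": "d", "ρ": "r", "α": "a", "μ": "m"}


def strip_accents(text):
    """Decompose characters (NFD) and drop combining marks such as accents."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def transliterate(text, table=GREEK_TO_LATIN):
    """Map each character through the table; unknown characters pass through."""
    return "".join(table.get(ch, ch) for ch in strip_accents(text.lower()))


def char_jaccard(a, b):
    """Jaccard overlap between the character sets of two strings."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


greek_word = "δράμα"    # Greek script
english_word = "drama"  # Latin script

overlap_before = char_jaccard(greek_word, english_word)
overlap_after = char_jaccard(transliterate(greek_word), english_word)
print(overlap_before, overlap_after)
```

In this sketch the character overlap is zero in the original scripts and complete after transliteration, which is the kind of shared surface form that a transliteration-based alignment objective could exploit. How the paper actually uses transliterated text during post-pretraining is not specified in the truncated abstract above.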

Code & Models

Repositories

cisnlp/transliteration-ppa
PyTorch · Official

Videos

Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment · Underline

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification