TL;DR
This paper introduces RelateLM, a method leveraging language relatedness, transliteration, and pseudo translation to improve language model adaptation for low-resource Indian languages, demonstrating effective transfer from related high-resource languages.
Contribution
Proposes RelateLM, a novel approach that exploits language relatedness, script transliteration, and pseudo translation for low-resource language model adaptation.
Findings
RelateLM improves performance on low-resource Indian languages.
Using related languages as pivots enhances transfer learning.
Transliteration and pseudo translation effectively augment data.
Abstract
Recent research in multilingual language models (LMs) has demonstrated their ability to effectively handle multiple languages in a single model. This holds promise for low web-resource languages (LRLs), as multilingual models can enable transfer of supervision from high-resource languages to LRLs. However, incorporating a new language in an LM remains a challenge, particularly for languages with limited corpora and in unseen scripts. In this paper, we argue that relatedness among languages in a language family may be exploited to overcome some of the corpus limitations of LRLs, and propose RelateLM. We focus on Indian languages and exploit relatedness along two dimensions: (1) script (since many Indic scripts originated from the Brahmic script), and (2) sentence structure. RelateLM uses transliteration to convert the unseen script of limited LRL text into the script of a Related…
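To make the transliteration idea concrete, here is a minimal sketch, not the paper's implementation. It relies on the fact that Unicode lays out the major Indic scripts in parallel 128-code-point blocks, so a character can often be moved between scripts by shifting its code point by the distance between block starts. The names `BLOCK_START` and `transliterate` are hypothetical and introduced only for this illustration. The mapping is approximate: some positions are unassigned in the target block, and shared punctuation such as the danda lives only in the Devanagari block.

```python
# Start of the Unicode block for a few Brahmic-derived scripts.
# Each block spans 128 code points and follows a roughly parallel layout.
BLOCK_START = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
}
BLOCK_SIZE = 0x80


def transliterate(text: str, src: str, tgt: str) -> str:
    """Shift characters from the src script block into the tgt script block.

    Characters outside the source block (spaces, digits, Latin text) are
    copied unchanged. This is a naive offset mapping, not a linguistically
    complete transliterator.
    """
    src_start = BLOCK_START[src]
    offset = BLOCK_START[tgt] - src_start
    out = []
    for ch in text:
        cp = ord(ch)
        if src_start <= cp < src_start + BLOCK_SIZE:
            out.append(chr(cp + offset))
        else:
            out.append(ch)
    return "".join(out)


# Example: the word "Gujarat" written in Gujarati script, moved to Devanagari.
gujarati_word = "\u0a97\u0ac1\u0a9c\u0ab0\u0abe\u0aa4"
devanagari_word = transliterate(gujarati_word, "gujarati", "devanagari")
```

A production pipeline would more likely use a dedicated transliteration library with per-script exception tables. The offset trick only illustrates why LRL text in an unseen script can be rewritten cheaply into the script of a related language.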
