TL;DR
IndicBART is a compact, multilingual pre-trained model tailored for Indic languages that leverages script similarities to enhance natural language generation tasks like translation and summarization, especially in low-resource settings.
Contribution
This paper introduces IndicBART, a pre-trained sequence-to-sequence model covering 11 Indic languages and English, which uses the orthographic similarity between Indic scripts to improve transfer learning between related languages.
Findings
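IndicBART is competitive with large pre-trained models like mBART50 on NMT and extreme summarization, despite being significantly smaller.
It also performs well in very low-resource translation scenarios where the languages are not included in pre-training or fine-tuning.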
Script sharing and multilingual training, together with better use of limited model capacity, are cited as factors behind these results.
Abstract
In this paper, we study pre-trained sequence-to-sequence models for a group of related languages, with a focus on Indic languages. We present IndicBART, a multilingual, sequence-to-sequence pre-trained model focusing on 11 Indic languages and English. IndicBART utilizes the orthographic similarity between Indic scripts to improve transfer learning between similar Indic languages. We evaluate IndicBART on two NLG tasks: Neural Machine Translation (NMT) and extreme summarization. Our experiments on NMT and extreme summarization show that a model specific to related languages like IndicBART is competitive with large pre-trained models like mBART50 despite being significantly smaller. It also performs well on very low-resource translation scenarios where languages are not included in pre-training or fine-tuning. Script sharing, multilingual training, and better utilization of limited model…
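The script-sharing idea in the abstract can be illustrated with a minimal sketch. It relies on one fact about Unicode: the blocks for the major Indic scripts follow a common ISCII-derived layout, so corresponding letters usually sit at the same offset within each 128-code-point block. The function name `to_devanagari`, the choice of Devanagari as the target script, and the pure offset mapping are assumptions made for illustration. This is not necessarily the authors' preprocessing pipeline, and the offset mapping is only approximate, since some code points have no exact counterpart in the target block.

```python
# Illustrative sketch only: map text in several Indic scripts onto one
# shared script (Devanagari, chosen here as an assumption) by Unicode offset.

DEVANAGARI_START = 0x0900
BLOCK_SIZE = 0x80  # each of these Unicode script blocks spans 128 code points

# Start of the Unicode block for each source script.
SCRIPT_BLOCK_STARTS = {
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}


def to_devanagari(text: str) -> str:
    """Shift characters from the blocks above to the same offset in Devanagari.

    Characters outside those blocks (Latin, digits, Devanagari itself)
    are returned unchanged.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        for start in SCRIPT_BLOCK_STARTS.values():
            if start <= cp < start + BLOCK_SIZE:
                cp = DEVANAGARI_START + (cp - start)
                break
        out.append(chr(cp))
    return "".join(out)


if __name__ == "__main__":
    # Bengali and Gujarati spellings of the same word, plus an English word.
    for word in ["ভারত", "ભારત", "India"]:
        print(word, "->", to_devanagari(word))
```

After such a mapping, cognates written in different scripts share the same character sequence, so a single subword vocabulary can cover them. That is the kind of cross-lingual overlap that script sharing is meant to exploit.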
