NepTam: A Nepali-Tamang Parallel Corpus and Baseline Machine Translation Experiments
Rupak Raj Ghimire, Bipesh Subedi, Balaram Prasain, Prakash Poudyal, Praveen Acharya, Nischal Karki, Rupak Tiwari, Rishikesh Kumar Sharma, Jenny Poudel, Bal Krishna Bal

TL;DR
This paper introduces NepTam, a new Nepali-Tamang parallel corpus of 20K high-quality and 80K synthetic sentence pairs, enabling machine translation research for these low-resource languages, with baseline experiments showing promising results.
Contribution
It creates the first large-scale Nepali-Tamang parallel corpus and evaluates multiple multilingual models, establishing a baseline for future translation research.
Findings
NLLB-200 achieved the highest BLEU scores of 40.92 and 45.26.
The datasets cover five diverse domains.
Baseline models demonstrate effective translation performance.
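For readers unfamiliar with the metric behind the scores above, here is a minimal sketch of corpus-level BLEU, assuming one reference per hypothesis and plain whitespace tokenization. The function names `ngrams` and `corpus_bleu` are illustrative, and the example sentences are placeholders rather than NepTam data. The evaluation tooling and tokenization behind the reported numbers are not stated in this summary, so values from this sketch are not comparable to them.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale, single reference, no smoothing."""
    matches = [0] * max_n   # clipped n-gram matches per order
    totals = [0] * max_n    # hypothesis n-gram counts per order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())

    # Without smoothing, any order with zero matches gives a score of zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0

    # Geometric mean of the modified n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize hypotheses shorter than the references.
    brevity = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)


if __name__ == "__main__":
    refs = ["the cat sat on the mat today"]
    hyps = ["the cat sat on the mat"]
    print(round(corpus_bleu(hyps, refs), 2))
```

The score is the geometric mean of clipped 1-gram to 4-gram precisions multiplied by a brevity penalty, which is why a hypothesis that is correct but shorter than its reference still scores below 100.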
Abstract
Modern translation systems rely heavily on large, high-quality parallel datasets for state-of-the-art performance. However, such resources are largely unavailable for most South Asian languages. Nepali and Tamang fall into this category, with Tamang being among the least digitally resourced languages in the region. This work addresses the gap by developing NepTam20K, a 20K gold-standard parallel corpus, and NepTam80K, an 80K synthetic Nepali-Tamang parallel corpus, both sentence-aligned and designed to support machine translation. The datasets were created through a pipeline involving data scraping from Nepali news and online sources, pre-processing, semantic filtering, balancing for tense and polarity (in the NepTam20K dataset), expert translation into Tamang by native speakers of the language, and verification by an expert Tamang linguist. The dataset covers five…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multilingual Education and Policy
