NepTam: A Nepali-Tamang Parallel Corpus and Baseline Machine Translation Experiments

Rupak Raj Ghimire; Bipesh Subedi; Balaram Prasain; Prakash Poudyal; Praveen Acharya; Nischal Karki; Rupak Tiwari; Rishikesh Kumar Sharma; Jenny Poudel; Bal Krishna Bal

arXiv:2603.14053 · cs.CL · March 17, 2026

Open Access

TL;DR

This paper introduces NepTam, a Nepali-Tamang parallel corpus of 20K gold-standard and 80K synthetic sentence pairs that supports machine translation research for these low-resource languages. In baseline experiments, NLLB-200 achieves the highest BLEU scores (40.92 and 45.26).

Contribution

It creates the first large-scale Nepali-Tamang parallel corpus and evaluates multiple multilingual models, establishing a baseline for future translation research.

Findings

01

NLLB-200 achieved the highest BLEU scores of 40.92 and 45.26.

02

The datasets cover five diverse domains.

03

Baseline models demonstrate effective translation performance.
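
The BLEU scores in the findings measure n-gram overlap between system output and reference translations. The following is a minimal sketch of corpus-level BLEU, assuming whitespace tokenization, one reference per sentence, and no smoothing. The scoring tool and tokenization used for the reported numbers are not stated on this page, and the names `ngrams` and `corpus_bleu` are illustrative.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale, one reference per hypothesis.

    Assumptions (not taken from the paper): whitespace tokenization,
    no smoothing, uniform weights over 1..max_n grams.
    """
    matched = [0] * max_n
    total = [0] * max_n
    hyp_len = 0
    ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipped counts: each hypothesis n-gram is credited at most
            # as many times as it occurs in the reference.
            matched[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            total[n - 1] += sum(h_counts.values())
    # Without smoothing, any zero n-gram match count makes the score zero.
    if hyp_len == 0 or min(matched) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    # Brevity penalty discourages hypotheses shorter than the references.
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity_penalty * math.exp(log_precision)
```

Calling `corpus_bleu(system_outputs, reference_translations)` on two equal-length lists of strings returns a single score on the 0-100 scale on which values such as 40.92 are conventionally reported. Scores from this sketch are not directly comparable to published numbers, because tokenization and smoothing choices change BLEU noticeably.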

Abstract

Modern translation systems rely heavily on large, high-quality parallel datasets for state-of-the-art performance. However, such resources are largely unavailable for most South Asian languages. Nepali and Tamang fall into this category, with Tamang being among the least digitally resourced languages in the region. This work addresses the gap by developing NepTam20K, a 20K gold-standard parallel corpus, and NepTam80K, an 80K synthetic Nepali-Tamang parallel corpus, both sentence-aligned and designed to support machine translation. The datasets were created through a pipeline involving data scraping from Nepali news and online sources, pre-processing, semantic filtering, balancing for tense and polarity (in the NepTam20K dataset), expert translation into Tamang by native speakers of the language, and verification by an expert Tamang linguist. The dataset covers five…

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Multilingual Education and Policy