TharuChat: Bootstrapping Large Language Models for a Low-Resource Language via Synthetic Data and Human Validation

Prajwal Panth; Agniva Maiti

arXiv:2603.17220 · cs.CL · March 19, 2026


TL;DR

This paper introduces TharuChat, a synthetic dataset generated with LLMs and validated by humans, and uses it to develop a specialized language model for Tharu, an under-resourced Indo-Aryan language of the Terai belt of Nepal and India, showing that language preservation is feasible despite data limitations.

Contribution

We present a novel LLM-to-Human bootstrapping pipeline for synthetic data creation and develop Tharu-LLaMA, a language model tailored for Tharu, addressing data scarcity and linguistic diversity.

Findings

01. Synthetic data significantly reduces perplexity in language modeling.

02. Small-scale synthetic datasets can effectively improve model performance.

03. The approach enables language preservation on consumer hardware.
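
For readers unfamiliar with the metric in the first finding, perplexity is the exponential of the average negative log-likelihood a model assigns to each token, so lower values mean the model finds the text less surprising. The sketch below is a minimal illustration of that definition only; the function name `perplexity` and the probability values are hypothetical and are not taken from the paper or its evaluation code.

```python
import math


def perplexity(token_log_probs):
    """Return exp of the mean negative log-likelihood.

    token_log_probs: natural-log probabilities the model assigned
    to each observed token (one float per token).
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# Hypothetical numbers for illustration only, NOT results from the paper:
# a model that gives each token probability 0.05 versus one that gives 0.25.
before = perplexity([math.log(0.05)] * 10)
after = perplexity([math.log(0.25)] * 10)
print(f"before: {before:.1f}, after: {after:.1f}")
```

In this toy case the perplexity equals the reciprocal of the per-token probability, which is why a model that puts more probability mass on the correct Tharu tokens scores lower.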

Abstract

The rapid proliferation of Large Language Models (LLMs) has created a profound digital divide, effectively excluding indigenous languages of the Global South from the AI revolution. The Tharu language, an Indo-Aryan vernacular spoken by approximately 1.7 million people across the Terai belt of Nepal and India, exemplifies this crisis. Despite a rich oral tradition, Tharu suffers from severe data scarcity and linguistic fragmentation, causing state-of-the-art multilingual models to routinely "hallucinate" or default to dominant high-resource neighbors like Hindi and Nepali due to contamination in pre-training corpora. This paper presents Tharu-LLaMA (3B), a specialized instruction-following model designed to address this exclusion. We introduce TharuChat, a novel dataset constructed via an LLM-to-Human bootstrapping pipeline. We utilized prompt-engineered Gemini models, fed with Rana…

Taxonomy

Topics: Language and cultural evolution · ICT in Developing Communities · Natural Language Processing Techniques