NagaNLP: Bootstrapping NLP for Low-Resource Nagamese Creole with Human-in-the-Loop Synthetic Data

Agniva Maiti; Manya Pandey; Murari Mandal

arXiv:2512.12537 · cs.CL · December 16, 2025

TL;DR

This paper presents NagaNLP, an open-source toolkit for Nagamese that uses human-validated synthetic data and LLMs to develop NLP models, significantly improving performance on foundational tasks for this low-resource creole language.

Contribution

It introduces a novel human-in-the-loop synthetic data generation pipeline and establishes new benchmarks for Nagamese NLP, including models and datasets.

Findings

01. Achieved 93.81% accuracy on POS tagging

02. Attained an F1 score of 0.75 on NER

03. Developed a conversational model with a perplexity of 3.85
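
For readers unfamiliar with the third metric, perplexity is conventionally the exponential of the mean negative log-likelihood per token, so lower is better. The sketch below illustrates that standard definition only; the paper's actual tokenizer, test set, and evaluation code are not shown on this page, and the function name `perplexity` and the sample inputs are illustrative assumptions.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Return exp of the mean negative natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Illustrative input (not from the paper): a model that assigns every
# token probability 1/4 is as uncertain as a uniform 4-way choice.
uniform = [math.log(0.25)] * 8
ppl = perplexity(uniform)
```

Read this way, a perplexity of 3.85 means the model is, on average, about as uncertain as a uniform choice among roughly four tokens at each step.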

Abstract

The vast majority of the world's languages, particularly creoles like Nagamese, remain severely under-resourced in Natural Language Processing (NLP), creating a significant barrier to their representation in digital technology. This paper introduces NagaNLP, a comprehensive open-source toolkit for Nagamese, bootstrapped through a novel methodology that relies on LLM-driven but human-validated synthetic data generation. We detail a multi-stage pipeline where an expert-guided LLM (Gemini) generates a candidate corpus, which is then refined and annotated by native speakers. This synthetic-hybrid approach yielded a 10K-pair conversational dataset and a high-quality annotated corpus for foundational tasks. To assess the effectiveness of our methodology, we trained both discriminative and generative models. Our fine-tuned XLM-RoBERTa-base model establishes a new benchmark for Nagamese,…
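
The generate-then-validate loop described in the abstract can be pictured with a minimal sketch. This is not the authors' implementation: `Candidate`, `build_corpus`, `toy_generator`, and `toy_reviewer` are hypothetical names, the generator is a stub standing in for an LLM call, the reviewer is a stub standing in for a native speaker, and the strings are placeholders rather than real Nagamese.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class Candidate:
    prompt: str
    response: str


# A reviewer returns the (possibly corrected) candidate, or None to reject it.
Reviewer = Callable[[Candidate], Optional[Candidate]]


def build_corpus(candidates: Iterable[Candidate], review: Reviewer) -> List[Candidate]:
    """Keep only candidates that the reviewer accepts or corrects."""
    corpus = []
    for cand in candidates:
        checked = review(cand)
        if checked is not None:
            corpus.append(checked)
    return corpus


def toy_generator() -> List[Candidate]:
    # Hypothetical stand-in for LLM output; placeholder text only.
    return [
        Candidate("greeting", "candidate reply A"),
        Candidate("farewell", ""),  # empty output, should be rejected
        Candidate("question", "candidate reply B  "),
    ]


def toy_reviewer(cand: Candidate) -> Optional[Candidate]:
    # Hypothetical stand-in for a native speaker: reject empty text,
    # otherwise return a lightly corrected copy (whitespace trimmed).
    text = cand.response.strip()
    if not text:
        return None
    return Candidate(cand.prompt, text)


corpus = build_corpus(toy_generator(), toy_reviewer)
```

Modelling the reviewer as a function that can return a corrected candidate, not just accept or reject, reflects the abstract's wording that the candidate corpus is "refined and annotated" rather than merely filtered.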

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis