Hybrid Neural-LLM Pipeline for Morphological Glossing in Endangered Language Documentation: A Case Study of Jungar Tuvan

Siyu Liang; Talant Mawkanuli; Gina-Anne Levow

arXiv:2603.00923 · cs.CL · March 3, 2026

TL;DR

This paper introduces a hybrid neural and LLM-based pipeline for automatic morphological glossing, aimed at reducing annotation effort in endangered language documentation, and evaluates it on Jungar Tuvan, a low-resource Turkic language.

Contribution

The paper presents a two-stage pipeline that combines neural sequence labeling (a BiLSTM-CRF model) with LLM post-correction, and reports ablation studies on retrieval-augmented prompting and on the effect of morpheme dictionaries.
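The two-stage idea can be sketched as a small function that runs a first-pass tagger and then hands its draft to a corrector. This is a minimal sketch under stated assumptions, not the authors' implementation: `two_stage_gloss`, `toy_tagger`, and `toy_corrector` are hypothetical names, the real stages would be a trained sequence labeler and an LLM call, and the fallback to the draft when the corrector breaks token alignment is a design choice made here for illustration.

```python
from typing import Callable, List, Sequence

Glosses = List[str]


def two_stage_gloss(
    tokens: Sequence[str],
    tagger: Callable[[Sequence[str]], Glosses],
    corrector: Callable[[Sequence[str], Glosses], Glosses],
) -> Glosses:
    """Run a first-pass tagger, then let a corrector revise its draft."""
    # Stage 1: the neural sequence labeler proposes one gloss per token.
    draft = list(tagger(tokens))
    if len(draft) != len(tokens):
        raise ValueError("stage 1 must return exactly one gloss per token")

    # Stage 2: the corrector (an LLM in the paper's setting) sees both the
    # tokens and the draft, and returns a revised gloss sequence.
    revised = list(corrector(tokens, draft))

    # Assumption: if the corrector breaks alignment, keep the draft.
    return revised if len(revised) == len(tokens) else draft


# Hypothetical stand-ins so the sketch runs without any model.
def toy_tagger(tokens: Sequence[str]) -> Glosses:
    return ["UNK" for _ in tokens]


def toy_corrector(tokens: Sequence[str], draft: Glosses) -> Glosses:
    return [f"?{tok}" if gloss == "UNK" else gloss
            for tok, gloss in zip(tokens, draft)]
```

Passing the stages in as callables keeps the control flow testable on its own; either stage can be swapped for a real model without changing the wrapper.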

Findings

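01 Retrieval-augmented prompting gives substantial gains over random example selection.

02 Morpheme dictionaries hurt performance in most cases, compared with providing no dictionary at all.

03 Performance scales approximately logarithmically with the number of few-shot examples.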
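
Read as a formula, the scaling finding says that accuracy grows roughly linearly in the logarithm of the number of few-shot examples. A sketch of that relationship, where $k$ is the number of examples and $a$, $b$ are unspecified fitted constants (placeholders, not values reported here):

```latex
\mathrm{Acc}(k) \approx a + b \ln k,
\qquad
\mathrm{Acc}(2k) - \mathrm{Acc}(k) \approx b \ln 2 .
```

Under this form, each doubling of the example count adds about the same fixed increment, so going from 4 to 8 examples would help about as much as going from 32 to 64.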

Abstract

Interlinear glossed text (IGT) creation remains a major bottleneck in linguistic documentation and fieldwork, particularly for low-resource morphologically rich languages. We present a hybrid automatic glossing pipeline that combines neural sequence labeling with large language model (LLM) post-correction, evaluated on Jungar Tuvan, a low-resource Turkic language. Through systematic ablation studies, we show that retrieval-augmented prompting provides substantial gains over random example selection. We further find that morpheme dictionaries paradoxically hurt performance compared to providing no dictionary at all in most cases, and that performance scales approximately logarithmically with the number of few-shot examples. Most significantly, our two-stage pipeline combining a BiLSTM-CRF model with LLM post-correction yields substantial gains for most models, achieving meaningful…
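
The contrast between retrieval-augmented and random example selection can be illustrated with a short sketch. Everything here is an assumption made for illustration: the excerpt does not say which retriever the authors use, so character-trigram Jaccard similarity stands in for it, and the function names, the prompt wording, and the `POOL` entries (invented strings with placeholder glosses, not Jungar Tuvan data) are hypothetical.

```python
import random


def char_ngrams(text, n=3):
    """Return the set of character n-grams of a padded, lowercased string."""
    padded = f"#{text.lower()}#"
    return {padded[i:i + n] for i in range(max(len(padded) - n + 1, 1))}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve_examples(query, pool, k=2):
    """Pick the k pool entries whose sentences are most similar to the query."""
    q = char_ngrams(query)
    ranked = sorted(pool, key=lambda ex: jaccard(q, char_ngrams(ex[0])),
                    reverse=True)
    return ranked[:k]


def random_examples(pool, k=2, seed=0):
    """Baseline: pick k pool entries uniformly at random (seeded)."""
    return random.Random(seed).sample(pool, k)


def build_prompt(query, examples):
    """Assemble a few-shot prompt; the wording is a placeholder."""
    parts = ["Gloss the final sentence, following the examples."]
    for sentence, gloss in examples:
        parts.append(f"Sentence: {sentence}\nGloss: {gloss}")
    parts.append(f"Sentence: {query}\nGloss:")
    return "\n\n".join(parts)


# Invented strings with placeholder glosses, NOT real language data.
POOL = [
    ("tamalar kelder", "PLACEHOLDER-GLOSS-1"),
    ("sunik borat", "PLACEHOLDER-GLOSS-2"),
    ("tamalar kelgen", "PLACEHOLDER-GLOSS-3"),
    ("ozum pirte", "PLACEHOLDER-GLOSS-4"),
]
```

For the query `"tamalar kelde"`, `retrieve_examples` ranks the two `tamalar` entries first because they share the most trigrams with the query, whereas `random_examples` ignores the query entirely. The only difference between the two conditions is which examples reach `build_prompt`.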

Taxonomy

Topics: Natural Language Processing Techniques · Language and cultural evolution · Linguistics and Cultural Studies