Hybrid Neural-LLM Pipeline for Morphological Glossing in Endangered Language Documentation: A Case Study of Jungar Tuvan
Siyu Liang, Talant Mawkanuli, Gina-Anne Levow

TL;DR
This paper introduces a hybrid neural and LLM-based pipeline for automatic morphological glossing, significantly reducing annotation effort in endangered language documentation, demonstrated on the low-resource Jungar Tuvan language.
Contribution
It presents a novel two-stage pipeline combining neural sequence labeling with LLM post-correction, with insights into retrieval-augmented prompting and the impact of dictionaries.
Findings
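Retrieval-augmented prompting provides substantial gains over random example selection.
In most cases, morpheme dictionaries hurt performance compared to providing no dictionary at all.
Performance scales approximately logarithmically with the number of few-shot examples.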
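To make the retrieval finding concrete, here is a minimal sketch of similarity-based few-shot selection. This summary does not say which retriever the paper uses, so the character-trigram cosine similarity, the function names, and the prompt layout below are all assumptions for illustration. The example pool uses Turkish words as stand-ins, not Jungar Tuvan data from the paper.

```python
from collections import Counter
from math import sqrt

# Illustrative pool of glossed examples. These are Turkish words used as
# stand-ins; they are NOT Jungar Tuvan data from the paper.
POOL = [
    {"text": "evlerde", "segments": "ev-ler-de", "gloss": "house-PL-LOC"},
    {"text": "kitaplar", "segments": "kitap-lar", "gloss": "book-PL"},
    {"text": "geldim", "segments": "gel-di-m", "gloss": "come-PST-1SG"},
    {"text": "evler", "segments": "ev-ler", "gloss": "house-PL"},
]


def char_ngrams(text, n=3):
    """Count character n-grams of a space-padded, lowercased string."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve_examples(query, pool, k=2):
    """Return the k pool entries most similar to the query (assumed retriever)."""
    q = char_ngrams(query)
    ranked = sorted(pool, key=lambda ex: cosine(q, char_ngrams(ex["text"])), reverse=True)
    return ranked[:k]


def build_prompt(query, examples):
    """Assemble a few-shot prompt; the layout is hypothetical, not the paper's."""
    shots = "\n\n".join(
        f"Text: {ex['text']}\nSegments: {ex['segments']}\nGloss: {ex['gloss']}"
        for ex in examples
    )
    return f"{shots}\n\nText: {query}\nSegments:"


selected = retrieve_examples("evlerden", POOL, k=2)
prompt = build_prompt("evlerden", selected)
```

Random selection would instead draw the k examples uniformly from the pool, so the prompt could contain forms that share no morphemes with the query. Retrieval puts related forms in front of the model.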
Abstract
Interlinear glossed text (IGT) creation remains a major bottleneck in linguistic documentation and fieldwork, particularly for low-resource morphologically rich languages. We present a hybrid automatic glossing pipeline that combines neural sequence labeling with large language model (LLM) post-correction, evaluated on Jungar Tuvan, a low-resource Turkic language. Through systematic ablation studies, we show that retrieval-augmented prompting provides substantial gains over random example selection. We further find that morpheme dictionaries paradoxically hurt performance compared to providing no dictionary at all in most cases, and that performance scales approximately logarithmically with the number of few-shot examples. Most significantly, our two-stage pipeline combining a BiLSTM-CRF model with LLM post-correction yields substantial gains for most models, achieving meaningful…
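The approximately logarithmic scaling with few-shot examples can be written as a simple relation. The symbols below are assumptions for illustration: Acc(k) stands for glossing accuracy with k examples, and a and b are fit constants that this summary does not report.

```latex
\[
  \mathrm{Acc}(k) \approx a + b \ln k, \qquad k \ge 1,
\]
\[
  \mathrm{Acc}(2k) - \mathrm{Acc}(k) \approx b \ln 2 .
\]
```

Under this form, each doubling of the number of examples adds roughly the same increment, so the benefit per additional example shrinks as k grows.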
Taxonomy
Topics: Natural Language Processing Techniques · Language and cultural evolution · Linguistics and Cultural Studies
