TL;DR
This paper explores how minimal-cost, community-informed NLP techniques can support the preservation of the endangered Comanche language, demonstrating promising results with large language models in low-resource settings.
Contribution
It introduces the first computational study of Comanche, including a curated dataset, data generation pipeline, and evaluation of GPT models for language identification.
Findings
Few-shot prompting greatly improves LLM performance on Comanche
LLMs struggle with zero-shot language identification in low-resource settings
Targeted NLP approaches can aid endangered language preservation
Abstract
The digital exclusion of endangered languages remains a critical challenge in NLP, limiting both linguistic research and revitalization efforts. This study introduces the first computational investigation of Comanche, an Uto-Aztecan language on the verge of extinction, demonstrating how minimal-cost, community-informed NLP interventions can support language preservation. We present a manually curated dataset of 412 phrases, a synthetic data generation pipeline, and an empirical evaluation of GPT-4o and GPT-4o-mini for language identification. Our experiments reveal that while LLMs struggle with Comanche in zero-shot settings, few-shot prompting significantly improves performance, achieving near-perfect accuracy with just five examples. Our findings highlight the potential of targeted NLP methodologies in low-resource contexts and emphasize that visibility is the first step toward…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
