Automated Quality Control for Language Documentation: Detecting Phonotactic Inconsistencies in a Kokborok Wordlist

Kellen Parker van Dam; Abishek Stephen

arXiv:2510.21584·cs.CL·February 12, 2026

TL;DR

This paper introduces unsupervised phonotactic anomaly detection methods to improve quality control in language documentation, specifically identifying transcription errors and borrowings in Kokborok wordlists.

Contribution

The paper presents unsupervised anomaly detection algorithms that use character-level and syllable-level phonotactic features to detect inconsistencies in wordlists, supporting data quality in low-resource language documentation.

Findings

01. Syllable-aware features significantly outperform character-level baselines.

02. The high-recall approach flags entries that need verification, although precision and recall remain modest.

03. The approach gives fieldworkers a systematic method for quality control.

Abstract

Lexical data collection in language documentation often contains transcription errors and undocumented borrowings that can mislead linguistic analysis. We present unsupervised anomaly detection methods to identify phonotactic inconsistencies in wordlists, applying them to a multilingual dataset of Kokborok varieties with Bangla. Using character-level and syllable-level phonotactic features, our algorithms identify potential transcription errors and borrowings. While precision and recall remain modest due to the subtle nature of these anomalies, syllable-aware features significantly outperform character-level baselines. The high-recall approach provides fieldworkers with a systematic method to flag entries requiring verification, supporting data quality improvement in low-resourced language documentation.
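
To make the character-level idea concrete, here is a minimal sketch of one way such a baseline could work. It is not the authors' algorithm, which this page does not describe in detail. It assumes a smoothed character bigram model trained on the wordlist itself, with each word scored by its average negative log-probability. The function names and the toy wordlist are hypothetical, and the words are invented strings, not Kokborok data.

```python
import math
from collections import Counter


def char_bigrams(word):
    """Return character bigrams of a word, padded with '#' boundary marks."""
    padded = "#" + word + "#"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def anomaly_scores(words):
    """Score each word by its mean negative log-probability under an
    add-one-smoothed character bigram model fitted to the same wordlist.
    Higher scores mean rarer sound sequences (assumed scoring scheme)."""
    bigram_counts = Counter(bg for w in words for bg in char_bigrams(w))
    # How often each symbol occurs as the first element of a bigram.
    context_counts = Counter()
    for bg, n in bigram_counts.items():
        context_counts[bg[0]] += n
    vocab_size = len({ch for w in words for ch in w} | {"#"})

    scores = {}
    for w in words:
        bgs = char_bigrams(w)
        nll = 0.0
        for bg in bgs:
            p = (bigram_counts[bg] + 1) / (context_counts[bg[0]] + vocab_size)
            nll -= math.log(p)
        scores[w] = nll / len(bgs)  # length-normalised
    return scores


def flag_top(words, k):
    """Return the k highest-scoring words as candidates for manual checking."""
    scores = anomaly_scores(words)
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Invented toy wordlist: six CVCV forms and one consonant-only outlier.
toy_words = ["bata", "taba", "bana", "nata", "tana", "naba", "kstr"]
print(flag_top(toy_words, 1))
```

A syllable-aware variant would swap `char_bigrams` for features built over syllable units (for example onset, nucleus, and coda), which requires a syllabifier for the language in question. Because the flagged items are only candidates, the output is best read as a checklist for a fieldworker rather than a list of confirmed errors.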
