TL;DR
This paper introduces unsupervised phonotactic anomaly detection methods for quality control in language documentation, flagging likely transcription errors and borrowings in Kokborok wordlists.
Contribution
It presents unsupervised algorithms that use character-level and syllable-level phonotactic features to detect inconsistencies in wordlists, supporting data quality in low-resource language documentation.
Findings
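Syllable-aware features significantly outperform character-level baselines.
Precision and recall remain modest because the anomalies are subtle, but the high-recall setting flags entries that need verification.
The approach gives fieldworkers a systematic method for quality control.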
Abstract
Lexical data collection in language documentation often contains transcription errors and undocumented borrowings that can mislead linguistic analysis. We present unsupervised anomaly detection methods to identify phonotactic inconsistencies in wordlists, applying them to a multilingual dataset of Kokborok varieties with Bangla. Using character-level and syllable-level phonotactic features, our algorithms identify potential transcription errors and borrowings. While precision and recall remain modest due to the subtle nature of these anomalies, syllable-aware features significantly outperform character-level baselines. The high-recall approach provides fieldworkers with a systematic method to flag entries requiring verification, supporting data quality improvement in low-resourced language documentation.
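To make the idea of character-level phonotactic scoring concrete, here is a minimal sketch, not the paper's actual algorithm. It assumes an add-one smoothed character bigram model and ranks words by mean per-bigram surprisal, so that forms with unusual sound sequences rise to the top for manual checking. The function names (`train`, `surprisal`, `rank_anomalies`) and the toy wordlist are hypothetical; the wordlist is invented for illustration and is not Kokborok or Bangla data.

```python
import math
from collections import Counter


def bigrams(word):
    """Return character bigrams of a word padded with '#' boundary marks."""
    padded = "#" + word + "#"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def train(words):
    """Count bigrams and their left contexts over the whole wordlist."""
    pair_counts = Counter()
    context_counts = Counter()
    alphabet = {"#"}
    for w in words:
        alphabet.update(w)
        for bg in bigrams(w):
            pair_counts[bg] += 1
            context_counts[bg[0]] += 1
    return pair_counts, context_counts, len(alphabet)


def surprisal(word, model):
    """Mean negative log-probability per bigram, with add-one smoothing."""
    pair_counts, context_counts, vocab_size = model
    grams = bigrams(word)
    total = 0.0
    for bg in grams:
        p = (pair_counts[bg] + 1) / (context_counts[bg[0]] + vocab_size)
        total -= math.log(p)
    return total / len(grams)


def rank_anomalies(words, top_k=3):
    """Return the top_k words with the highest mean surprisal."""
    model = train(words)
    ranked = sorted(words, key=lambda w: surprisal(w, model), reverse=True)
    return ranked[:top_k]


# Invented toy wordlist (hypothetical forms, not real language data).
words = ["kama", "mako", "kami", "naka", "mana", "kano", "strpk"]
print(rank_anomalies(words, top_k=1))
```

A syllable-aware variant would replace `bigrams` with features over syllable onsets, nuclei, and codas; the segmentation rules that requires are language-specific and are not specified in the abstract, so they are left out here. Raising `top_k` trades precision for recall, which matches the use case of flagging many candidates for a fieldworker to verify.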