Hard to Be Heard: Phoneme-Level ASR Analysis of Phonologically Complex, Low-Resource Endangered Languages
V.S.D.S.Mahesh Akavarapu, Michael Daniel, Gerhard Jäger

TL;DR
This paper analyzes phoneme-level ASR performance on two low-resource, complex East Caucasian languages, revealing data scarcity as a key factor influencing errors and demonstrating the importance of phoneme-level evaluation.
Contribution
It introduces a phoneme-level analysis framework for low-resource languages, compares state-of-the-art models, and highlights data scarcity's impact on phoneme recognition accuracy.
Findings
Phoneme recognition accuracy correlates with training frequency.
Wav2vec2 with a language-specific phoneme vocabulary and heuristic output-layer initialization performs comparably to or better than Whisper.
Data scarcity explains many errors attributed to phonological complexity.
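One way to make the frequency finding concrete is to align reference and predicted phoneme sequences, compute how often each reference phoneme is recognized correctly, and rank-correlate that with training counts. The Python sketch below is an illustration under stated assumptions, not the authors' analysis code: the Levenshtein alignment, the recall definition, the Spearman correlation, and all toy phoneme sequences and counts are hypothetical choices made for this example.

```python
from collections import Counter


def align(ref, hyp):
    """Levenshtein-align two phoneme lists; gaps are marked with None."""
    n, m = len(ref), len(hyp)
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i
    for j in range(1, m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + cost,  # match or substitution
                           dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1)         # insertion
    # Backtrace from the bottom-right corner to recover aligned pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        cost = 0 if i > 0 and j > 0 and ref[i - 1] == hyp[j - 1] else 1
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + cost:
            pairs.append((ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            pairs.append((ref[i - 1], None))
            i -= 1
        else:
            pairs.append((None, hyp[j - 1]))
            j -= 1
    return pairs[::-1]


def per_phoneme_recall(utterances):
    """Fraction of reference occurrences of each phoneme recognized correctly."""
    total, correct = Counter(), Counter()
    for ref, hyp in utterances:
        for r, h in align(ref, hyp):
            if r is None:  # insertions have no reference phoneme to credit
                continue
            total[r] += 1
            correct[r] += int(r == h)
    return {p: correct[p] / total[p] for p in total}


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    out, i = [0.0] * len(values), 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def spearman(x, y):
    """Spearman rank correlation (undefined if either input is constant)."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


# Hypothetical toy data: (reference, hypothesis) phoneme sequences and
# made-up training counts; none of this comes from the paper.
eval_pairs = [
    (["a", "t", "a"], ["a", "t", "a"]),
    (["qʼ", "a", "t"], ["k", "a", "t"]),
    (["t", "a", "qʼ"], ["t", "a"]),
]
train_counts = {"a": 120, "t": 40, "qʼ": 3}

recall = per_phoneme_recall(eval_pairs)
phonemes = sorted(recall)
rho = spearman([train_counts[p] for p in phonemes],
               [recall[p] for p in phonemes])
print(recall, rho)
```

In this toy setup the rare ejective is the only phoneme that is ever misrecognized, so the rank correlation between training count and recall is positive. With real data, a per-phoneme table of this kind can help separate errors on rare phonemes from errors on frequent but acoustically difficult ones.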
Abstract
We present a phoneme-level analysis of automatic speech recognition (ASR) for two low-resourced and phonologically complex East Caucasian languages, Archi and Rutul, based on curated and standardized speech-transcript resources totaling approximately 50 minutes and 1 hour 20 minutes of audio, respectively. Existing recordings and transcriptions are consolidated and processed into a form suitable for ASR training and evaluation. We evaluate several state-of-the-art audio and audio-language models, including wav2vec2, Whisper, and Qwen2-Audio. For wav2vec2, we introduce a language-specific phoneme vocabulary with heuristic output-layer initialization, which yields consistent improvements and achieves performance comparable to or exceeding Whisper in these extremely low-resource settings. Beyond standard word and character error rates, we conduct a detailed phoneme-level error analysis. We…
