Towards dialect-inclusive recognition in a low-resource language: are balanced corpora the answer?
Liam Lonergan, Mengjie Qian, Neasa Ní Chiaráin, Christer Gobl, Ailbhe Ní Chasaide

TL;DR
This study investigates whether balanced dialect corpora improve automatic speech recognition for Irish's three major dialects, revealing that balanced data alone does not ensure equitable performance across dialects.
Contribution
The paper provides empirical evidence on how dialect-balanced training data affects ASR performance across the three major Irish dialects, including the asymmetric relationship between the Connacht and Munster dialects.
Findings
Dialect-balanced corpora do not produce equal performance across dialects.
Ulster dialect consistently underperforms in recognition accuracy.
Munster dialect yields the lowest word error rates.
Abstract
ASR systems are generally built for the spoken 'standard', and their performance declines for non-standard dialects/varieties. This is a problem for a language like Irish, where there is no single spoken standard, but rather three major dialects: Ulster (Ul), Connacht (Co) and Munster (Mu). As a diagnostic to quantify the effect of the speaker's dialect on recognition performance, 12 ASR systems were trained, firstly using baseline dialect-balanced training corpora, and then using modified versions of the baseline corpora, where dialect-specific materials were either subtracted or added. Results indicate that dialect-balanced corpora do not yield similar performance across the dialects: the Ul dialect consistently underperforms, whereas Mu yields the lowest word error rates (WERs). There is a close relationship between the Co and Mu dialects, but one that is not symmetrical. These results will guide future…
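The per-dialect comparison described in the abstract rests on word error rate. As a rough illustration only, and not the authors' evaluation code, the sketch below computes WER separately for each dialect label from (dialect, reference, hypothesis) triples. The function names `edit_distance` and `wer_by_dialect`, the whitespace tokenisation, and the toy placeholder strings are all assumptions made for this example; none of them come from the paper.

```python
from collections import defaultdict


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))  # distance from empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer_by_dialect(samples):
    """Return {dialect: WER}, pooling errors and reference words per dialect.

    `samples` is an iterable of (dialect, reference, hypothesis) triples.
    Tokenisation is plain whitespace splitting (an assumption).
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for dialect, reference, hypothesis in samples:
        ref_words = reference.split()
        errors[dialect] += edit_distance(ref_words, hypothesis.split())
        words[dialect] += len(ref_words)
    return {d: errors[d] / words[d] for d in words if words[d] > 0}


# Toy placeholder data, not results from the paper.
toy_samples = [
    ("Ul", "a b c d", "a x c"),  # one substitution, one deletion
    ("Mu", "a b", "a b"),        # exact match
]
toy_scores = wer_by_dialect(toy_samples)
```

Pooling errors and reference word counts per dialect, rather than averaging per-utterance WERs, keeps long and short utterances weighted by length; whether the paper does the same is not stated in the abstract.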
Taxonomy
Topics: Natural Language Processing Techniques · Interpreting and Communication in Healthcare · Linguistic Variation and Morphology
