Effects of Speaker Count, Duration, and Accent Diversity on Zero-Shot Accent Robustness in Low-Resource ASR

Zheng-Xin Yong; Vineel Pratap; Michael Auli; Jean Maillard

arXiv:2506.04364·cs.CL·June 6, 2025

Open Access

TL;DR

This study investigates how the number of speakers, the audio duration per speaker, and accent diversity in training data affect the robustness of low-resource ASR systems to unseen accents. It finds that speaker count matters more than accent diversity.

Contribution

The paper systematically analyzes the impact of speaker count, duration, and accent diversity on ASR robustness, providing practical guidance for data collection in low-resource language settings.

Findings

01

Increasing speaker count improves ASR robustness more than increasing duration per speaker.

02

With more speakers in the training data, adding training hours yields larger ASR performance gains.

03

Accent diversity has minimal impact when speaker count is fixed.

Abstract

To build an automatic speech recognition (ASR) system that can serve everyone in the world, the system needs to be robust to a wide range of accents, including unseen accents. We systematically study how three variables in training data -- the number of speakers, the audio duration per speaker, and the diversity of accents -- affect ASR robustness towards unseen accents in a low-resource training regime. We observe that for a fixed number of ASR training hours, it is more beneficial to increase the number of speakers (which means each speaker contributes less) than the number of hours contributed per speaker. We also observe that more speakers enable ASR performance gains from scaling the number of hours. Surprisingly, we observe minimal benefits to prioritizing speakers with different accents when the number of speakers is controlled. Our work suggests that…
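The fixed-budget trade-off in the abstract (more speakers means less audio per speaker) can be made concrete with a small sketch. This is only an illustration under stated assumptions, not the authors' sampling procedure: the function name `sample_fixed_budget`, the `(speaker_id, duration_seconds)` tuple format, and the equal split of the budget across speakers are all hypothetical choices made for the example.

```python
import random
from collections import defaultdict


def sample_fixed_budget(utterances, num_speakers, total_seconds, seed=0):
    """Select a training subset with a fixed total audio budget.

    Assumption (not from the paper): the budget is split equally, so each
    chosen speaker contributes at most total_seconds / num_speakers.
    `utterances` is a list of (speaker_id, duration_seconds) tuples.
    """
    rng = random.Random(seed)

    # Group utterance durations by speaker.
    by_speaker = defaultdict(list)
    for speaker_id, duration in utterances:
        by_speaker[speaker_id].append(duration)

    if num_speakers > len(by_speaker):
        raise ValueError("not enough distinct speakers in the corpus")

    # Pick the speakers, then fill each speaker's share of the budget.
    chosen = rng.sample(sorted(by_speaker), num_speakers)
    per_speaker_budget = total_seconds / num_speakers

    subset = []
    for speaker_id in chosen:
        durations = list(by_speaker[speaker_id])
        rng.shuffle(durations)
        used = 0.0
        for duration in durations:
            if used + duration > per_speaker_budget:
                continue  # skip utterances that would exceed this speaker's share
            subset.append((speaker_id, duration))
            used += duration
    return subset


# Toy corpus: 10 speakers, each with 100 utterances of 6 seconds.
corpus = [(f"spk{i}", 6.0) for i in range(10) for _ in range(100)]

# Same 20-minute budget, two different speaker counts.
few_speakers = sample_fixed_budget(corpus, num_speakers=2, total_seconds=1200.0)
many_speakers = sample_fixed_budget(corpus, num_speakers=10, total_seconds=1200.0)
```

Both subsets hold the same total audio; only the number of speakers and the audio per speaker differ, which is the comparison the abstract describes.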

Taxonomy

Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · ICT in Developing Communities