Some voices are too common: Building fair speech recognition systems using the Common Voice dataset
Lucas Maison, Yannick Estève

TL;DR
This paper investigates biases in speech recognition systems, specifically wav2vec 2.0, using the French Common Voice dataset, highlighting the importance of speaker diversity and dataset shortcomings for fairer AI development.
Contribution
It provides a detailed analysis of demographic biases in a popular speech dataset and demonstrates how fine-tuning with diverse data can improve fairness in ASR systems.
Findings
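The biases of a pre-trained wav2vec 2.0 model toward several demographic groups can be quantified using the French Common Voice dataset.
Speaker diversity in the fine-tuning data matters for model fairness, even when the training set size is held fixed.
The Common Voice corpus has important shortcomings that users of the dataset should take into account.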
Abstract
Automatic speech recognition (ASR) systems are becoming increasingly efficient thanks to new advances in neural network training like self-supervised learning. However, they are known to be unfair toward certain groups, for instance, people speaking with an accent. In this work, we use the French Common Voice dataset to quantify the biases of a pre-trained wav2vec 2.0 model toward several demographic groups. By fine-tuning the pre-trained model on a variety of fixed-size, carefully crafted training sets, we demonstrate the importance of speaker diversity. We also run an in-depth analysis of the Common Voice corpus and identify important shortcomings that should be taken into account by users of this dataset.
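As an illustration of what quantifying bias toward demographic groups can look like in practice, the sketch below computes a word error rate (WER) per group from reference and hypothesis transcripts. It is a minimal sketch, not the authors' evaluation code: the abstract does not name the metric here, so using WER is an assumption, and the function names (`word_errors`, `wer_by_group`), the group labels, and the toy transcripts are all hypothetical.

```python
from collections import defaultdict


def word_errors(reference: str, hypothesis: str) -> int:
    """Word-level Levenshtein distance between two transcripts."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def wer_by_group(samples):
    """Aggregate WER per group.

    samples: iterable of (group, reference, hypothesis) triples.
    WER is total word errors divided by total reference words in the group.
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for group, ref, hyp in samples:
        errors[group] += word_errors(ref, hyp)
        words[group] += len(ref.split())
    return {g: errors[g] / words[g] for g in errors if words[g] > 0}


# Toy data for illustration only (not taken from the paper or from Common Voice).
samples = [
    ("group_a", "bonjour tout le monde", "bonjour tout le monde"),
    ("group_a", "il fait beau", "il fait chaud"),
    ("group_b", "merci beaucoup", "merci"),
    ("group_b", "à demain matin", "à deux mains matin"),
]

if __name__ == "__main__":
    for group, wer in sorted(wer_by_group(samples).items()):
        print(f"{group}: WER = {wer:.3f}")
```

Comparing the per-group values (for example, the gap between the best and worst group) gives one simple view of how unevenly a model performs. Aggregating errors over all words in a group, rather than averaging per-utterance WER, keeps short utterances from dominating the result.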
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Natural Language Processing Techniques
