Quantifying and Reducing Speaker Heterogeneity within the Common Voice Corpus for Phonetic Analysis

Miao Zhang; Aref Farhadipour; Annie Baker; Jiachen Ma; Bogdan Pricop; Eleanor Chodroff

arXiv:2506.00733·eess.AS·June 3, 2025

TL;DR

This paper investigates how to better identify individual speakers within the Mozilla Common Voice Corpus by quantifying and reducing heterogeneity using voice embeddings, improving phonetic analysis and speech technology applications.

Contribution

It introduces a method using ResNet-based voice embeddings to quantify and reduce speaker heterogeneity in the Common Voice Corpus, enhancing speaker identification accuracy.

Findings

01. ResNet-based voice embeddings effectively measure speaker similarity.

02. A threshold for reducing heterogeneity improves speaker clustering.

03. Enhanced speaker identification benefits phonetic research and speech technology.
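As a rough illustration of the second finding, the sketch below picks a similarity threshold from labelled pairs. This page does not state how the optimal threshold was chosen, so the equal-error-style criterion used here is an assumption, and the function names and toy scores are hypothetical.

```python
from typing import List, Tuple


def error_rates(same: List[float], diff: List[float], threshold: float) -> Tuple[float, float]:
    """Return (false_accept_rate, false_reject_rate) at a given threshold.

    A pair is accepted as "same speaker" when its score is >= threshold.
    """
    false_reject = sum(s < threshold for s in same) / len(same)
    false_accept = sum(s >= threshold for s in diff) / len(diff)
    return false_accept, false_reject


def pick_threshold(same: List[float], diff: List[float]) -> float:
    """Pick the observed score where the two error rates are closest.

    Assumption: an equal-error-style criterion; other criteria are possible.
    """
    candidates = sorted(set(same) | set(diff))

    def gap(t: float) -> float:
        far, frr = error_rates(same, diff, t)
        return abs(far - frr)

    return min(candidates, key=gap)


# Toy similarity scores (made up for illustration only).
same_speaker_scores = [0.9, 0.8, 0.7, 0.4]
different_speaker_scores = [0.1, 0.2, 0.3, 0.6]

threshold = pick_threshold(same_speaker_scores, different_speaker_scores)
far, frr = error_rates(same_speaker_scores, different_speaker_scores, threshold)
print(threshold, far, frr)
```

With real data, the two score lists would come from recording pairs whose speaker identity is known, which is what a speaker discrimination task provides.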

Abstract

With its crosslinguistic and cross-speaker diversity, the Mozilla Common Voice Corpus (CV) has been a valuable resource for multilingual speech technology and holds tremendous potential for research in crosslinguistic phonetics and speech sciences. Properly accounting for speaker variation is, however, key to the theoretical and statistical bases of speech research. While CV provides a client ID as an approximation to a speaker ID, multiple speakers can contribute under the same ID. This study aims to quantify and reduce heterogeneity in the client ID for a better approximation of a true, though still anonymous, speaker ID. Using ResNet-based voice embeddings, we obtained a similarity score among recordings with the same client ID, then implemented a speaker discrimination task to identify an optimal threshold for reducing perceived speaker heterogeneity. These results have major…
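The similarity-and-threshold idea in the abstract can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' pipeline: the embeddings are tiny hand-written vectors standing in for ResNet voice embeddings, the score is assumed to be each recording's mean cosine similarity to the other recordings under the same client ID, and the threshold of 0.5 is an arbitrary illustrative value.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("zero-length embedding")
    return dot / (norm_a * norm_b)


def mean_similarity_to_others(embeddings: List[Sequence[float]]) -> List[float]:
    """For each recording, average its similarity to all other recordings.

    Assumption: this averaging is one plausible scoring scheme, chosen here
    for illustration.
    """
    scores = []
    for i, emb in enumerate(embeddings):
        others = [cosine_similarity(emb, other)
                  for j, other in enumerate(embeddings) if j != i]
        # A client ID with a single recording has nothing to compare against.
        scores.append(sum(others) / len(others) if others else 1.0)
    return scores


def filter_client(embeddings: List[Sequence[float]], threshold: float) -> List[int]:
    """Return indices of recordings whose score meets the threshold."""
    scores = mean_similarity_to_others(embeddings)
    return [i for i, score in enumerate(scores) if score >= threshold]


# Toy client ID: three similar recordings and one that points elsewhere.
toy_client = [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05], [0.0, 1.0]]
kept = filter_client(toy_client, threshold=0.5)
print(kept)
```

In this toy case the fourth recording is dropped because it is nearly orthogonal to the other three, which mimics a second speaker contributing under the same client ID.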

Taxonomy

Topics: Speech Recognition and Synthesis