TidyVoice: A Curated Multilingual Dataset for Speaker Verification Derived from Common Voice
Aref Farhadipour, Jan Marquenie, Srikanth Madikeri, Eleanor Chodroff

TL;DR
This paper introduces TidyVoice, a large multilingual speaker verification dataset derived from Common Voice, along with ResNet-based models that reach 0.35% EER on the Tidy-M condition and generalize better to unseen data after fine-tuning.
Contribution
The paper presents a curated, large-scale multilingual speaker verification dataset and shows that ResNet models fine-tuned on it gain in both performance and generalization.
Findings
ResNet models achieved a 0.35% equal error rate (EER) on the Tidy-M condition.
Fine-tuning improved generalization on unseen data.
Public release of dataset and models supports community research.
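For readers unfamiliar with the metric, the EER is the operating point at which the false acceptance rate on non-target trials equals the false rejection rate on target trials. The sketch below is a minimal illustration of that definition, not the paper's evaluation code: the function name `equal_error_rate`, the threshold sweep over observed scores, and the convention that a higher score means "same speaker" are all assumptions made for this example.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    Assumes higher scores indicate a more likely same-speaker trial.
    """
    if not target_scores or not nontarget_scores:
        raise ValueError("need at least one target and one non-target score")

    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = None, None
    for t in thresholds:
        # False rejection: a target trial scoring below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: a non-target trial scoring at or above the threshold.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

With the toy scores `[0.9, 0.8, 0.7, 0.4]` for target trials and `[0.6, 0.3, 0.2, 0.1]` for non-target trials, the function returns 0.25: at a threshold of 0.6, one of four targets is rejected and one of four non-targets is accepted. Production toolkits typically interpolate between thresholds rather than picking the closest one, so values on real trial lists may differ slightly.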
Abstract
The development of robust, multilingual speaker recognition systems is hindered by a lack of large-scale, publicly available and multilingual datasets, particularly for the read-speech style crucial for applications like anti-spoofing. To address this gap, we introduce the TidyVoice dataset derived from the Mozilla Common Voice corpus after mitigating its inherent speaker heterogeneity within the provided client IDs. TidyVoice currently contains training and test data from over 212,000 monolingual speakers (Tidy-M) and around 4,500 multilingual speakers (Tidy-X) from which we derive two distinct conditions. The Tidy-M condition contains target and non-target trials from monolingual speakers across 81 languages. The Tidy-X condition contains target and non-target trials from multilingual speakers in both same- and cross-language trials. We employ two architectures of ResNet models,…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Authorship Attribution and Profiling
