TidyVoice: A Curated Multilingual Dataset for Speaker Verification Derived from Common Voice

Aref Farhadipour; Jan Marquenie; Srikanth Madikeri; Eleanor Chodroff

arXiv:2601.16358 · eess.AS · January 26, 2026


TL;DR

This paper introduces TidyVoice, a large multilingual speaker verification dataset derived from Common Voice, along with ResNet-based models achieving low error rates and improved generalization, supporting robust multilingual speaker recognition research.

Contribution

The paper presents a curated, large-scale multilingual speaker verification dataset and demonstrates effective ResNet models fine-tuned on this data, enhancing performance and generalization.

Findings

01. ResNet models achieved a 0.35% equal error rate (EER) on Tidy-M.

02. Fine-tuning improved generalization on unseen data.

03. The public release of the dataset and models supports community research.
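
For readers unfamiliar with the EER metric quoted in the findings, the following is a minimal sketch of how an equal error rate can be approximated from verification scores. It is not the authors' evaluation code: the function name `equal_error_rate` and the toy scores are assumptions made for illustration, and the threshold sweep is a simple approximation rather than an interpolated ROC computation.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by trying every observed score as a threshold.

    A trial is accepted when its score is >= the threshold.
    Hypothetical helper for illustration only.
    """
    if not target_scores or not nontarget_scores:
        raise ValueError("need at least one target and one non-target score")

    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, best_eer = None, None
    for t in thresholds:
        # False rejection rate: target trials scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance rate: non-target trials scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        # Keep the operating point where the two error rates are closest.
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


if __name__ == "__main__":
    # Toy scores, not taken from the paper.
    targets = [0.9, 0.8, 0.7, 0.4]
    nontargets = [0.6, 0.3, 0.2, 0.1]
    print(f"EER = {100 * equal_error_rate(targets, nontargets):.1f}%")
```

With real systems the score lists would hold cosine or PLDA scores for every trial, and toolkits typically interpolate between thresholds instead of picking the closest observed one.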

Abstract

The development of robust, multilingual speaker recognition systems is hindered by a lack of large-scale, publicly available and multilingual datasets, particularly for the read-speech style crucial for applications like anti-spoofing. To address this gap, we introduce the TidyVoice dataset derived from the Mozilla Common Voice corpus after mitigating its inherent speaker heterogeneity within the provided client IDs. TidyVoice currently contains training and test data from over 212,000 monolingual speakers (Tidy-M) and around 4,500 multilingual speakers (Tidy-X) from which we derive two distinct conditions. The Tidy-M condition contains target and non-target trials from monolingual speakers across 81 languages. The Tidy-X condition contains target and non-target trials from multilingual speakers in both same- and cross-language trials. We employ two architectures of ResNet models,…
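
The trial categories named in the abstract (target vs. non-target, same-language vs. cross-language) can be illustrated with a small sketch. This is only an assumption about how such labels might be derived from speaker and language metadata; the `Utterance` class, the `make_trials` helper, and the exhaustive pairing are hypothetical and do not reproduce the TidyVoice trial lists or their sampling procedure.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Utterance:
    # Hypothetical metadata record; field names are illustrative.
    utt_id: str
    speaker: str
    language: str


def make_trials(utterances):
    """Pair every two utterances and label the resulting trial.

    Returns tuples of (enroll_id, test_id, speaker_label, language_label).
    """
    trials = []
    for a, b in combinations(utterances, 2):
        # Target trial: both utterances come from the same speaker.
        speaker_label = "target" if a.speaker == b.speaker else "non-target"
        # Cross-language trial: the two utterances are in different languages.
        language_label = (
            "same-language" if a.language == b.language else "cross-language"
        )
        trials.append((a.utt_id, b.utt_id, speaker_label, language_label))
    return trials


if __name__ == "__main__":
    utts = [
        Utterance("u1", "spk1", "en"),
        Utterance("u2", "spk1", "de"),
        Utterance("u3", "spk2", "en"),
    ]
    for trial in make_trials(utts):
        print(trial)
```

In a monolingual condition such as Tidy-M each speaker contributes a single language, so a target trial is necessarily same-language; a multilingual condition such as Tidy-X is where cross-language target trials become possible.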

Code & Models

Hugging Face model: areffarhadi/W2V-large-tidylang

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Authorship Attribution and Profiling