Libri2Vox Dataset: Target Speaker Extraction with Diverse Speaker Conditions and Synthetic Data
Yun Liu, Xuechen Liu, Xiaoxiao Miao, Junichi Yamagishi

TL;DR
This paper introduces Libri2Vox, a diverse and realistic dataset for target speaker extraction (TSE), and pairs it with synthetic data and curriculum learning to improve model robustness and performance in complex acoustic environments.
Contribution
The paper presents Libri2Vox, a novel dataset with real and synthetic speakers, and demonstrates how curriculum learning enhances TSE model performance using this data.
Findings
SpeakerBeam achieved a 1.39 dB SDR improvement on Libri2Talker.
Curriculum learning with the Conformer architecture further improved SDR by 0.78 dB.
Diverse real-world data and synthetic augmentation significantly enhance TSE robustness.
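The figures above are reported as SDR (signal-to-distortion ratio) improvements in dB. As a rough illustration only, the sketch below computes a plain, non-scale-invariant SDR and the improvement of an extracted signal over the unprocessed mixture. The function names `sdr_db` and `sdr_improvement_db` are hypothetical, and the exact SDR variant used in the paper is not stated in this summary and may differ.

```python
import math

def sdr_db(reference, estimate, eps=1e-12):
    """Plain SDR in dB: 10 * log10(||s||^2 / ||s - s_hat||^2).

    Assumption: this is the simple energy-ratio definition, not
    necessarily the metric variant used by the paper's authors.
    """
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    signal_power = sum(s * s for s in reference)
    distortion_power = sum((s - e) ** 2 for s, e in zip(reference, estimate))
    # eps keeps the ratio finite for silent or perfectly matched signals
    return 10.0 * math.log10((signal_power + eps) / (distortion_power + eps))

def sdr_improvement_db(reference, mixture, estimate):
    """SDR gain of the extracted estimate over the unprocessed mixture."""
    return sdr_db(reference, estimate) - sdr_db(reference, mixture)
```

Under this definition, an improvement of 1.39 dB would mean the extracted signal's SDR is 1.39 dB higher than that of the comparison signal, both measured against the clean target.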
Abstract
Target speaker extraction (TSE) is essential in speech processing applications, particularly in scenarios with complex acoustic environments. Current TSE systems face challenges in limited data diversity and a lack of robustness in real-world conditions, primarily because they are trained on artificially mixed datasets with limited speaker variability and unrealistic noise profiles. To address these challenges, we propose Libri2Vox, a new dataset that combines clean target speech from the LibriTTS dataset with interference speech from the noisy VoxCeleb2 dataset, providing a large and diverse set of speakers under realistic noisy conditions. We also augment Libri2Vox with synthetic speakers generated using state-of-the-art speech generative models to enhance speaker diversity. Additionally, to further improve the effectiveness of incorporating synthetic data, curriculum learning is…
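The abstract describes building mixtures from a clean target utterance and a noisy interfering utterance. A minimal sketch of that general idea follows, assuming the interferer is scaled to reach a chosen signal-to-interference ratio (SIR) and both signals are truncated to the shorter length. The function name `mix_at_sir`, the SIR parameter, and the truncation rule are all assumptions for illustration; the actual Libri2Vox mixing recipe is not given in this summary.

```python
import math

def mix_at_sir(target, interference, sir_db):
    """Mix two waveforms (lists of float samples) at a target SIR in dB.

    Hypothetical helper: returns (mixture, scaled_interference).
    """
    n = min(len(target), len(interference))
    if n == 0:
        raise ValueError("both signals must be non-empty")
    # Assumption: truncate to the shorter signal rather than pad or loop
    target = target[:n]
    interference = interference[:n]

    target_power = sum(x * x for x in target) / n
    interference_power = sum(x * x for x in interference) / n
    if target_power == 0.0 or interference_power == 0.0:
        raise ValueError("signals must have non-zero power")

    # Choose the gain so that target_power / scaled_power == 10^(sir_db / 10)
    gain = math.sqrt(target_power / (interference_power * 10.0 ** (sir_db / 10.0)))
    scaled = [gain * x for x in interference]
    mixture = [t + s for t, s in zip(target, scaled)]
    return mixture, scaled
```

For example, `mix_at_sir(target, interference, 0.0)` would give the target and the scaled interferer equal average power in the mixture.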
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
