Libri2Vox Dataset: Target Speaker Extraction with Diverse Speaker Conditions and Synthetic Data

Yun Liu; Xuechen Liu; Xiaoxiao Miao; Junichi Yamagishi

arXiv:2412.12512·cs.SD·December 18, 2024

Open Access

TL;DR

This paper introduces Libri2Vox, a diverse and realistic dataset for target speaker extraction, and pairs it with synthetic speaker data and curriculum learning to improve model robustness and performance in complex acoustic environments.

Contribution

The paper presents Libri2Vox, a novel dataset with real and synthetic speakers, and demonstrates how curriculum learning enhances TSE model performance using this data.

Findings

01

SpeakerBeam achieved a 1.39 dB SDR improvement on Libri2Talker.

02

Curriculum learning with the Conformer architecture further improved SDR by 0.78 dB.

03

Diverse real-world data and synthetic augmentation significantly enhance TSE robustness.
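
The findings above are reported as SDR gains in dB. As a rough illustration of what such a number measures, here is a minimal sketch of a plain signal-to-distortion ratio and the improvement of an extracted signal over the unprocessed mixture. This summary does not say which SDR variant the paper uses (for example BSS Eval SDR or SI-SDR), so the definition below and the function names `sdr_db` and `sdr_improvement_db` are assumptions made for illustration only.

```python
import numpy as np


def sdr_db(reference, estimate, eps=1e-12):
    """Plain SDR in dB: reference energy over error energy.

    Assumption: no scale or filter invariance, unlike SI-SDR or BSS Eval.
    """
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    signal_energy = np.sum(reference ** 2)
    error_energy = np.sum((reference - estimate) ** 2)
    # eps guards against division by zero and log of zero.
    return 10.0 * np.log10((signal_energy + eps) / (error_energy + eps))


def sdr_improvement_db(reference, mixture, estimate):
    """SDR of the extracted signal minus SDR of the raw mixture."""
    return sdr_db(reference, estimate) - sdr_db(reference, mixture)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = np.sin(np.linspace(0.0, 100.0, 16000))
    interference = rng.standard_normal(16000)
    mixture = clean + interference
    # Hypothetical extractor output that removes 90% of the interference amplitude.
    extracted = clean + 0.1 * interference
    print(f"SDR improvement: {sdr_improvement_db(clean, mixture, extracted):.2f} dB")
```

Under this definition, an improvement of 1.39 dB means the residual error energy relative to the clean target dropped by a factor of about 10^(0.139) ≈ 1.38 compared with the baseline being measured against.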

Abstract

Target speaker extraction (TSE) is essential in speech processing applications, particularly in scenarios with complex acoustic environments. Current TSE systems face challenges in limited data diversity and a lack of robustness in real-world conditions, primarily because they are trained on artificially mixed datasets with limited speaker variability and unrealistic noise profiles. To address these challenges, we propose Libri2Vox, a new dataset that combines clean target speech from the LibriTTS dataset with interference speech from the noisy VoxCeleb2 dataset, providing a large and diverse set of speakers under realistic noisy conditions. We also augment Libri2Vox with synthetic speakers generated using state-of-the-art speech generative models to enhance speaker diversity. Additionally, to further improve the effectiveness of incorporating synthetic data, curriculum learning is…
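
The abstract describes building mixtures from clean target speech and noisy interference speech. A minimal sketch of that kind of two-source mixing is given below, assuming the interferer is scaled to reach a chosen target-to-interferer power ratio. The abstract shown here does not state the mixing ratios or the exact procedure, so `mix_at_sir`, the `sir_db` parameter, and the truncate-to-shorter policy are hypothetical choices, not the authors' pipeline.

```python
import numpy as np


def mix_at_sir(target, interferer, sir_db):
    """Add an interferer to a target at a given signal-to-interference ratio.

    Assumptions (not from the paper): both inputs are mono waveforms at the
    same sample rate, and the longer one is truncated to the shorter length.
    Returns the mixture, the (possibly truncated) target, and the scaled interferer.
    """
    target = np.asarray(target, dtype=np.float64)
    interferer = np.asarray(interferer, dtype=np.float64)
    n = min(len(target), len(interferer))
    target, interferer = target[:n], interferer[:n]

    target_power = np.mean(target ** 2)
    interferer_power = np.mean(interferer ** 2)
    if target_power == 0.0 or interferer_power == 0.0:
        raise ValueError("both signals must have non-zero power")

    # Choose the gain so that target_power / (gain**2 * interferer_power)
    # equals 10 ** (sir_db / 10).
    gain = np.sqrt(target_power / (interferer_power * 10.0 ** (sir_db / 10.0)))
    scaled_interferer = gain * interferer
    return target + scaled_interferer, target, scaled_interferer


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Random noise stands in for a clean LibriTTS utterance and a noisy VoxCeleb2 utterance.
    target_wave = rng.standard_normal(16000)
    interferer_wave = 3.0 * rng.standard_normal(20000)
    mixture, tgt, interf = mix_at_sir(target_wave, interferer_wave, sir_db=5.0)
    achieved = 10.0 * np.log10(np.mean(tgt ** 2) / np.mean(interf ** 2))
    print(f"mixture length: {len(mixture)}, achieved SIR: {achieved:.2f} dB")
```

In a real pipeline the waveforms would be loaded from audio files and resampled to a common rate before mixing; those steps are left out here to keep the sketch self-contained.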

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing