Improving curriculum learning for target speaker extraction with synthetic speakers

Yun Liu; Xuechen Liu; Junichi Yamagishi

arXiv:2410.00811·cs.SD·October 8, 2024

TL;DR

This paper enhances target speaker extraction by integrating synthetic speakers generated via voice conversion into curriculum learning, leading to improved model performance in complex speech environments.

Contribution

It introduces a k-nearest neighbor-based voice conversion method to generate diverse interference speakers for curriculum learning in TSE.
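
As a rough illustration of the general idea behind k-nearest neighbor-based voice conversion, the sketch below replaces each frame of a source utterance with the mean of its k closest frames from a reference speaker. It is a minimal sketch under stated assumptions, not the paper's implementation: the function name `knn_convert`, the use of cosine similarity, the default `k=4`, and the assumption that frame-level features are already extracted as NumPy arrays are all illustrative choices that this page does not specify. Turning the converted features back into a waveform is out of scope here.

```python
import numpy as np

def knn_convert(source_feats, reference_feats, k=4):
    """Hypothetical helper: map source frames onto a reference speaker.

    source_feats:    array of shape (T_src, D), frame features of the source speech
    reference_feats: array of shape (T_ref, D), frame features of the reference speaker
    Returns an array of shape (T_src, D) in which every source frame is replaced
    by the mean of its k most similar reference frames (cosine similarity).
    """
    # L2-normalise rows so that a dot product equals cosine similarity.
    src = source_feats / np.linalg.norm(source_feats, axis=1, keepdims=True)
    ref = reference_feats / np.linalg.norm(reference_feats, axis=1, keepdims=True)

    # Similarity of every source frame to every reference frame: (T_src, T_ref).
    sim = src @ ref.T

    # Indices of the k most similar reference frames per source frame: (T_src, k).
    nearest = np.argsort(-sim, axis=1)[:, :k]

    # Gather the unnormalised reference frames and average them: (T_src, D).
    return reference_feats[nearest].mean(axis=1)
```

Under these assumptions, calling `knn_convert` with features from many different reference speakers would yield many different synthetic interference voices from the same source speech.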

Findings

01. Synthetic speaker data improves TSE performance

02. Curriculum learning with synthetic data enhances model robustness

03. Significant accuracy gains in complex speech scenarios

Abstract

Target speaker extraction (TSE) aims to isolate individual speaker voices from complex speech environments. The effectiveness of TSE systems is often compromised when the speaker characteristics are similar to each other. Recent research has introduced curriculum learning (CL), in which TSE models are trained incrementally on speech samples of increasing complexity. In CL training, the model is first trained on samples with low speaker similarity between the target and interference speakers, and then on samples with high speaker similarity. To further improve CL, this paper uses a $k$-nearest neighbor-based voice conversion method to simulate and generate speech of diverse interference speakers, and then uses the generated data as part of the CL. Experiments demonstrate that training data based on synthetic speakers can effectively enhance the model's capabilities and significantly…
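
The easy-to-hard ordering described in the abstract can be sketched as follows: score each (target, interferer) pair by the similarity of the two speakers, sort from least to most similar, and split the sorted list into training stages. This is a minimal sketch under assumptions, not the authors' pipeline: the names `build_curriculum` and `cosine_similarity`, the use of cosine similarity over fixed speaker embeddings, and the default of two stages are hypothetical choices made for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D speaker embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def build_curriculum(pairs, embeddings, num_stages=2):
    """Hypothetical helper: group speaker pairs into easy-to-hard stages.

    pairs:      list of (target_id, interferer_id) tuples
    embeddings: dict mapping speaker id -> 1-D NumPy embedding (assumed given)
    Returns a list of `num_stages` lists; earlier stages hold pairs with lower
    target/interferer similarity, later stages hold more similar pairs.
    """
    # Sort pairs from least similar (easy) to most similar (hard).
    scored = sorted(
        pairs,
        key=lambda p: cosine_similarity(embeddings[p[0]], embeddings[p[1]]),
    )
    # Split the sorted indices into roughly equal consecutive chunks.
    chunks = np.array_split(np.arange(len(scored)), num_stages)
    return [[scored[i] for i in chunk] for chunk in chunks]
```

In such a setup, a model would be trained on the mixtures from the first stage before moving on to later stages; synthetic interference speakers could be added to `pairs` and `embeddings` in the same way as real ones.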

Taxonomy

Topics: Speech Recognition and Synthesis