SoloSpeech: Enhancing Intelligibility and Quality in Target Speech Extraction through a Cascaded Generative Pipeline

Helin Wang; Jiarui Hai; Dongchao Yang; Chen Chen; Kai Li; Junyi Peng; Thomas Thebaud; Laureano Moro Velazquez; Jesus Villalba; Najim Dehak

arXiv:2505.19314 · eess.AS · September 9, 2025

TL;DR

SoloSpeech introduces a cascaded generative pipeline for target speech extraction that significantly improves intelligibility and quality, outperforming existing models and generalizing well to real-world data.

Contribution

SoloSpeech presents a generative approach built around a speaker-embedding-free target extractor, aimed at improving robustness and perceptual quality in target speech extraction.

Findings

01 Achieves state-of-the-art intelligibility and quality on Libri2Mix

02 Demonstrates strong generalization to out-of-domain data

03 Outperforms discriminative models in perceptual metrics

Abstract

Target Speech Extraction (TSE) aims to isolate a target speaker's voice from a mixture of multiple speakers by leveraging speaker-specific cues, typically provided as auxiliary audio (a.k.a. cue audio). Although recent advancements in TSE have primarily employed discriminative models that offer high perceptual quality, these models often introduce unwanted artifacts, reduce naturalness, and are sensitive to discrepancies between training and testing environments. On the other hand, generative models for TSE lag in perceptual quality and intelligibility. To address these challenges, we present SoloSpeech, a novel cascaded generative pipeline that integrates compression, extraction, reconstruction, and correction processes. SoloSpeech features a speaker-embedding-free target extractor that utilizes conditional information from the cue audio's latent space, aligning it with the mixture…
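The four processes named in the abstract (compression, extraction, reconstruction, correction) can be pictured as a simple chain of stages. The sketch below only illustrates that data flow and is not the authors' implementation: the class name `CascadedTSE`, the `toy_*` functions, and the choice to pass the mixture into the correction stage are all assumptions made for illustration, and each stage is a trivial stand-in for what would be a learned model.

```python
from dataclasses import dataclass
from typing import Callable, List

Wave = List[float]    # time-domain samples
Latent = List[float]  # compressed representation


@dataclass
class CascadedTSE:
    """Hypothetical wiring of the four stages named in the abstract."""
    compress: Callable[[Wave], Latent]
    extract: Callable[[Latent, Latent], Latent]
    reconstruct: Callable[[Latent], Wave]
    correct: Callable[[Wave, Wave], Wave]

    def __call__(self, mixture: Wave, cue: Wave) -> Wave:
        # Compression: mixture and cue audio go into the same latent space.
        mix_latent = self.compress(mixture)
        cue_latent = self.compress(cue)
        # Extraction: conditioned on the cue latent, no speaker embedding.
        target_latent = self.extract(mix_latent, cue_latent)
        # Reconstruction: latent back to a waveform.
        estimate = self.reconstruct(target_latent)
        # Correction: refine the estimate (using the mixture is an assumption).
        return self.correct(estimate, mixture)


# Toy stand-ins below; none of them is the paper's model.
def toy_compress(wave: Wave, hop: int = 2) -> Latent:
    # Average non-overlapping frames of `hop` samples.
    frames = [wave[i:i + hop] for i in range(0, len(wave), hop)]
    return [sum(f) / len(f) for f in frames]


def toy_extract(mix_latent: Latent, cue_latent: Latent) -> Latent:
    # Scale the mixture latent by the mean absolute cue level.
    gain = sum(abs(c) for c in cue_latent) / max(len(cue_latent), 1)
    return [gain * m for m in mix_latent]


def toy_reconstruct(latent: Latent, hop: int = 2) -> Wave:
    # Repeat each latent value `hop` times.
    return [v for v in latent for _ in range(hop)]


def toy_correct(estimate: Wave, mixture: Wave) -> Wave:
    # Trim to the mixture length and clip to the mixture's peak amplitude.
    peak = max((abs(x) for x in mixture), default=0.0)
    return [max(-peak, min(peak, x)) for x in estimate[:len(mixture)]]


pipeline = CascadedTSE(toy_compress, toy_extract, toy_reconstruct, toy_correct)
mixture = [0.2, 0.4, -0.6, 0.2]
cue = [0.5, 0.5]
output = pipeline(mixture, cue)
```

Keeping each stage behind its own callable means any one of them could be swapped out without touching the others, which is the practical appeal of a cascade.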

Code & Models

Repositories

wanghelin1997/solospeech (PyTorch, official)

Models

OpenSound/SoloSpeech-models (Hugging Face model)