Unmixing the Crowd: Learning Mixture-to-Set Speaker Embeddings for Enrollment-Free Target Speech Extraction

FNU Sidharth; Meysam Asgari; Hao-Wen Dong; Dhruv Jain

arXiv:2604.03219·eess.AS·April 6, 2026


TL;DR

This paper performs target speech extraction without an enrollment recording: it predicts a small set of per-speaker embeddings directly from the noisy mixture and uses them as the control signal for extraction.

Contribution

The paper proposes a model that maps a noisy mixture to a small set of candidate speaker embeddings, trained with permutation-invariant teacher supervision, so that target speech extraction no longer depends on enrollment audio.

Findings

01

Embeddings form a structured, clusterable identity space on noisy LibriMix.

02

Conditioning multiple extraction back-ends on these embeddings consistently improves objective quality and intelligibility.

03

The method generalizes to real DNS-Challenge recordings.

Abstract

Personalized or target speech extraction (TSE) typically needs a clean enrollment -- hard to obtain in real-world crowded environments. We remove the essential need for enrollment by predicting, from the mixture itself, a small set of per-speaker embeddings that serve as the control signal for extraction. Our model maps a noisy mixture directly to a small set of candidate speaker embeddings trained to align with a strong single-speaker speaker-embedding space via permutation-invariant teacher supervision. On noisy LibriMix, the resulting embeddings form a structured and clusterable identity space, outperforming WavLM+K-means and separation-derived embeddings in standard clustering metrics. Conditioning these embeddings into multiple extraction back-ends consistently improves objective quality and intelligibility, and generalizes to real DNS-Challenge recordings.
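The abstract's "permutation-invariant teacher supervision" can be illustrated with a minimal sketch. The abstract does not give the exact loss, so everything below is an assumption made for illustration: the function names `cosine_distance` and `pit_embedding_loss` are hypothetical, cosine distance stands in for whatever distance the authors use, the predicted and teacher sets are assumed to have the same size, and the brute-force search over assignments is only practical for a handful of speakers.

```python
import itertools
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def pit_embedding_loss(predicted, teacher):
    """Permutation-invariant loss between two equally sized embedding sets.

    Tries every assignment of predicted embeddings to teacher embeddings and
    returns (lowest mean cosine distance, best permutation), where
    best_permutation[i] is the teacher index matched to predicted[i].
    """
    if not predicted or len(predicted) != len(teacher):
        raise ValueError("sets must be non-empty and of equal size")
    best_loss, best_perm = math.inf, None
    for perm in itertools.permutations(range(len(teacher))):
        loss = sum(
            cosine_distance(predicted[i], teacher[j]) for i, j in enumerate(perm)
        ) / len(perm)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Toy example: the predictions are in the opposite order to the teacher set.
predicted = [[0.0, 1.0], [1.0, 0.0]]
teacher = [[2.0, 0.0], [0.0, 3.0]]
loss, perm = pit_embedding_loss(predicted, teacher)
print(loss, perm)
```

Because the loss is taken over the best assignment, the order in which the model emits its candidate embeddings does not matter; only the set has to match the teacher embeddings. For larger sets, an assignment solver such as the Hungarian algorithm would replace the exhaustive search.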
