Robust Target Speaker Diarization and Separation via Augmented Speaker Embedding Sampling

Md Asif Jalal; Luca Remaggi; Vasileios Moschopoulos; Thanasis Kotsiopoulos; Vandana Rajan; Karthikeyan Saravanan; Anastasis Drosou; Junho Heo; Hyuk Oh; Seokyeong Jeong

arXiv:2508.06393 · cs.SD · August 11, 2025

TL;DR

This paper presents an enrollment-free method for simultaneous speech separation and speaker diarization. It combines noise-robust speaker embeddings with an overlapping spectral loss and reports relative improvements of 71% in DER and 69% in cpWER over existing methods.

Contribution

The paper introduces a dual-stage training pipeline and an overlapping spectral loss function that improve diarization and separation without prior knowledge of the target speakers.

Findings

01. Achieves a 71% relative improvement in DER (diarization error rate)

02. Achieves a 69% relative improvement in cpWER (concatenated minimum-permutation word error rate)

03. Demonstrates robustness to background noise
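
To make the "relative improvement" figures above concrete, here is a minimal sketch of how such a number is usually computed for an error metric where lower is better. The function name and the baseline and proposed values are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def relative_improvement(baseline: float, proposed: float) -> float:
    """Fractional reduction of an error metric (lower is better)."""
    if baseline <= 0:
        raise ValueError("baseline error must be positive")
    return (baseline - proposed) / baseline


# Hypothetical DER values in percent, NOT results reported in the paper.
baseline_der = 20.0
proposed_der = 5.8

print(f"{relative_improvement(baseline_der, proposed_der):.0%}")  # prints 71%
```

Under this reading, a 71% relative improvement means the proposed system's error is 29% of the baseline's error, not that the error dropped by 71 absolute percentage points.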

Abstract

Traditional speech separation and speaker diarization approaches rely on prior knowledge of target speakers or a predetermined number of participants in audio signals. To address these limitations, recent advances focus on developing enrollment-free methods capable of identifying targets without explicit speaker labeling. This work introduces a new approach to train simultaneous speech separation and diarization using automatic identification of target speaker embeddings within mixtures. Our proposed model employs a dual-stage training pipeline designed to learn robust speaker representation features that are resilient to background noise interference. Furthermore, we present an overlapping spectral loss function specifically tailored for enhancing diarization accuracy during overlapped speech frames. Experimental results show significant performance gains compared to the current SOTA…

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing