Universal Speaker Embedding Free Target Speaker Extraction and Personal Voice Activity Detection

Bang Zeng; Ming Li

arXiv:2501.03612 · eess.AS · May 20, 2025

TL;DR

This paper introduces USEF-TP, a novel model that jointly performs target speaker extraction and personal voice activity detection without relying on speaker embeddings, improving robustness across diverse scenarios.

Contribution

The paper proposes a universal, embedding-free approach to TSE and PVAD built on cross-attention and multi-task learning. It targets the output inconsistency and scenario mismatch that arise when separate speaker diarization and TSE systems are combined.

Findings

01

Achieves superior TSE and PVAD performance on LibriMix and SparseLibriMix datasets.

02

Demonstrates competitive results on real-world CALLHOME recordings.

03

Outperforms existing methods in handling overlapping speakers.

Abstract

Determining 'who spoke what and when' remains challenging in real-world applications. Typically, Speaker Diarization (SD) addresses 'who spoke when,' while Target Speaker Extraction (TSE) or Target Speaker Automatic Speech Recognition (TSASR) addresses 'who spoke what.' Although some works have achieved promising results by combining SD and TSE systems, two problems remain between the two stages: inconsistent outputs and mismatched scenarios. To address these limitations, we propose a Universal Speaker Embedding Free Target Speaker Extraction and Personal Voice Activity Detection (USEF-TP) model that jointly performs TSE and Personal Voice Activity Detection (PVAD). USEF-TP leverages frame-level features obtained through a cross-attention mechanism as speaker-related features instead of using speaker…
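The abstract's central idea, frame-level speaker-related features obtained through cross-attention, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes, hypothetically, that encoded mixture frames act as queries and encoded enrollment frames act as keys and values. It also omits the learned projections, multiple heads, and the encoders a real model would use. The names `cross_attention`, `mixture`, and `enrollment` are illustrative only.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(mixture, enrollment):
    """Plain scaled dot-product cross-attention (no learned weights).

    mixture:    T_mix frames of dimension d, used as queries (assumption)
    enrollment: T_enr frames of dimension d, used as keys and values (assumption)
    Returns one d-dimensional feature per mixture frame.
    """
    d = len(mixture[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in mixture:
        # Similarity of this mixture frame to every enrollment frame.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in enrollment]
        weights = softmax(scores)
        # Weighted average of enrollment frames: a frame-level feature.
        frame = [sum(w * v[j] for w, v in zip(weights, enrollment)) for j in range(d)]
        out.append(frame)
    return out


# Toy inputs: 6 mixture frames and 3 enrollment frames, each of dimension 4.
random.seed(0)
mixture = [[random.gauss(0, 1) for _ in range(4)] for _ in range(6)]
enrollment = [[random.gauss(0, 1) for _ in range(4)] for _ in range(3)]

features = cross_attention(mixture, enrollment)
print(len(features), len(features[0]))  # 6 4
```

The output has one feature vector per mixture frame, regardless of the enrollment length. That per-frame alignment is what separates this kind of conditioning from a single fixed-size speaker embedding.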

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing