Inter-Speaker Relative Cues for Text-Guided Target Speech Extraction

Wang Dai; Archontis Politis; Tuomas Virtanen

arXiv:2506.01483·eess.AS·June 10, 2025

TL;DR

This paper introduces a method that uses inter-speaker relative cues, both continuous and discrete, for text-guided target speech extraction. The cues are more flexible than fixed speech attribute classification, and combining all of them performs better than using random subsets.

Contribution

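The paper presents an approach that uses inter-speaker relative cues for target speech extraction. Compared to fixed speech attribute classification, these cues are more flexible and make text-guided extraction datasets much easier to expand, and the gender and temporal order cues remain robust across languages and reverberant conditions.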

Findings

01

Combining all relative cues outperforms using random subsets of them.

02

Gender and temporal order cues are the most robust across languages and reverberant conditions.

03

Additional cues (pitch level, loudness, distance, speaking duration, language, and pitch range) show notable benefits in complex scenarios.

Abstract

We propose a novel approach that utilizes inter-speaker relative cues to distinguish target speakers and extract their voices from mixtures. Continuous cues (e.g., temporal order, age, pitch level) are grouped by relative differences, while discrete cues (e.g., language, gender, emotion) retain their categorical distinctions. Compared to fixed speech attribute classification, inter-speaker relative cues offer greater flexibility, facilitating much easier expansion of text-guided target speech extraction datasets. Our experiments show that combining all relative cues yields better performance than random subsets, with gender and temporal order being the most robust across languages and reverberant conditions. Additional cues, such as pitch level, loudness, distance, speaking duration, language, and pitch range, also demonstrate notable benefits in complex scenarios. Fine-tuning…
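The distinction between continuous and discrete cues in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the attribute names, the margin values, the label wording, and the function `relative_cues` are all hypothetical, chosen only to show how a continuous attribute might be reduced to a relative label while a discrete attribute keeps its category.

```python
# Hypothetical sketch: derive relative cues for a target speaker against one
# other speaker in the mixture. Attribute names, margins, and labels are
# assumptions made for illustration, not values taken from the paper.

# Continuous attributes: (minimum difference, label if target is larger,
# label if target is smaller).
CONTINUOUS = {
    "pitch_hz": (20.0, "higher-pitched", "lower-pitched"),
    "onset_s": (0.5, "later", "earlier"),
}

# Discrete attributes keep their categorical value.
DISCRETE = ("gender", "language")


def relative_cues(target: dict, other: dict) -> dict:
    """Return only the cues that actually distinguish target from other."""
    cues = {}
    for name, (margin, larger, smaller) in CONTINUOUS.items():
        diff = target[name] - other[name]
        if abs(diff) < margin:
            continue  # too similar to serve as a distinguishing cue
        cues[name] = larger if diff > 0 else smaller
    for name in DISCRETE:
        if target[name] != other[name]:
            cues[name] = target[name]  # category is kept as-is
    return cues


target = {"pitch_hz": 210.0, "onset_s": 0.0, "gender": "female", "language": "English"}
other = {"pitch_hz": 120.0, "onset_s": 1.5, "gender": "male", "language": "English"}
print(relative_cues(target, other))
```

In this example both speakers use the same language, so that cue is dropped, while pitch, temporal order, and gender remain available to describe the target in a text prompt.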

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Emotion and Mood Recognition