Inter-Speaker Relative Cues for Text-Guided Target Speech Extraction
Wang Dai, Archontis Politis, Tuomas Virtanen

TL;DR
This paper uses inter-speaker relative cues, both continuous and discrete, to guide text-based target speech extraction, offering greater flexibility than fixed speech attribute classification.
Contribution
The paper describes target speakers by how they differ from the other speakers in a mixture rather than by fixed attribute classes, which makes text-guided target speech extraction datasets much easier to expand.
Findings
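Combining all relative cues outperforms using random subsets of them.
Gender and temporal order are the most robust cues across languages and reverberant conditions.
Additional cues such as pitch level, loudness, distance, speaking duration, language, and pitch range show notable benefits in complex scenarios.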
Abstract
We propose a novel approach that utilizes inter-speaker relative cues to distinguish target speakers and extract their voices from mixtures. Continuous cues (e.g., temporal order, age, pitch level) are grouped by relative differences, while discrete cues (e.g., language, gender, emotion) retain their categorical distinctions. Compared to fixed speech attribute classification, inter-speaker relative cues offer greater flexibility, facilitating much easier expansion of text-guided target speech extraction datasets. Our experiments show that combining all relative cues yields better performance than random subsets, with gender and temporal order being the most robust across languages and reverberant conditions. Additional cues, such as pitch level, loudness, distance, speaking duration, language, and pitch range, also demonstrate notable benefits in complex scenarios. Fine-tuning…
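The distinction between continuous and discrete cues in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the Speaker fields, the phrase templates, and the per-cue thresholds below are all assumptions made for illustration. The idea shown is only that a continuous cue is turned into a relative statement ("lower" or "higher" than the other speaker), while a discrete cue keeps its category and is used only when the two speakers differ.

```python
from dataclasses import dataclass


@dataclass
class Speaker:
    # Hypothetical per-speaker metadata; field names are illustrative only.
    onset_s: float   # start time in the mixture, seconds (continuous)
    pitch_hz: float  # mean fundamental frequency, Hz (continuous)
    gender: str      # discrete
    language: str    # discrete


# Continuous cues: (assumed minimum difference, phrase if lower, phrase if higher).
CONTINUOUS = {
    "onset_s": (0.5, "who speaks first", "who speaks second"),
    "pitch_hz": (20.0, "with the lower pitch", "with the higher pitch"),
}

# Discrete cues keep their categorical value in the description.
DISCRETE = {
    "gender": "the {} speaker",
    "language": "the speaker talking in {}",
}


def relative_cues(target: Speaker, other: Speaker) -> list[str]:
    """Return text cues that distinguish `target` from `other`."""
    cues = []
    for name, (min_diff, lower, higher) in CONTINUOUS.items():
        t, o = getattr(target, name), getattr(other, name)
        # Skip cues whose difference is too small to tell the speakers apart.
        if abs(t - o) < min_diff:
            continue
        cues.append(f"the speaker {lower if t < o else higher}")
    for name, template in DISCRETE.items():
        t, o = getattr(target, name), getattr(other, name)
        # A discrete cue is only informative when the categories differ.
        if t != o:
            cues.append(template.format(t))
    return cues


a = Speaker(onset_s=0.0, pitch_hz=210.0, gender="female", language="English")
b = Speaker(onset_s=1.5, pitch_hz=120.0, gender="male", language="English")
print(relative_cues(a, b))
# ['the speaker who speaks first', 'the speaker with the higher pitch', 'the female speaker']
```

Because both example speakers use the same language, no language cue is produced. Adding a new cue in this sketch only requires a new dictionary entry, which loosely mirrors the flexibility argument made in the abstract.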
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Emotion and Mood Recognition
