Inter-Speaker Relative Cues for Two-Stage Text-Guided Target Speech Extraction
Wang Dai, Archontis Politis, Tuomas Virtanen

TL;DR
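This paper introduces a two-stage text-guided target speech extraction (TSE) framework built on relative cues: a speech separation model generates candidate sources, and a text-guided classifier selects the target speaker. Experiments show that relative cues improve classification accuracy and TSE performance over independent cues, and that the two-stage design outperforms single-stage text-conditioned extraction.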
Contribution
The paper proposes a two-stage TSE approach that leverages inter-speaker relative cues, shows their advantage over independent cues for continuous-valued attributes, and surpasses single-stage methods.
Findings
Relative cues improve classification accuracy and TSE performance.
Two-stage framework outperforms single-stage text-conditioned extraction.
Several relative cues can surpass enrollment-audio-based TSE systems.
Abstract
This paper investigates the use of relative cues for text-based target speech extraction (TSE). We first provide a theoretical justification for relative cues from the perspectives of human perception and label quantization, showing that relative cues preserve fine-grained distinctions that are often lost in absolute categorical representations of continuous-valued attributes. Building on this analysis, we propose a two-stage TSE framework in which a speech separation model first generates candidate sources, followed by a text-guided classifier that selects the target speaker based on embedding similarity. Within this framework, we train two separate classification models to evaluate the advantages of relative cues over independent cues in the case of continuous-valued attributes, considering both classification accuracy and TSE performance. Experimental results demonstrate that (i)…
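Two ideas in the abstract can be illustrated with a small sketch: the label-quantization argument for relative cues, and the second-stage selection of a candidate source by embedding similarity. The code below is only a toy under stated assumptions, not the authors' implementation. The pitch threshold, the function names, the three-dimensional embeddings, and the choice of cosine similarity are all hypothetical; the abstract specifies only "embedding similarity" and does not describe the separation model or the classifier.

```python
import math


def absolute_pitch_label(f0_hz, threshold_hz=165.0):
    """Quantize pitch into a coarse absolute category (threshold is hypothetical)."""
    return "high" if f0_hz >= threshold_hz else "low"


def relative_pitch_label(f0_hz, other_f0_hz):
    """Describe pitch relative to the other speaker in the mixture."""
    return "higher" if f0_hz > other_f0_hz else "lower"


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def select_target(text_embedding, candidate_embeddings):
    """Stage two: pick the candidate whose embedding best matches the text cue.

    Stage one (speech separation) is assumed to have already produced
    one embedding per candidate source.
    """
    scores = [cosine_similarity(text_embedding, c) for c in candidate_embeddings]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index, scores


# Label quantization: two speakers with different pitch, same absolute label.
f0_a, f0_b = 180.0, 210.0
absolute_labels = (absolute_pitch_label(f0_a), absolute_pitch_label(f0_b))
relative_labels = (relative_pitch_label(f0_a, f0_b), relative_pitch_label(f0_b, f0_a))

# Selection: toy embeddings standing in for separated sources and a text cue.
candidates = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1]]
text_query = [0.1, 0.9, 0.0]
target_index, similarity_scores = select_target(text_query, candidates)
```

In this toy, both speakers receive the absolute label "high", so an absolute cue cannot tell them apart, while the relative labels "lower" and "higher" still separate them. The selection step then reduces to an argmax over similarity scores, which is one simple way to realize a choice among separated candidates.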
