Typing to Listen at the Cocktail Party: Text-Guided Target Speaker Extraction
Xiang Hao, Jibin Wu, Jianwei Yu, Chenglin Xu, Kay Chen Tan

TL;DR
This paper introduces LLM-TSE, a text-guided target speaker extraction method built on a large language model (LLaMA 2). By using typed textual cues instead of voiceprints, it aims to improve privacy, flexibility, and robustness. It achieves competitive results with text cues alone and new state-of-the-art results when text is combined with pre-registered cues.
Contribution
The paper presents what the authors describe as the first integration of large language models with target speaker extraction, enabling text-based cues for privacy-preserving and flexible speaker extraction.
Findings
Competitive performance with text cues alone
Effective use of text as a task selector
Achieves new state-of-the-art when combined with pre-registered cues
Abstract
Humans can easily isolate a single speaker from a complex acoustic environment, a capability referred to as the "Cocktail Party Effect." However, replicating this ability has been a significant challenge in the field of target speaker extraction (TSE). Traditional TSE approaches predominantly rely on voiceprints, which raise privacy concerns and face issues related to the quality and availability of enrollment samples, as well as intra-speaker variability. To address these issues, this work introduces a novel text-guided TSE paradigm named LLM-TSE. In this paradigm, a state-of-the-art large language model, LLaMA 2, processes typed text input from users to extract semantic cues. We demonstrate that textual descriptions alone can effectively serve as cues for extraction, thus addressing privacy concerns and reducing dependency on voiceprints. Furthermore, our approach offers flexibility…
Taxonomy
Topics: Speech and dialogue systems · Speech Recognition and Synthesis · Topic Modeling
Methods: LLaMA · Focus
