Typing to Listen at the Cocktail Party: Text-Guided Target Speaker Extraction

Xiang Hao; Jibin Wu; Jianwei Yu; Chenglin Xu; Kay Chen Tan

arXiv:2310.07284 · eess.AS · October 8, 2024 · 2 cites

TL;DR

This paper introduces LLM-TSE, a text-guided target speaker extraction method in which a large language model (LLaMA 2) turns typed descriptions into semantic cues. Using text instead of, or alongside, voiceprints addresses privacy concerns and improves flexibility and robustness. Text cues alone give competitive results, and combining them with pre-registered cues sets a new state of the art.

Contribution

It presents the first integration of large language models with target speaker extraction, enabling text-based cues for privacy-preserving and flexible speaker extraction.

Findings

01. Competitive performance with text cues alone
02. Effective use of text as a task selector
03. New state-of-the-art results when combined with pre-registered cues

Abstract

Humans can easily isolate a single speaker from a complex acoustic environment, a capability referred to as the "Cocktail Party Effect." However, replicating this ability has been a significant challenge in the field of target speaker extraction (TSE). Traditional TSE approaches predominantly rely on voiceprints, which raise privacy concerns and face issues related to the quality and availability of enrollment samples, as well as intra-speaker variability. To address these issues, this work introduces a novel text-guided TSE paradigm named LLM-TSE. In this paradigm, a state-of-the-art large language model, LLaMA 2, processes typed text input from users to extract semantic cues. We demonstrate that textual descriptions alone can effectively serve as cues for extraction, thus addressing privacy concerns and reducing dependency on voiceprints. Furthermore, our approach offers flexibility…

Code & Models

Repositories

haoxiangsnr/llm-tse · Official

Taxonomy

Topics: Speech and dialogue systems · Speech Recognition and Synthesis · Topic Modeling

Methods: LLaMA · Focus