Beyond Speaker Identity: Text Guided Target Speech Extraction
Mingyue Huo, Abhinav Jain, Cong Phuoc Huynh, Fanjie Kong, Pichao Wang, Zhu Liu, Vimal Bhat

TL;DR
This paper introduces StyleTSE, a novel text-guided target speech extraction model that leverages natural language descriptions of speaking style, enabling speech separation without explicit speaker identity clues.
Contribution
The paper presents StyleTSE, the first model to incorporate natural language descriptions for target speech extraction, along with a new dataset TextrolMix for training and evaluation.
Findings
Effective speech separation based on speaking style and content.
Outperforms traditional methods relying solely on audio clues.
Demonstrates robustness in scenarios lacking explicit speaker identity information.
Abstract
Target Speech Extraction (TSE) traditionally relies on explicit clues about the speaker's identity like enrollment audio, face images, or videos, which may not always be available. In this paper, we propose a text-guided TSE model StyleTSE that uses natural language descriptions of speaking style in addition to the audio clue to extract the desired speech from a given mixture. Our model integrates a speech separation network adapted from SepFormer with a bi-modality clue network that flexibly processes both audio and text clues. To train and evaluate our model, we introduce a new dataset TextrolMix with speech mixtures and natural language descriptions. Experimental results demonstrate that our method effectively separates speech based not only on who is speaking, but also on how they are speaking, enhancing TSE in scenarios where traditional audio clues are absent. Demos are at:…
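The abstract describes a clue network that accepts an audio clue, a text clue, or both, and uses the result to steer a separation network. As a rough illustration of that idea only, the NumPy sketch below fuses whichever clue embeddings are present and uses the fused vector to condition a mask over mixture features. It is not the StyleTSE architecture: the function names `fuse_clues` and `extract_with_clue`, the averaging rule, the projection `w`, and the sigmoid mask are all assumptions made for this example, and the embeddings are random stand-ins rather than outputs of real audio or text encoders.

```python
import numpy as np

def fuse_clues(audio_emb=None, text_emb=None):
    """Average whichever clue embeddings are present.

    Hypothetical fusion rule for illustration; the paper's clue network
    is not specified here.
    """
    clues = [c for c in (audio_emb, text_emb) if c is not None]
    if not clues:
        raise ValueError("at least one clue (audio or text) is required")
    return np.mean(clues, axis=0)

def extract_with_clue(mixture_feats, clue, w):
    """Toy clue-conditioned masking.

    mixture_feats: (T, D) features of the mixture
    clue:          (D,)   fused clue embedding
    w:             (D, D) projection (assumed; would be learned in practice)
    """
    scores = (mixture_feats @ w) * clue      # condition every frame on the clue
    mask = 1.0 / (1.0 + np.exp(-scores))     # sigmoid mask with values in (0, 1)
    return mask * mixture_feats              # masked estimate of the target

rng = np.random.default_rng(0)
T, D = 50, 16
mixture = rng.standard_normal((T, D))
text_clue = rng.standard_normal(D)           # stand-in for an encoded style description
w = 0.1 * rng.standard_normal((D, D))

# Text-only extraction: no enrollment audio is supplied.
estimate = extract_with_clue(mixture, fuse_clues(text_emb=text_clue), w)
```

Because `fuse_clues` skips missing inputs, the same call path covers the audio-only, text-only, and combined cases, which mirrors the flexibility the abstract attributes to the clue network.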
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Attention Is All You Need · Layer Normalization · Dense Connections · Residual Connection · Softmax · Parameterized ReLU · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer
