Beyond Speaker Identity: Text Guided Target Speech Extraction

Mingyue Huo; Abhinav Jain; Cong Phuoc Huynh; Fanjie Kong; Pichao Wang; Zhu Liu; Vimal Bhat

arXiv:2501.09169·eess.AS·January 17, 2025

TL;DR

This paper introduces StyleTSE, a novel text-guided target speech extraction model that leverages natural language descriptions of speaking style, enabling speech separation without explicit speaker identity clues.

Contribution

The paper presents StyleTSE, the first model to incorporate natural language descriptions for target speech extraction, along with a new dataset TextrolMix for training and evaluation.

Findings

1. Effective speech separation based on speaking style and content.
2. Outperforms traditional methods relying solely on audio clues.
3. Demonstrates robustness in scenarios lacking explicit speaker identity information.

Abstract

Target Speech Extraction (TSE) traditionally relies on explicit clues about the speaker's identity like enrollment audio, face images, or videos, which may not always be available. In this paper, we propose a text-guided TSE model StyleTSE that uses natural language descriptions of speaking style in addition to the audio clue to extract the desired speech from a given mixture. Our model integrates a speech separation network adapted from SepFormer with a bi-modality clue network that flexibly processes both audio and text clues. To train and evaluate our model, we introduce a new dataset TextrolMix with speech mixtures and natural language descriptions. Experimental results demonstrate that our method effectively separates speech based not only on who is speaking, but also on how they are speaking, enhancing TSE in scenarios where traditional audio clues are absent. Demos are at:…
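
The idea of conditioning extraction on an audio clue, a text clue, or both can be illustrated with a toy sketch. This is not the StyleTSE architecture: the paper uses a separation network adapted from SepFormer and a learned bi-modality clue network, neither of which is reproduced here. The function names `fuse_clues` and `extract`, the averaging fusion rule, the multiplicative conditioning, and the random embeddings are all assumptions made purely for illustration.

```python
import numpy as np


def fuse_clues(audio_emb=None, text_emb=None):
    """Combine whichever clue embeddings are present into one vector.

    Averaging is a hypothetical fusion rule chosen for simplicity; it is
    not taken from the paper.
    """
    clues = [c for c in (audio_emb, text_emb) if c is not None]
    if not clues:
        raise ValueError("at least one clue (audio or text) is required")
    return np.mean(clues, axis=0)


def extract(mixture_feats, clue, w):
    """Apply a clue-conditioned sigmoid mask to mixture features.

    mixture_feats: (frames, dim), clue: (dim,), w: (dim, dim).
    The single linear layer `w` stands in for a real separation network.
    """
    conditioned = mixture_feats * clue  # clue broadcasts over frames
    mask = 1.0 / (1.0 + np.exp(-(conditioned @ w)))  # values in (0, 1)
    return mask * mixture_feats


# Random stand-ins for encoder outputs (assumed shapes, not real features).
rng = np.random.default_rng(0)
dim, frames = 8, 5
mixture = rng.standard_normal((frames, dim))
audio_clue = rng.standard_normal(dim)
text_clue = rng.standard_normal(dim)
w = rng.standard_normal((dim, dim))

# Both clues available, or the text description alone.
both = extract(mixture, fuse_clues(audio_clue, text_clue), w)
text_only = extract(mixture, fuse_clues(text_emb=text_clue), w)
```

Because `fuse_clues` accepts either clue on its own, the same extraction path runs when the audio clue is missing, which loosely mirrors the scenario the abstract describes where traditional audio clues are absent.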

Code & Models

Repositories

mingyue66/textrolmix (official)

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Attention Is All You Need · Layer Normalization · Dense Connections · Residual Connection · Softmax · Parameterized ReLU · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer