TSELM: Target Speaker Extraction using Discrete Tokens and Language Models

Beilong Tang; Bang Zeng; Ming Li

arXiv:2409.07841 · cs.SD · September 18, 2024



TL;DR

TSELM is a target speaker extraction approach that combines discrete WavLM tokens, language models, and a HiFi-GAN vocoder. It reports excellent speech quality and comparable speech intelligibility.

Contribution

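TSELM takes discretized features from multiple WavLM layers as input tokens, integrates target speaker information through cross-attention, and uses language models to capture sequence dependencies. Training with a cross-entropy loss over output tokens turns audio generation from a regression problem into a classification task.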
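To illustrate what "discretized features" means here, the following is a minimal sketch of turning continuous feature frames into token indices by nearest-centroid lookup. The codebook, the two-dimensional frames, and the `quantize` helper are all hypothetical toy values chosen for illustration; this listing does not say how the TSELM codebooks are built, so a nearest-centroid scheme (as produced by a clustering method such as k-means) is an assumption.

```python
import math

def quantize(frames, codebook):
    """Map each feature frame to the index of its nearest centroid.

    Hypothetical helper: assumes a fixed codebook of centroids and
    Euclidean distance; not taken from the TSELM code base.
    """
    tokens = []
    for frame in frames:
        distances = [math.dist(frame, centroid) for centroid in codebook]
        tokens.append(distances.index(min(distances)))
    return tokens

# Toy codebook with three centroids and four 2-D "feature frames".
codebook = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]
frames = [(0.1, -0.2), (0.9, 1.2), (3.5, 0.4), (0.6, 0.7)]

tokens = quantize(frames, codebook)
print(tokens)  # [0, 1, 2, 1]
```

Each frame becomes a single integer, so a sequence of frames becomes a token sequence that a language model can consume. Real self-supervised features have hundreds of dimensions and far larger codebooks than this toy setup.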

Findings

01 Achieves high speech quality in extraction

02 Provides comparable speech intelligibility results

03 Transforms audio regression into a classification problem

Abstract

We propose TSELM, a novel target speaker extraction network that leverages discrete tokens and language models. TSELM utilizes multiple discretized layers from WavLM as input tokens and incorporates cross-attention mechanisms to integrate target speaker information. Language models are employed to capture the sequence dependencies, while a scalable HiFi-GAN is used to reconstruct the audio from the tokens. By applying a cross-entropy loss, TSELM models the probability distribution of output tokens, thus converting the complex regression problem of audio generation into a classification task. Experimental results show that TSELM achieves excellent results in speech quality and comparable results in speech intelligibility.
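The regression-to-classification idea in the abstract can be sketched as follows: instead of predicting waveform samples or spectrogram values directly, the model outputs one score per token in a vocabulary for each frame, and a cross-entropy loss compares those scores with the target token indices. The function names, the three-token vocabulary, and the logits below are hypothetical and are not the authors' implementation.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one frame's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def token_cross_entropy(logits_per_frame, target_tokens):
    """Mean negative log-likelihood of the target token at each frame."""
    total = 0.0
    for logits, target in zip(logits_per_frame, target_tokens):
        total -= log_softmax(logits)[target]
    return total / len(target_tokens)

def greedy_decode(logits_per_frame):
    """Pick the highest-scoring token index at each frame."""
    return [max(range(len(l)), key=l.__getitem__) for l in logits_per_frame]

# Toy example: 3 frames, vocabulary of 3 tokens (assumed values).
logits = [[2.0, 0.5, -1.0], [0.0, 3.0, 0.0], [0.1, 0.2, 2.5]]
targets = [0, 1, 2]

loss = token_cross_entropy(logits, targets)
predicted = greedy_decode(logits)
print(predicted)  # [0, 1, 2]
```

In a full system the predicted token sequence would then be passed to a vocoder (HiFi-GAN in the abstract) to reconstruct audio; that step is outside this sketch.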


Code & Models

Repositories

Beilong-Tang/TSELM (PyTorch, official)


Taxonomy

Topics: Speech Recognition and Synthesis

Methods: HiFi-GAN