Spectron: Target Speaker Extraction using Conditional Transformer with Adversarial Refinement

Tathagata Bandyopadhyay

arXiv:2409.01352 · cs.SD · September 4, 2024

TL;DR

Spectron is a transformer-based model for target speaker extraction from mixed audio, employing adversarial refinement and novel training objectives to improve speech quality and outperform existing methods.

Contribution

Introduces a transformer-based target speaker extraction model with adversarial refinement and new training objectives for speaker embedding consistency and waveform invertibility.

Findings

01. Improves speech extraction quality by 3.12 dB over CNN baseline.

02. Outperforms recent state-of-the-art methods by 4.1 dB on average.

03. Leverages multi-scale discriminator for perceptual quality enhancement.
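
To put the reported gains in perspective, a decibel difference can be converted to a linear power ratio. The sketch below assumes the 3.12 dB and 4.1 dB figures are differences in a ratio-type quality metric (for example a signal-to-distortion-style measure); this summary does not name the metric, so treat that as an assumption. The function name `db_to_power_ratio` is illustrative and not taken from the paper or its repository.

```python
def db_to_power_ratio(gain_db: float) -> float:
    """Convert a gain in decibels to the equivalent linear power ratio.

    Uses the standard definition dB = 10 * log10(ratio), so
    ratio = 10 ** (dB / 10).
    """
    return 10.0 ** (gain_db / 10.0)


if __name__ == "__main__":
    # Gains quoted in the findings above; the underlying metric is assumed
    # to be a power-ratio measure expressed in dB.
    for gain in (3.12, 4.1):
        print(f"{gain} dB -> power ratio of about {db_to_power_ratio(gain):.2f}")
```

Under that assumption, a gain of roughly 3 dB corresponds to about a doubling of the signal-to-distortion power ratio.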

Abstract

Recently, attention-based transformers have become a de facto standard in many deep learning applications, including natural language processing, computer vision, and signal processing. In this paper, we propose a transformer-based end-to-end model to extract a target speaker's speech from a monaural multi-speaker mixed audio signal. Unlike existing speaker extraction methods, we introduce two additional objectives to impose speaker embedding consistency and waveform encoder invertibility, and we jointly train both the speaker encoder and the speech separator to better capture the speaker conditional embedding. Furthermore, we leverage a multi-scale discriminator to refine the perceptual quality of the extracted speech. Our experiments show that the use of a dual-path transformer in the separator backbone, along with the proposed training paradigm, improves on the CNN baseline by 3.12 dB. Finally,…
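
The two auxiliary objectives named in the abstract can be illustrated with a small sketch. The abstract does not give their exact form, so everything here is an assumption made for illustration: speaker embedding consistency is modeled as a cosine distance between the reference-speaker embedding and the embedding of the extracted speech, encoder invertibility as an L1 reconstruction error between a waveform and its encode-then-decode reconstruction, and the weights `w_emb` and `w_inv` are hypothetical. The adversarial (multi-scale discriminator) term is omitted. Plain Python lists stand in for tensors; none of these function names come from the official repository.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)  # epsilon avoids division by zero


def embedding_consistency_loss(ref_emb: Sequence[float],
                               est_emb: Sequence[float]) -> float:
    """Assumed form: cosine distance between the reference-speaker embedding
    and the embedding computed from the extracted speech."""
    return 1.0 - cosine_similarity(ref_emb, est_emb)


def invertibility_loss(waveform: Sequence[float],
                       reconstructed: Sequence[float]) -> float:
    """Assumed form: mean absolute error between a waveform and the output
    of decoding its encoded representation."""
    return sum(abs(x - y) for x, y in zip(waveform, reconstructed)) / len(waveform)


def total_loss(extraction_loss: float,
               ref_emb: Sequence[float], est_emb: Sequence[float],
               waveform: Sequence[float], reconstructed: Sequence[float],
               w_emb: float = 1.0, w_inv: float = 1.0) -> float:
    """Weighted sum of the main extraction loss and the two auxiliary terms.
    The weights are hypothetical, not values from the paper."""
    return (extraction_loss
            + w_emb * embedding_consistency_loss(ref_emb, est_emb)
            + w_inv * invertibility_loss(waveform, reconstructed))
```

In a joint training setup like the one the abstract describes, terms of this kind would be added to the main extraction objective so that gradients reach both the speaker encoder and the separator; the actual formulation should be checked against the paper and the linked repository.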

Code & Models

Repositories

tatban/Spectron
PyTorch · Official

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing