Spectron: Target Speaker Extraction using Conditional Transformer with Adversarial Refinement
Tathagata Bandyopadhyay

TL;DR
Spectron is a transformer-based model for target speaker extraction from mixed audio, employing adversarial refinement and novel training objectives to improve speech quality and outperform existing methods.
Contribution
Introduces a transformer-based target speaker extraction model with adversarial refinement and new training objectives for speaker embedding consistency and waveform invertibility.
Findings
Improves speech extraction quality by 3.12 dB over CNN baseline.
Outperforms recent state-of-the-art methods by 4.1 dB on average.
Leverages multi-scale discriminator for perceptual quality enhancement.
Abstract
Recently, attention-based transformers have become a de facto standard in many deep learning applications, including natural language processing, computer vision, and signal processing. In this paper, we propose a transformer-based end-to-end model to extract a target speaker's speech from a monaural multi-speaker mixed audio signal. Unlike existing speaker extraction methods, we introduce two additional objectives to impose speaker embedding consistency and waveform encoder invertibility, and we jointly train both the speaker encoder and the speech separator to better capture the speaker conditional embedding. Furthermore, we leverage a multi-scale discriminator to refine the perceptual quality of the extracted speech. Our experiments show that the use of a dual-path transformer in the separator backbone, along with the proposed training paradigm, improves the CNN baseline by 3.12 dB. Finally,…
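The two auxiliary objectives named in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the paper's formulation: the cosine-distance form of the embedding consistency term, the mean-squared-error form of the invertibility term, the toy Haar-style encoder/decoder pair, and the loss weights `w_emb` and `w_inv` are all hypothetical.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mse(x, y):
    """Mean squared error between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


def toy_transform(values):
    """Hypothetical stand-in for a waveform encoder: a pairwise Haar
    transform. It is its own inverse, so it also serves as the decoder.
    Assumes an even number of samples."""
    s = math.sqrt(2.0)
    out = []
    for i in range(0, len(values), 2):
        a, b = values[i], values[i + 1]
        out.extend([(a + b) / s, (a - b) / s])
    return out


def embedding_consistency_loss(ref_emb, est_emb):
    """Assumed form: the embedding of the extracted speech should point
    in the same direction as the embedding of the reference utterance."""
    return cosine_distance(ref_emb, est_emb)


def invertibility_loss(wave):
    """Assumed form: decoding the encoded waveform should give back
    the original waveform."""
    return mse(wave, toy_transform(toy_transform(wave)))


def total_loss(extraction_loss, ref_emb, est_emb, wave, w_emb=0.1, w_inv=0.1):
    """Weighted sum of the main loss and the two auxiliary terms.
    The weights are illustrative placeholders."""
    return (extraction_loss
            + w_emb * embedding_consistency_loss(ref_emb, est_emb)
            + w_inv * invertibility_loss(wave))
```

In this sketch the invertibility term is essentially zero because the toy transform is exactly invertible; with a learned encoder and decoder it would act as a reconstruction penalty during training.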
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing
