AvaTr: One-Shot Speaker Extraction with Transformers
Shell Xu Hu, Md Rifat Arefin, Viet-Nhat Nguyen, Alish Dipani, Xaq Pitkow, Andreas Savas Tolias

TL;DR
AvaTr is a Transformer-based approach to one-shot speaker extraction that uses the target speaker's voice characteristics to guide selective attention. It matches or exceeds published state-of-the-art results, including on speakers not seen during training.
Contribution
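The paper proposes two Transformer models that incorporate the target speaker's voice characteristics, each based on a different view of where feature selection should take place. Both perform on par with or better than published state-of-the-art speaker extraction models.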
Findings
Performs on par with or better than published state-of-the-art models on the speaker extraction task.
Extracts the voices of novel speakers not seen during training.
Targets mixtures that contain white noise, ambient noise, or the voices of interfering speakers.
Abstract
To extract the voice of a target speaker when mixed with a variety of other sounds, such as white and ambient noises or the voices of interfering speakers, we extend the Transformer network to attend to the most relevant information with respect to the target speaker, given the characteristics of his or her voice as a form of contextual information. The idea has a natural interpretation in terms of selective attention theory. Specifically, we propose two models that incorporate the voice characteristics in the Transformer, based on different insights into where the feature selection should take place. Both models yield excellent performance, on par with or better than published state-of-the-art models on the speaker extraction task, including separating the speech of novel speakers not seen during training.
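The general idea of using voice characteristics as context for attention can be illustrated with a toy sketch. This is not either of the paper's two models, which the abstract does not specify. It assumes a single attention query formed from a hypothetical speaker embedding and scores each mixture frame by scaled dot-product similarity. The names `speaker_conditioned_attention`, `frames`, and `speaker_embedding` are illustrative assumptions, and the tiny vectors stand in for learned features.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def speaker_conditioned_attention(frames, speaker_embedding):
    """Weight mixture frames by similarity to a speaker embedding.

    frames: list of T feature vectors (lists of d floats); hypothetical
        stand-ins for encoded frames of the audio mixture.
    speaker_embedding: list of d floats; a hypothetical vector summarizing
        the target speaker's voice characteristics.
    Returns (weights, context): T attention weights that sum to 1, and the
    weighted sum of the frames.
    """
    d = len(speaker_embedding)
    # Scaled dot product between the speaker "query" and each frame "key".
    scores = [
        sum(f * s for f, s in zip(frame, speaker_embedding)) / math.sqrt(d)
        for frame in frames
    ]
    weights = softmax(scores)
    # The frames also serve as the "values" in this toy setup.
    context = [
        sum(w * frame[j] for w, frame in zip(weights, frames))
        for j in range(d)
    ]
    return weights, context


# Toy data (assumed): frames 0 and 2 resemble the target speaker, frame 1 does not.
frames = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
speaker = [1.0, 0.0]
weights, context = speaker_conditioned_attention(frames, speaker)
```

In this sketch the frames most similar to the speaker embedding receive the largest weights, so the resulting context vector is dominated by them. A full model would use learned projections for queries, keys, and values and would produce an output per frame rather than a single summary vector.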
