Binaural Selective Attention Model for Target Speaker Extraction
Hanyu Meng, Qiquan Zhang, Xiangyu Zhang, Vidhyasaharan Sethu, Eliathamby Ambikairajah

TL;DR
This paper introduces a binaural time-domain target speaker extraction model inspired by human selective hearing, which outperforms existing monaural and multi-channel models in cocktail party scenarios.
Contribution
The paper proposes a novel binaural time-domain model with multi-head attention for target speaker extraction, incorporating target speaker embedding and comparing binaural interaction methods.
Findings
Outperforms monaural and multi-channel models
Achieves 18.52 dB SI-SDR, 19.12 dB SDR, and 3.05 PESQ (anechoic, two speakers)
Uses multi-head attention to inject the target speaker embedding and compares two binaural interaction methods
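For readers unfamiliar with the SI-SDR metric quoted in the findings, the following is a minimal sketch of the standard scale-invariant SDR definition: the estimate is projected onto the reference, and the energy of that projection is compared with the energy of the residual. The function name `si_sdr`, the `eps` stabiliser, and the mean-removal step are assumptions made for illustration; this is not the paper's evaluation code.

```python
import math


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant SDR in dB for two equal-length sample sequences.

    Illustrative sketch of the common definition; not taken from the paper.
    """
    if len(estimate) != len(reference):
        raise ValueError("estimate and reference must have the same length")

    n = len(reference)
    # Remove any DC offset so the metric ignores constant shifts.
    est_mean = sum(estimate) / n
    ref_mean = sum(reference) / n
    est = [x - est_mean for x in estimate]
    ref = [x - ref_mean for x in reference]

    # Project the estimate onto the reference to get the scaled target.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref)
    scale = dot / (ref_energy + eps)
    target = [scale * r for r in ref]

    # Everything not explained by the scaled reference counts as distortion.
    residual = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(x * x for x in residual)

    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))
```

Because the reference is rescaled before the comparison, multiplying the estimate by a constant gain leaves the score essentially unchanged, which is why the metric is preferred over plain SDR when output level is arbitrary.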
Abstract
The remarkable ability of humans to selectively focus on a target speaker in cocktail party scenarios is facilitated by binaural audio processing. In this paper, we present a binaural time-domain Target Speaker Extraction model based on the Filter-and-Sum Network (FaSNet). Inspired by human selective hearing, our proposed model introduces target speaker embedding into separators using a multi-head attention-based selective attention block. We also compared two binaural interaction approaches -- the cosine similarity of time-domain signals and inter-channel correlation in learned spectral representations. Our experimental results show that our proposed model outperforms monaural configurations and state-of-the-art multi-channel target speaker extraction models, achieving best-in-class performance with 18.52 dB SI-SDR, 19.12 dB SDR, and 3.05 PESQ scores under anechoic two-speaker test…
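One of the two binaural interaction approaches named in the abstract is the cosine similarity of time-domain signals. As a rough illustration of that idea only, the sketch below computes a frame-wise cosine similarity between the left and right channels. The function name `frame_cosine_similarity`, the `frame_len` and `hop` parameters, and the `eps` stabiliser are hypothetical; the model's actual feature may be computed differently (for example, over context windows and time lags).

```python
import math


def frame_cosine_similarity(left, right, frame_len, hop, eps=1e-8):
    """Cosine similarity between left/right channels, one value per frame.

    Illustrative sketch; the framing scheme is an assumption, not the paper's.
    """
    if len(left) != len(right):
        raise ValueError("left and right must have the same length")

    sims = []
    # Slide a window over both channels in lockstep.
    for start in range(0, len(left) - frame_len + 1, hop):
        l = left[start:start + frame_len]
        r = right[start:start + frame_len]
        dot = sum(a * b for a, b in zip(l, r))
        norm = math.sqrt(sum(a * a for a in l)) * math.sqrt(sum(b * b for b in r))
        # eps avoids division by zero on silent frames.
        sims.append(dot / (norm + eps))
    return sims
```

Values near 1 indicate frames where the two ears receive nearly identical waveforms up to gain, and values near -1 indicate phase-inverted frames; a separator can use such cues as spatial evidence alongside the speaker embedding.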
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing
Methods: Softmax · Attention Is All You Need · Focus
