Binaural Selective Attention Model for Target Speaker Extraction

Hanyu Meng; Qiquan Zhang; Xiangyu Zhang; Vidhyasaharan Sethu; Eliathamby Ambikairajah

arXiv:2406.12236·eess.AS·June 19, 2024

TL;DR

This paper introduces a binaural time-domain target speaker extraction model inspired by human selective hearing, which outperforms existing monaural and multi-channel models in cocktail party scenarios.

Contribution

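The paper proposes a binaural time-domain model for target speaker extraction that injects a target speaker embedding into the separators through a multi-head attention-based selective attention block. It also compares two binaural interaction methods: cosine similarity of time-domain signals and inter-channel correlation in learned spectral representations.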

Findings

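01 Outperforms monaural configurations and state-of-the-art multi-channel target speaker extraction models

02 Achieves 18.52 dB SI-SDR, 19.12 dB SDR, and 3.05 PESQ in the anechoic two-speaker setting

03 Compares two binaural interaction methods within an attention-based selective attention model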
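The SI-SDR figure quoted above is a scale-invariant signal-to-distortion ratio. As a rough illustration of how such a score is commonly computed, the sketch below projects the estimate onto the reference and compares the energy of the projection with the energy of the residual. The function name `si_sdr`, the mean removal step, and the `eps` stabiliser are assumptions made for this example; the page does not say which evaluation toolkit the authors used.

```python
import math


def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB between two equal-length 1-D signals.

    Illustrative only: the common textbook definition, not the
    authors' evaluation code.
    """
    if len(estimate) != len(target):
        raise ValueError("signals must have the same length")
    n = len(target)

    # Remove the mean so a DC offset does not affect the score.
    mean_e = sum(estimate) / n
    mean_t = sum(target) / n
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]

    # Optimal scaling of the target that best explains the estimate.
    dot = sum(a * b for a, b in zip(e, t))
    target_energy = sum(b * b for b in t)
    alpha = dot / (target_energy + eps)

    # Split the estimate into a scaled-target part and a residual.
    projection = [alpha * b for b in t]
    residual = [a - p for a, p in zip(e, projection)]

    signal_power = sum(p * p for p in projection)
    noise_power = sum(r * r for r in residual)
    return 10.0 * math.log10((signal_power + eps) / (noise_power + eps))
```

Because the target is rescaled before the comparison, multiplying the estimate by a constant gain leaves the score essentially unchanged, which is the property that distinguishes SI-SDR from plain SDR.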

Abstract

The remarkable ability of humans to selectively focus on a target speaker in cocktail party scenarios is facilitated by binaural audio processing. In this paper, we present a binaural time-domain Target Speaker Extraction model based on the Filter-and-Sum Network (FaSNet). Inspired by human selective hearing, our proposed model introduces target speaker embedding into separators using a multi-head attention-based selective attention block. We also compare two binaural interaction approaches -- the cosine similarity of time-domain signals and inter-channel correlation in learned spectral representations. Our experimental results show that our proposed model outperforms monaural configurations and state-of-the-art multi-channel target speaker extraction models, achieving best-in-class performance with 18.52 dB SI-SDR, 19.12 dB SDR, and 3.05 PESQ scores under anechoic two-speaker test…
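To make the first of the two binaural interaction approaches more concrete, the sketch below computes a frame-wise cosine similarity between left-ear and right-ear time-domain signals. It is only a minimal illustration of the general idea: the function name `frame_cosine_similarity`, the framing parameters `frame_len` and `hop`, and the `eps` term are assumptions for this example, and the exact windowing and feature layout used in the paper are not given on this page.

```python
import math


def frame_cosine_similarity(left, right, frame_len, hop, eps=1e-8):
    """Cosine similarity between aligned left/right frames.

    Hypothetical helper for illustration; returns one value in
    [-1, 1] per frame.
    """
    n = min(len(left), len(right))
    sims = []
    for start in range(0, n - frame_len + 1, hop):
        l = left[start:start + frame_len]
        r = right[start:start + frame_len]

        # Inner product and norms of the two frames.
        dot = sum(a * b for a, b in zip(l, r))
        norm_l = math.sqrt(sum(a * a for a in l))
        norm_r = math.sqrt(sum(b * b for b in r))

        # eps guards against division by zero on silent frames.
        sims.append(dot / (norm_l * norm_r + eps))
    return sims
```

Values near 1 indicate frames where the two ears receive nearly the same waveform up to a gain, while lower or negative values point to inter-channel differences that a binaural model could exploit as a spatial cue.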

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing

Methods: Softmax · Attention Is All You Need · Focus