Audio-Visual Target Speaker Extraction with Reverse Selective Auditory Attention

Ruijie Tao, Xinyuan Qian, Yidi Jiang, Junjie Li, Jiadong Wang, and Haizhou Li

arXiv:2404.18501·eess.AS·March 4, 2025·1 cites

Open Access · 1 Repo

TL;DR

This paper introduces SEANet, a novel audio-visual speaker extraction framework that employs a reverse selective auditory attention mechanism to suppress noise and interference, achieving state-of-the-art results across multiple datasets.

Contribution

The paper proposes a new selective auditory attention mechanism and a framework called SEANet that effectively suppresses interference in audio-visual speaker extraction tasks.

Findings

01. SEANet outperforms baseline methods on five datasets.
02. The proposed method achieves state-of-the-art results.
03. SEANet effectively suppresses noise and interference.

Abstract

Audio-visual target speaker extraction (AV-TSE) aims to extract a specific person's speech from an audio mixture given auxiliary visual cues. Previous methods usually search for the target voice through speech-lip synchronization. However, this strategy mainly focuses on the existence of target speech, while ignoring the variations of the noise characteristics, i.e., the interfering speakers and the background noise. This may result in extracting noisy signals from the incorrect sound source in challenging acoustic situations. To this end, we propose a novel selective auditory attention mechanism, which can suppress interfering speakers and non-speech signals to avoid incorrect speaker extraction. By estimating and utilizing the undesired noisy signal through this mechanism, we design an AV-TSE framework named Subtraction-and-ExtrAction network (SEANet) to suppress the noisy signals. We…
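
The abstract describes estimating the undesired noisy signal and using that estimate to suppress it. The sketch below is only a toy, signal-domain illustration of that subtraction idea, not SEANet itself: the real framework is a neural network that also uses visual cues, and its architecture is not given on this page. The pure tones, the function names (`make_tone`, `subtract_noise`, `energy`) and the assumed 90%-accurate noise estimate are all hypothetical choices made for illustration.

```python
import math


def make_tone(freq_hz, n_samples, sample_rate=8000, amplitude=1.0):
    """Return a sine tone as a list of floats (stand-in for a speech signal)."""
    return [
        amplitude * math.sin(2 * math.pi * freq_hz * t / sample_rate)
        for t in range(n_samples)
    ]


def energy(signal):
    """Sum of squared samples."""
    return sum(x * x for x in signal)


def subtract_noise(mixture, noise_estimate):
    """Subtract an estimated noise signal from the mixture, sample by sample."""
    if len(mixture) != len(noise_estimate):
        raise ValueError("mixture and noise estimate must have equal length")
    return [m - n for m, n in zip(mixture, noise_estimate)]


# Toy scene: one target "speaker" plus one interfering "speaker".
target = make_tone(220.0, 800)
interference = make_tone(330.0, 800, amplitude=0.8)
mixture = [t + i for t, i in zip(target, interference)]

# Assumption: some estimator recovers 90% of the interference.
# In a real system this estimate would come from a learned model.
noise_estimate = [0.9 * i for i in interference]

extracted = subtract_noise(mixture, noise_estimate)

# Residual error energy relative to the clean target, before and after.
error_before = energy([m - t for m, t in zip(mixture, target)])
error_after = energy([e - t for e, t in zip(extracted, target)])

print(f"error energy before subtraction: {error_before:.3f}")
print(f"error energy after subtraction:  {error_after:.3f}")
```

In this toy setting the leftover error is just the part of the interference that the estimate missed, so a better noise estimate translates directly into a cleaner extracted signal.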

Code & Models

Repositories

taoruijie/seanet (PyTorch, official)

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis