Plug-and-Play Co-Occurring Face Attention for Robust Audio-Visual Speaker Extraction

Zexu Pan; Shengkui Zhao; Tingting Wang; Kun Zhou; Yukun Ma; Chong Zhang; Bin Ma

arXiv:2505.20635·eess.AS·May 28, 2025

Open Access

TL;DR

This paper presents a plug-and-play attention module that leverages multiple co-occurring faces to improve audio-visual speaker extraction, demonstrating enhanced accuracy and robustness across various complex multi-person scenarios.

Contribution

Introduces a plug-and-play inter-speaker attention module that handles a variable number of co-occurring faces and integrates it into two existing models, AV-DPRNN and AV-TFGridNet, to improve speaker extraction in multi-person environments.

Findings

01 Outperforms baseline models on VoxCeleb2 and MISP datasets.

02 Shows robustness and generalizability across LRS2 and LRS3 datasets.

03 Enhances speaker extraction accuracy in complex scenes.

Abstract

Audio-visual speaker extraction isolates a target speaker's speech from a mixture speech signal conditioned on a visual cue, typically using the target speaker's face recording. However, in real-world scenarios, other co-occurring faces are often present on-screen, providing valuable speaker activity cues in the scene. In this work, we introduce a plug-and-play inter-speaker attention module to process these flexible numbers of co-occurring faces, allowing for more accurate speaker extraction in complex multi-person environments. We integrate our module into two prominent models: the AV-DPRNN and the state-of-the-art AV-TFGridNet. Extensive experiments on diverse datasets, including the highly overlapped VoxCeleb2 and sparsely overlapped MISP, demonstrate that our approach consistently outperforms baselines. Furthermore, cross-dataset evaluations on LRS2 and LRS3 confirm the robustness…
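
The abstract does not spell out the module's internals, so the following is only a minimal sketch of one way attention across a variable number of co-occurring face streams could work, not the authors' architecture. It assumes scaled dot-product attention taken over the speaker axis at each time frame, with the target stream as the query and a residual connection. The function name inter_speaker_attention, the projection matrices w_q, w_k, w_v, and all tensor shapes are hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def inter_speaker_attention(target, others, w_q, w_k, w_v):
    """Attend from the target stream to all on-screen streams, per frame.

    target: (T, D) embedding sequence of the target speaker (assumed shape).
    others: (N, T, D) embeddings of co-occurring faces, N >= 0 (assumed shape).
    Returns the updated target sequence (T, D) and the weights (T, N + 1).
    """
    # Stack the target with the co-occurring streams: (N + 1, T, D).
    streams = np.concatenate([target[None], others], axis=0)
    q = target @ w_q   # (T, D)
    k = streams @ w_k  # (N + 1, T, D)
    v = streams @ w_v  # (N + 1, T, D)
    # One score per (frame, stream) pair, scaled by sqrt of the feature size.
    scores = np.einsum("td,ntd->tn", q, k) / np.sqrt(q.shape[-1])
    # Normalise over the speaker axis, so any number of faces is supported.
    weights = softmax(scores, axis=-1)
    attended = np.einsum("tn,ntd->td", weights, v)
    # Residual connection (an assumption) keeps the target stream dominant.
    return target + attended, weights


# Hypothetical sizes and random data, for illustration only.
rng = np.random.default_rng(0)
T, D = 50, 16
w_q, w_k, w_v = (0.1 * rng.standard_normal((D, D)) for _ in range(3))
target = rng.standard_normal((T, D))

results = {}
for n_faces in (0, 1, 3):
    others = rng.standard_normal((n_faces, T, D))
    results[n_faces] = inter_speaker_attention(target, others, w_q, w_k, w_v)
```

Because the softmax runs over the speaker axis, the same weights serve zero, one, or several co-occurring faces without any change to the parameters. With no extra faces, the target attends only to itself. This is one plausible reading of "plug-and-play" handling of a flexible number of faces, under the assumptions stated above.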

Taxonomy

Topics: Speech and Audio Processing · Face recognition and analysis · Speech Recognition and Synthesis

Methods: Softmax · Attention Is All You Need