Plug-and-Play Co-Occurring Face Attention for Robust Audio-Visual Speaker Extraction
Zexu Pan, Shengkui Zhao, Tingting Wang, Kun Zhou, Yukun Ma, Chong Zhang, Bin Ma

TL;DR
This paper presents a plug-and-play attention module that leverages multiple co-occurring faces to improve audio-visual speaker extraction, demonstrating enhanced accuracy and robustness across various complex multi-person scenarios.
Contribution
Introduces a plug-and-play inter-speaker attention module that handles a flexible number of co-occurring faces and is integrated into two existing models, AV-DPRNN and AV-TFGridNet, to improve speaker extraction in multi-person environments.
Findings
Consistently outperforms baseline models on the highly overlapped VoxCeleb2 and sparsely overlapped MISP datasets.
Shows robustness in cross-dataset evaluations on LRS2 and LRS3.
Improves speaker extraction accuracy in complex multi-person scenes.
Abstract
Audio-visual speaker extraction isolates a target speaker's speech from a mixture speech signal conditioned on a visual cue, typically using the target speaker's face recording. However, in real-world scenarios, other co-occurring faces are often present on-screen, providing valuable speaker activity cues in the scene. In this work, we introduce a plug-and-play inter-speaker attention module to process these flexible numbers of co-occurring faces, allowing for more accurate speaker extraction in complex multi-person environments. We integrate our module into two prominent models: the AV-DPRNN and the state-of-the-art AV-TFGridNet. Extensive experiments on diverse datasets, including the highly overlapped VoxCeleb2 and sparsely overlapped MISP, demonstrate that our approach consistently outperforms baselines. Furthermore, cross-dataset evaluations on LRS2 and LRS3 confirm the robustness…
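The abstract does not spell out the module's internals, so the following is only a minimal sketch of the general idea: attention across a variable number of co-occurring speakers at a single time frame. It assumes plain scaled dot-product self-attention over the speaker axis with a residual connection and no learned projections. The function names `softmax` and `inter_speaker_attention` and the toy embeddings are hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def inter_speaker_attention(faces):
    """Attend across speakers for one time frame.

    faces: list of per-speaker embedding vectors (any number >= 1,
    all of the same dimension). Returns one updated vector per speaker:
    the original embedding plus an attention-weighted sum over all
    speakers. Assumption: no learned query/key/value projections.
    """
    dim = len(faces[0])
    scale = math.sqrt(dim)
    updated = []
    for query in faces:
        # Similarity of this speaker to every speaker in the scene.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in faces]
        weights = softmax(scores)
        # Weighted sum of all speaker embeddings, one value per dimension.
        context = [sum(w * v[d] for w, v in zip(weights, faces)) for d in range(dim)]
        # Residual connection keeps each speaker's own embedding.
        updated.append([q + c for q, c in zip(query, context)])
    return updated


# The same function handles one, two, or more on-screen faces.
solo = inter_speaker_attention([[1.0, 2.0]])
trio = inter_speaker_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(len(solo), len(trio))
```

Because the softmax runs over however many speaker embeddings are supplied, the operation does not depend on a fixed number of faces, and the output keeps the input's shape. Both properties are consistent with a module that can be inserted between layers of an existing model without changing its interface.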
Taxonomy
Topics: Speech and Audio Processing · Face recognition and analysis · Speech Recognition and Synthesis
Methods: Softmax · Attention Is All You Need
