CueNet: Robust Audio-Visual Speaker Extraction through Cross-Modal Cue Mining and Interaction
Jiadong Wang, Ke Zhang, Xinyuan Qian, Ruijie Tao, Haizhou Li, Björn Schuller

TL;DR
CueNet is a novel audio-visual speaker extraction model that enhances robustness against visual impairments by disentangling cues and integrating them through a dedicated interaction module, outperforming existing methods in degraded scenarios.
Contribution
The paper introduces CueNet, which disentangles multiple cues and integrates them, improving robustness to visual degradation without training on degraded videos.
Findings
CueNet outperforms existing methods under various visual degradations.
Disentangling cues enhances robustness without degraded training data.
The interaction module effectively combines cues for reliable speaker extraction.
Abstract
Audio-visual speaker extraction has attracted increasing attention, as it removes the need for pre-registered speech and leverages the visual modality as a complement to audio. Although existing methods have achieved impressive performance, the issue of degraded visual inputs has received relatively little attention, despite being common in real-world scenarios. Previous attempts to address this problem have mainly involved training with degraded visual data. However, visual degradation can occur in many unpredictable ways, making it impractical to simulate all possible cases during training. In this paper, we aim to enhance the robustness of audio-visual speaker extraction against impaired visual inputs without relying on degraded videos during training. Inspired by observations from human perceptual mechanisms, we propose an audio-visual learner that disentangles speaker information,…
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Video Analysis and Summarization
