CueNet: Robust Audio-Visual Speaker Extraction through Cross-Modal Cue Mining and Interaction

Jiadong Wang; Ke Zhang; Xinyuan Qian; Ruijie Tao; Haizhou Li; Björn Schuller

arXiv:2603.01530 · cs.MM · March 3, 2026

Open Access

TL;DR

CueNet is an audio-visual speaker extraction model that improves robustness to visual impairments by disentangling cues and integrating them through a dedicated interaction module. It outperforms existing methods in degraded scenarios.

Contribution

The paper introduces CueNet, a new approach that improves robustness to visual degradation without training on degraded videos by disentangling and integrating multiple cues.

Findings

01 CueNet outperforms existing methods under various visual degradations.

02 Disentangling cues improves robustness without requiring degraded training data.

03 The interaction module effectively combines cues for reliable speaker extraction.

Abstract

Audio-visual speaker extraction has attracted increasing attention, as it removes the need for pre-registered speech and leverages the visual modality as a complement to audio. Although existing methods have achieved impressive performance, the issue of degraded visual inputs has received relatively little attention, despite being common in real-world scenarios. Previous attempts to address this problem have mainly involved training with degraded visual data. However, visual degradation can occur in many unpredictable ways, making it impractical to simulate all possible cases during training. In this paper, we aim to enhance the robustness of audio-visual speaker extraction against impaired visual inputs without relying on degraded videos during training. Inspired by observations from human perceptual mechanisms, we propose an audio-visual learner that disentangles speaker information,…

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Video Analysis and Summarization