Beyond Lips: Integrating Gesture and Lip Cues for Robust Audio-visual Speaker Extraction
Zexu Pan, Xinyuan Qian, Shengkui Zhao, Kun Zhou, Bin Ma

TL;DR
This paper introduces SeLG, a novel audio-visual speaker extraction model that combines lip and gesture cues using cross-attention and contrastive learning, improving robustness especially when facial cues are occluded or distant.
Contribution
SeLG is the first model to effectively integrate both lip and upper-body gesture information with a cross-attention mechanism and contrastive loss for enhanced speaker extraction.
Findings
Significant performance improvement over baselines on the YGD dataset.
Effective fusion of lip and gesture cues enhances robustness.
Contrastive loss aligns gesture embeddings with speech-related lip features.
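As a rough illustration of the last finding, a contrastive InfoNCE objective can be sketched in NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `info_nce`, the `temperature` value, the use of cosine similarity, and the choice of in-batch negatives (row i of the gesture embeddings is paired with row i of the lip embeddings, and every other row serves as a negative) are all hypothetical.

```python
import numpy as np

def info_nce(gesture, lip, temperature=0.1):
    """InfoNCE loss between two sets of embeddings of shape (N, D).

    Assumption: gesture[i] and lip[i] form the positive pair; all other
    rows of `lip` act as negatives for gesture[i].
    """
    # L2-normalise so the dot product is a cosine similarity.
    g = gesture / np.linalg.norm(gesture, axis=1, keepdims=True)
    l = lip / np.linalg.norm(lip, axis=1, keepdims=True)

    logits = g @ l.T / temperature                        # (N, N) similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Cross-entropy with the diagonal (matched pairs) as the target class.
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
lip_emb = rng.normal(size=(8, 16))
gesture_emb = lip_emb + 0.01 * rng.normal(size=(8, 16))   # nearly aligned pairs

print("aligned pairs:   ", info_nce(gesture_emb, lip_emb))
print("mismatched pairs:", info_nce(gesture_emb, np.roll(lip_emb, 1, axis=0)))
```

In this toy setup the loss is lower when each gesture embedding sits close to its matching lip embedding, which is the behaviour such an objective is meant to encourage.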
Abstract
Most audio-visual speaker extraction methods rely on synchronized lip recording to isolate the speech of a target speaker from a multi-talker mixture. However, in natural human communication, co-speech gestures are also temporally aligned with speech, often emphasizing specific words or syllables. These gestures provide complementary visual cues that can be especially valuable when facial or lip regions are occluded or distant. In this work, we move beyond lip-centric approaches and propose SeLG, a model that integrates both lip and upper-body gesture information for robust speaker extraction. SeLG features a cross-attention-based fusion mechanism that enables each visual modality to query and selectively attend to relevant speech features in the mixture. To improve the alignment of gesture representations with speech dynamics, SeLG also employs a contrastive InfoNCE loss that…
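The cross-attention idea described in the abstract, where a visual stream queries the speech features of the mixture, can be sketched with single-head scaled dot-product attention. This is a minimal sketch under assumptions: the function names, feature dimensions, random projection matrices, and the final concatenation of the two attended streams are hypothetical and are not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(visual, speech, w_q, w_k, w_v):
    """Visual frames act as queries; mixture speech frames as keys and values.

    visual: (T_v, d_vis), speech: (T_s, d_sp).
    Returns the attended features (T_v, d) and attention weights (T_v, T_s).
    """
    q = visual @ w_q                           # (T_v, d)
    k = speech @ w_k                           # (T_s, d)
    v = speech @ w_v                           # (T_s, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # scaled dot-product scores
    weights = softmax(scores, axis=-1)         # each query row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
d_vis, d_sp, d = 12, 20, 8                     # hypothetical feature sizes
lip = rng.normal(size=(25, d_vis))             # 25 lip frames
gesture = rng.normal(size=(25, d_vis))         # 25 gesture frames
speech = rng.normal(size=(100, d_sp))          # 100 mixture speech frames

def random_projections():
    # Stand-ins for learned projection matrices (assumption).
    return (rng.normal(size=(d_vis, d)),
            rng.normal(size=(d_sp, d)),
            rng.normal(size=(d_sp, d)))

# Each visual modality gets its own projections and attends independently.
lip_out, _ = cross_attention(lip, speech, *random_projections())
gesture_out, _ = cross_attention(gesture, speech, *random_projections())

# Hypothetical fusion step: concatenate the two attended streams per frame.
fused = np.concatenate([lip_out, gesture_out], axis=-1)
print(fused.shape)
```

Keeping separate projections per modality lets the lip and gesture streams attend to different parts of the mixture, so one stream can still contribute when the other is occluded or uninformative.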
Taxonomy
Topics: Speech and Audio Processing · Face recognition and analysis · Speech Recognition and Synthesis
