Beyond Lips: Integrating Gesture and Lip Cues for Robust Audio-visual Speaker Extraction

Zexu Pan; Xinyuan Qian; Shengkui Zhao; Kun Zhou; Bin Ma

arXiv:2601.19130 · eess.AS · January 28, 2026

Open Access

TL;DR

This paper introduces SeLG, a novel audio-visual speaker extraction model that combines lip and gesture cues using cross-attention and contrastive learning, improving robustness especially when facial cues are occluded or distant.

Contribution

SeLG is the first model to effectively integrate both lip and upper-body gesture information with a cross-attention mechanism and contrastive loss for enhanced speaker extraction.

Findings

1. Significant performance improvement over baselines on the YGD dataset.

2. Effective fusion of lip and gesture cues enhances robustness.

3. Contrastive loss aligns gesture embeddings with speech-related lip features.
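
To make the contrastive finding concrete, here is a minimal sketch of an InfoNCE-style loss between gesture and lip embeddings. It is an illustration under stated assumptions, not the paper's implementation: the pairing scheme (the gesture and lip embedding at the same time index form the positive pair, all other time indices act as negatives), the cosine similarity, the temperature of 0.1, and the names `info_nce`, `gesture`, and `lip` are all hypothetical choices made for this example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)

def info_nce(gesture, lip, temperature=0.1):
    """InfoNCE loss averaged over frames.

    Assumption for this sketch: gesture[i] and lip[i] are a positive pair,
    and lip[j] for j != i are the negatives for gesture[i].
    """
    total = 0.0
    for i, g in enumerate(gesture):
        logits = [cosine(g, l) / temperature for l in lip]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        # Negative log-probability of picking the positive pair.
        total += log_denom - logits[i]
    return total / len(gesture)

# Toy embeddings: three frames, two dimensions (made-up values).
lip = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned_loss = info_nce(lip, lip)                          # gestures match lips
shuffled_loss = info_nce([lip[1], lip[2], lip[0]], lip)    # gestures misaligned
```

In this toy setup the loss is close to zero when gesture embeddings line up with the lip embeddings frame by frame, and much larger when they are shifted in time, which is the behavior an alignment objective of this kind is meant to reward.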

Abstract

Most audio-visual speaker extraction methods rely on synchronized lip recording to isolate the speech of a target speaker from a multi-talker mixture. However, in natural human communication, co-speech gestures are also temporally aligned with speech, often emphasizing specific words or syllables. These gestures provide complementary visual cues that can be especially valuable when facial or lip regions are occluded or distant. In this work, we move beyond lip-centric approaches and propose SeLG, a model that integrates both lip and upper-body gesture information for robust speaker extraction. SeLG features a cross-attention-based fusion mechanism that enables each visual modality to query and selectively attend to relevant speech features in the mixture. To improve the alignment of gesture representations with speech dynamics, SeLG also employs a contrastive InfoNCE loss that…
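
The fusion idea in the abstract, where a visual modality queries the speech features of the mixture, can be sketched with plain scaled dot-product cross-attention. This is a simplified illustration and not SeLG's actual architecture: it uses a single head with no learned projections, and the names `cross_attention`, `lip_frames`, and `mixture_frames`, as well as the toy values, are assumptions made for this example.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    queries: visual frames (e.g. lip or gesture embeddings), shape [Tv][d]
    keys, values: speech-mixture frames, shape [Ta][d]
    Returns the attended features [Tv][d] and the attention weights [Tv][Ta].
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs, weights = [], []
    for q in queries:
        # Similarity of this visual frame to every mixture frame.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        w = softmax(scores)
        # Weighted sum of mixture features, one output vector per query.
        out = [sum(wj * v[c] for wj, v in zip(w, values))
               for c in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

# Toy example: two visual frames attend over three mixture frames.
lip_frames = [[1.0, 0.0], [0.0, 1.0]]
mixture_frames = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
attended, attn = cross_attention(lip_frames, mixture_frames, mixture_frames)
```

Each visual frame ends up with its own weighting over the mixture frames, so the first toy lip frame attends mostly to the first mixture frame and the second to the second. In a model with several visual streams, the same routine could be run once per stream (for example once with lip queries and once with gesture queries) before combining the results; how SeLG combines them is not specified in the text above.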

Taxonomy

Topics: Speech and Audio Processing · Face recognition and analysis · Speech Recognition and Synthesis