Learning Speaker-Invariant Visual Features for Lipreading

Yu Li; Feng Xue; Shujie Li; Jinrui Zhang; Shuang Yang; Dan Guo; Richang Hong

arXiv:2506.07572·cs.CV·June 10, 2025

TL;DR

This paper introduces SIFLip, a novel framework for lipreading that learns speaker-invariant visual features by disentangling speaker-specific attributes, thereby improving cross-speaker generalization and accuracy.

Contribution

SIFLip employs implicit and explicit disentanglement modules to effectively decouple speaker-specific features from visual lip representations, enhancing lipreading performance.

Findings

01. Outperforms state-of-the-art methods on multiple datasets
02. Significantly improves cross-speaker generalization
03. Effectively disentangles speaker-specific attributes

Abstract

Lipreading is a challenging cross-modal task that aims to convert visual lip movements into spoken text. Existing lipreading methods often extract visual features that include speaker-specific lip attributes (e.g., shape, color, texture), which introduce spurious correlations between vision and text. These correlations lead to suboptimal lipreading accuracy and restrict model generalization. To address this challenge, we introduce SIFLip, a speaker-invariant visual feature learning framework that disentangles speaker-specific attributes using two complementary disentanglement modules (Implicit Disentanglement and Explicit Disentanglement) to improve generalization. Specifically, since different speakers exhibit semantic consistency between lip movements and phonetic text when pronouncing the same words, our implicit disentanglement module leverages stable text embeddings as supervisory…
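The idea of using a shared text embedding as a supervisory target can be illustrated with a toy sketch. This is not the paper's formulation: the cosine-based `alignment_loss`, the three-dimensional vectors, and the assumption that visual features and text embeddings share one space are all hypothetical choices made for illustration. The sketch only shows why pulling every speaker's features toward the same text embedding penalizes speaker-specific components.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def alignment_loss(visual_feature, text_embedding):
    """Hypothetical loss: 1 - cosine similarity (zero when directions match)."""
    return 1.0 - cosine_similarity(visual_feature, text_embedding)


# Toy data (assumed, not from the paper): one text embedding for a word,
# plus visual features of two speakers pronouncing that word. Each speaker's
# feature is the shared content direction plus a speaker-specific offset.
text_embedding = [1.0, 0.0, 0.0]
speaker_a = [1.0, 0.5, 0.0]    # content + offset specific to speaker A
speaker_b = [1.0, 0.0, -0.5]   # content + offset specific to speaker B
invariant = [1.0, 0.0, 0.0]    # content only, no speaker offset

loss_a = alignment_loss(speaker_a, text_embedding)
loss_b = alignment_loss(speaker_b, text_embedding)
loss_invariant = alignment_loss(invariant, text_embedding)

print(f"speaker A loss:  {loss_a:.4f}")
print(f"speaker B loss:  {loss_b:.4f}")
print(f"invariant loss:  {loss_invariant:.4f}")
```

In this toy setting the loss is zero only for the feature with no speaker offset, so minimizing it across speakers would push the features toward the component they share. A real system would learn these features with a neural encoder rather than fix them by hand.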

Taxonomy

Topics: Speech and Audio Processing · Face recognition and analysis · Facial Nerve Paralysis Treatment and Research