Multi-Grained Spatio-temporal Modeling for Lip-reading
Chenhao Wang

TL;DR
This paper introduces a multi-grained spatio-temporal model for lip-reading that captures both fine-grained frame-level and medium-grained short-term lip movements, improving recognition accuracy across diverse speakers and conditions.
Contribution
It proposes a multi-level feature extraction scheme and a bidirectional ConvLSTM with temporal attention for robust lip-reading, addressing the challenges of homophemes (words with identical or highly similar lip movements) and speaker variability.
Findings
Effective in distinguishing words with similar phonemes
Robust to speaker and lighting variations
Outperforms existing methods on benchmark datasets
Abstract
Lip-reading aims to recognize speech content from videos via visual analysis of speakers' lip movements. This is a challenging task due to the existence of homophemes, i.e., words that involve identical or highly similar lip movements, as well as the diverse lip appearances and motion patterns among speakers. To address these challenges, we propose a novel lip-reading model that captures not only the nuances between words but also the styles of different speakers, through multi-grained spatio-temporal modeling of the speaking process. Specifically, we first extract both frame-level fine-grained features and short-term medium-grained features with the visual front-end, which are then combined to obtain discriminative representations for words with similar phonemes. Next, a bidirectional ConvLSTM augmented with temporal attention aggregates spatio-temporal information in the entire input sequence, which…
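The pipeline described in the abstract (fuse per-frame and short-term features, then aggregate over time with attention) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the learned visual front-end is replaced by a simple moving average, the bidirectional ConvLSTM is omitted entirely, and the attention scores come from a single random vector. All names here (`short_term_features`, `temporal_attention_pool`, `w`) are hypothetical and exist only for illustration.

```python
import numpy as np


def short_term_features(frames, window=3):
    """Average each frame with its neighbours over an odd-sized window.

    Assumption: a stand-in for the short-term (medium-grained) branch;
    the paper uses a learned visual front-end, not a moving average.
    """
    num_frames = frames.shape[0]
    pad = window // 2
    # Repeat the edge frames so every position has a full window.
    padded = np.pad(frames, ((pad, pad), (0, 0)), mode="edge")
    return np.stack([padded[t:t + window].mean(axis=0) for t in range(num_frames)])


def temporal_attention_pool(features, w):
    """Softmax-weighted sum over the time axis.

    Assumption: scores are a plain dot product with a vector `w`;
    the paper's attention mechanism may be parameterised differently.
    """
    scores = features @ w                      # one score per frame, shape (T,)
    scores = scores - scores.max()             # subtract max for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ features, weights         # pooled vector (D,), weights (T,)


rng = np.random.default_rng(0)
T, D = 8, 4                                    # hypothetical: 8 frames, 4-dim features
fine = rng.normal(size=(T, D))                 # frame-level (fine-grained) features
medium = short_term_features(fine, window=3)   # short-term (medium-grained) features
fused = np.concatenate([fine, medium], axis=1) # combine both granularities, shape (T, 2D)

w = rng.normal(size=2 * D)                     # hypothetical attention parameters
pooled, weights = temporal_attention_pool(fused, w)
```

The attention weights sum to one across frames, so `pooled` is a convex combination of the fused per-frame features; frames with higher scores contribute more to the sequence-level representation.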
Taxonomy
Topics: Speech and Audio Processing · Face recognition and analysis · Hand Gesture Recognition Systems
Methods: Tanh Activation · Sigmoid Activation · Convolution · ConvLSTM
