Multi-Grained Spatio-temporal Modeling for Lip-reading

Chenhao Wang

arXiv:1908.11618 · cs.CV · September 4, 2019 · 40 cites

TL;DR

This paper introduces a multi-grained spatio-temporal lip-reading model that captures both the fine differences between similar-looking words and the speaking styles of individual speakers, with the aim of improving recognition accuracy across diverse speakers.

Contribution

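The paper proposes a visual front-end that extracts features at two levels (frame-level fine-grained and short-term medium-grained) and a bidirectional ConvLSTM with temporal attention that aggregates them over the whole sequence, addressing two challenges: words with near-identical lip movements and variability among speakers.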

Findings

01 Effective in distinguishing words with similar phonemes

02 Robust to speaker and lighting variations

03 Outperforms existing methods on benchmark datasets

Abstract

Lip-reading aims to recognize speech content from videos via visual analysis of speakers' lip movements. This is a challenging task due to the existence of homophemes (words which involve identical or highly similar lip movements), as well as diverse lip appearances and motion patterns among speakers. To address these challenges, we propose a novel lip-reading model which captures not only the nuances between words but also the styles of different speakers, through multi-grained spatio-temporal modeling of the speaking process. Specifically, we first extract both frame-level fine-grained features and short-term medium-grained features with the visual front-end, which are then combined to obtain discriminative representations for words with similar phonemes. Next, a bidirectional ConvLSTM augmented with temporal attention aggregates spatio-temporal information in the entire input sequence, which…
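
As a rough illustration of two ideas named in the abstract, the sketch below combines per-frame fine-grained and medium-grained feature vectors and then pools the sequence with softmax temporal attention. It is a minimal pure-Python sketch, not the paper's implementation: the abstract does not say how the features are combined or how attention scores are computed, so concatenation as the fusion rule, dot-product scoring, and the names fuse, temporal_attention_pool, and query are all assumptions made for illustration. The bidirectional ConvLSTM itself is omitted.

```python
import math


def fuse(fine, medium):
    """Concatenate fine- and medium-grained vectors frame by frame.

    Concatenation is an assumed fusion rule; the abstract only says the
    two feature types are "combined".
    """
    return [f + m for f, m in zip(fine, medium)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention_pool(features, query):
    """Weight each time step by softmax(dot(frame, query)) and sum.

    features: list of T per-frame vectors of equal length.
    query: hypothetical scoring vector standing in for learned parameters.
    Returns the pooled vector and the per-frame attention weights.
    """
    scores = [sum(x * q for x, q in zip(frame, query)) for frame in features]
    weights = softmax(scores)
    dim = len(features[0])
    pooled = [
        sum(w * frame[d] for w, frame in zip(weights, features))
        for d in range(dim)
    ]
    return pooled, weights


# Toy usage: three frames, 2-d fine features and 1-d medium features.
fine = [[0.1, 0.9], [0.4, 0.2], [0.8, 0.3]]
medium = [[0.5], [0.7], [0.1]]
sequence = fuse(fine, medium)  # three 3-d vectors
summary, attn = temporal_attention_pool(sequence, query=[1.0, 0.0, 0.5])
```

The attention weights sum to one, so frames with higher scores contribute more to the pooled representation; in a trained model the scores would come from learned parameters rather than a fixed query vector.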

Taxonomy

Topics: Speech and Audio Processing · Face recognition and analysis · Hand Gesture Recognition Systems

Methods: Tanh Activation · Sigmoid Activation · Convolution · ConvLSTM