Can DNNs Learn to Lipread Full Sentences?
George Sterpu, Christian Saam, Naomi Harte

TL;DR
This paper investigates deep neural network architectures for full-sentence lipreading. It reports a major improvement over a Hidden Markov Model framework and shows that the network actually learns to lipread rather than only learning a language model.
Contribution
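The paper applies a sequence-to-sequence recurrent neural network to lipreading and compares hand-crafted and 2D/3D CNN visual front-ends, online monotonic attention, and a joint Connectionist Temporal Classification (CTC) and sequence-to-sequence loss, evaluated on the TCD-TIMIT dataset.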
Findings
Major improvement over Hidden Markov Model frameworks
Network learns to lipread beyond language modeling
Effective use of CNN front-ends and attention mechanisms
Abstract
Finding visual features and suitable models for lipreading tasks that are more complex than a well-constrained vocabulary has proven challenging. This paper explores state-of-the-art Deep Neural Network architectures for lipreading based on a Sequence to Sequence Recurrent Neural Network. We report results for both hand-crafted and 2D/3D Convolutional Neural Network visual front-ends, online monotonic attention, and a joint Connectionist Temporal Classification-Sequence-to-Sequence loss. The system is evaluated on the publicly available TCD-TIMIT dataset, with 59 speakers and a vocabulary of over 6000 words. Results show a major improvement on a Hidden Markov Model framework. A fuller analysis of performance across visemes demonstrates that the network is not only learning the language model, but actually learning to lipread.
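The joint CTC and sequence-to-sequence loss mentioned in the abstract can be illustrated with a small sketch. The abstract does not say how the two terms are combined, so the weighted sum below, the `ctc_weight` value, and all function names are assumptions for illustration, not the authors' implementation. The decoder term is simplified to a teacher-forced cross-entropy without an end-of-sentence symbol.

```python
import math

NEG_INF = float("-inf")

def log_add(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def ctc_neg_log_likelihood(log_probs, target, blank=0):
    """CTC loss computed with the forward algorithm.

    log_probs: one list per video frame of log-probabilities over symbols.
    target: list of symbol ids, without blanks.
    """
    # Interleave blanks around every target symbol: b, y1, b, y2, b, ...
    ext = [blank]
    for sym in target:
        ext += [sym, blank]
    size = len(ext)

    # A valid path starts on the leading blank or the first symbol.
    alpha = [NEG_INF] * size
    alpha[0] = log_probs[0][ext[0]]
    if size > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, len(log_probs)):
        new = [NEG_INF] * size
        for s in range(size):
            acc = alpha[s]                      # stay on the same state
            if s >= 1:
                acc = log_add(acc, alpha[s - 1])  # advance by one state
            # Skip over a blank only between two different symbols.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                acc = log_add(acc, alpha[s - 2])
            if acc != NEG_INF:
                new[s] = acc + log_probs[t][ext[s]]
        alpha = new

    # A valid path ends on the last symbol or the trailing blank.
    total = alpha[-1]
    if size > 1:
        total = log_add(total, alpha[-2])
    return -total

def seq2seq_cross_entropy(step_log_probs, target):
    """Negative log-probability of the reference symbols, one decoder step each."""
    return -sum(step_log_probs[i][y] for i, y in enumerate(target))

def joint_loss(ctc_log_probs, dec_log_probs, target, ctc_weight=0.2):
    """Weighted sum of the two losses; ctc_weight is a hypothetical setting."""
    ctc = ctc_neg_log_likelihood(ctc_log_probs, target)
    ce = seq2seq_cross_entropy(dec_log_probs, target)
    return ctc_weight * ctc + (1.0 - ctc_weight) * ce
```

In this sketch the CTC term scores the encoder's frame-level outputs against every monotonic alignment of the target, while the cross-entropy term scores the decoder's step-by-step predictions. In practice a framework's built-in CTC loss would replace the hand-written forward pass.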
Taxonomy
Topics: Hearing Impairment and Communication · Speech and Audio Processing · Tactile and Sensory Interactions
