Can DNNs Learn to Lipread Full Sentences?

George Sterpu; Christian Saam; Naomi Harte

arXiv:1805.11685 · eess.IV · May 31, 2018

TL;DR

This paper investigates deep neural network architectures for full-sentence lipreading on TCD-TIMIT. It reports a major improvement over a Hidden Markov Model framework and shows that the network is actually learning to lipread, not only learning the language model.

Contribution

It applies a sequence-to-sequence recurrent neural network to lipreading, reporting results for hand-crafted and 2D/3D CNN visual front-ends, online monotonic attention, and a joint Connectionist Temporal Classification (CTC) and sequence-to-sequence loss.

Findings

01. Major improvement over a Hidden Markov Model framework

02. Network learns to lipread beyond language modeling

03. Effective use of CNN front-ends and attention mechanisms

Abstract

Finding visual features and suitable models for lipreading tasks that are more complex than a well-constrained vocabulary has proven challenging. This paper explores state-of-the-art Deep Neural Network architectures for lipreading based on a Sequence to Sequence Recurrent Neural Network. We report results for both hand-crafted and 2D/3D Convolutional Neural Network visual front-ends, online monotonic attention, and a joint Connectionist Temporal Classification-Sequence-to-Sequence loss. The system is evaluated on the publicly available TCD-TIMIT dataset, with 59 speakers and a vocabulary of over 6000 words. Results show a major improvement on a Hidden Markov Model framework. A fuller analysis of performance across visemes demonstrates that the network is not only learning the language model, but actually learning to lipread.
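To make the "joint Connectionist Temporal Classification-Sequence-to-Sequence loss" mentioned in the abstract more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The abstract does not say how the two terms are combined, so the linear interpolation with a weight `lam`, the function names, and the toy probabilities are all assumptions made for illustration. The CTC term is computed with the standard forward algorithm in the probability domain, which is only suitable for very short toy sequences.

```python
import math

def ctc_neg_log_likelihood(probs, target, blank=0):
    """CTC loss via the forward algorithm (toy version, no log-space).

    probs:  list of T per-frame distributions over symbols.
    target: list of label indices, without blanks.
    """
    # Extended label sequence with blanks interleaved: b, l1, b, l2, ..., b
    ext = [blank]
    for label in target:
        ext += [label, blank]
    size = len(ext)

    # Initialise: a path may start on the first blank or the first label.
    alpha = [0.0] * size
    alpha[0] = probs[0][ext[0]]
    if size > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, len(probs)):
        new = [0.0] * size
        for s in range(size):
            total = alpha[s]
            if s >= 1:
                total += alpha[s - 1]
            # Skipping the blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][ext[s]]
        alpha = new

    # Valid paths end on the last label or the trailing blank.
    likelihood = alpha[-1] + (alpha[-2] if size > 1 else 0.0)
    return -math.log(likelihood)

def seq2seq_cross_entropy(step_probs, target):
    """Sum of negative log-probabilities of the reference token at each decoder step."""
    return -sum(math.log(p[y]) for p, y in zip(step_probs, target))

def joint_loss(ctc_probs, dec_probs, target, lam=0.5):
    """Assumed combination: lam * CTC + (1 - lam) * seq2seq cross-entropy."""
    ctc = ctc_neg_log_likelihood(ctc_probs, target)
    ce = seq2seq_cross_entropy(dec_probs, target)
    return lam * ctc + (1.0 - lam) * ce

# Toy example: two video frames, alphabet {0: blank, 1: "a"}, target "a".
ctc_probs = [[0.5, 0.5], [0.5, 0.5]]   # hypothetical encoder outputs per frame
dec_probs = [[0.5, 0.5]]               # hypothetical decoder output for one step
target = [1]
loss = joint_loss(ctc_probs, dec_probs, target, lam=0.5)
```

In this toy case three frame alignments ("a" then blank, blank then "a", and "a" twice) collapse to the target, each with probability 0.25, so the CTC term equals -log(0.75). In practice the weight between the two terms would be a tuned hyperparameter, and a framework's log-space CTC implementation would replace the toy forward pass.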

Taxonomy

Topics: Hearing Impairment and Communication · Speech and Audio Processing · Tactile and Sensory Interactions