LipNet: End-to-End Sentence-level Lipreading

Yannis M. Assael; Brendan Shillingford; Shimon Whiteson; Nando de Freitas

arXiv:1611.01599 · cs.LG · December 19, 2016 · 190 cites

TL;DR

LipNet is an end-to-end deep learning model that predicts entire sentences from lip movements, outperforming previous word-level models and human lipreaders on the GRID corpus.

Contribution

LipNet is the first end-to-end model for sentence-level lipreading that jointly learns visual features and a sequence model, using spatiotemporal convolutions, a recurrent network, and the connectionist temporal classification (CTC) loss.

Findings

01. Achieves 95.2% accuracy on the GRID corpus sentence-level task

02. Outperforms the previous word-level state of the art and human lipreaders

03. Demonstrates the effectiveness of end-to-end training for lipreading

Abstract

Lipreading is the task of decoding text from the movement of a speaker's mouth. Traditional approaches separated the problem into two stages: designing or learning visual features, and prediction. More recent deep lipreading approaches are end-to-end trainable (Wand et al., 2016; Chung & Zisserman, 2016a). However, existing work on models trained end-to-end performs only word classification, rather than sentence-level sequence prediction. Studies have shown that human lipreading performance increases for longer words (Easton & Basala, 1982), indicating the importance of features capturing temporal context in an ambiguous communication channel. Motivated by this observation, we present LipNet, a model that maps a variable-length sequence of video frames to text, making use of spatiotemporal convolutions, a recurrent network, and the connectionist temporal classification loss, trained…
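
The abstract names the connectionist temporal classification (CTC) loss as the piece that lets a variable-length frame sequence map to text. As a rough illustration of the CTC output convention only, and not of LipNet's actual decoder, the sketch below collapses one predicted label per frame into a string by merging consecutive repeats and then dropping blanks. The function name `ctc_greedy_decode`, the `_` blank symbol, and the example frame labels are all hypothetical choices made for this sketch.

```python
from itertools import groupby

BLANK = "_"  # hypothetical symbol standing in for the CTC blank token


def ctc_greedy_decode(frame_labels, blank=BLANK):
    """Collapse a per-frame label sequence into text using the CTC rule:
    merge consecutive repeated labels first, then remove blanks."""
    # groupby yields one key per run of identical consecutive labels
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != blank)


# Made-up per-frame labels (one character per video frame), not model output.
print(ctc_greedy_decode("__pp_lla_cee_"))
# A blank between the two runs of "l" keeps the doubled letter.
print(ctc_greedy_decode("hheel_llo"))
```

Because repeats are merged before blanks are removed, a blank between two identical labels is what allows a doubled letter to survive decoding.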

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis