TL;DR
This paper introduces a new deep learning model and a large dataset for open-world lip reading of natural sentences in videos, achieving state-of-the-art results and surpassing professional lip readers.
Contribution
The paper presents a novel 'Watch, Listen, Attend and Spell' (WLAS) model, a curriculum learning strategy, and a large 'Lip Reading Sentences' (LRS) dataset for unconstrained visual speech recognition.
Findings
The WLAS model, trained on the LRS dataset, surpasses all previous work on standard lip reading benchmark datasets, often by a significant margin.
The model surpasses professional lip readers on BBC videos.
Visual information improves speech recognition even when audio is available.
Abstract
The goal of this work is to recognise phrases and sentences being spoken by a talking face, with or without the audio. Unlike previous works that have focussed on recognising a limited number of words or phrases, we tackle lip reading as an open-world problem - unconstrained natural language sentences, and in the wild videos. Our key contributions are: (1) a 'Watch, Listen, Attend and Spell' (WLAS) network that learns to transcribe videos of mouth motion to characters; (2) a curriculum learning strategy to accelerate training and to reduce overfitting; (3) a 'Lip Reading Sentences' (LRS) dataset for visual speech recognition, consisting of over 100,000 natural sentences from British television. The WLAS model trained on the LRS dataset surpasses the performance of all previous work on standard lip reading benchmark datasets, often by a significant margin. This lip reading…
Videos
Lip Reading Sentences in the Wild (YouTube)
