Resolution limits on visual speech recognition
Helen L. Bear, Richard Harvey, Barry-John Theobald, and Yuxuan Lan

TL;DR
This study systematically investigates how video resolution impacts the accuracy of visual speech recognition, revealing that high resolution is not always necessary but a minimum pixel threshold is critical for reliable lip-reading.
Contribution
The paper introduces a new dataset, the Rosetta Raven data, and provides the first systematic analysis of how video resolution affects lip-reading accuracy, challenging the common assumption that high resolution is required.
Findings
Recognition accuracy remains acceptable at lower resolutions.
Reliable recognition is highly unlikely when the distance between the top of the upper lip and the bottom of the lower lip is less than four pixels at rest.
Video resolution is less critical than previously thought for automatic lip-reading.
Abstract
Visual-only speech recognition depends on a number of factors that can be difficult to control, such as lighting, identity, motion, emotion and expression. But some factors, such as video resolution, are controllable, so it is surprising that there is not yet a systematic study of the effect of resolution on lip-reading. Here we use a new data set, the Rosetta Raven data, to train and test recognizers so we can measure the effect of video resolution on recognition accuracy. We conclude that, contrary to common practice, resolution need not be that great for automatic lip-reading. However, it is highly unlikely that automatic lip-reading can work reliably when the distance between the bottom of the lower lip and the top of the upper lip is less than four pixels at rest.
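The four-pixel limit stated in the abstract can be turned into a quick check on whether a planned downsampling keeps the lips large enough. The sketch below is illustrative only and is not the authors' method: it assumes that lip height scales linearly with the downsampling factor, and the function names and the 40-pixel example height are hypothetical.

```python
# Threshold taken from the abstract: resting lip height below this is
# unlikely to support reliable automatic lip-reading.
MIN_LIP_HEIGHT_PX = 4.0


def lip_height_after_downsampling(original_height_px: float, factor: float) -> float:
    """Resting lip height in pixels after shrinking each frame dimension by `factor`.

    Assumes simple linear scaling (an assumption of this sketch).
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    return original_height_px / factor


def is_resolution_sufficient(
    original_height_px: float,
    factor: float,
    min_px: float = MIN_LIP_HEIGHT_PX,
) -> bool:
    """True if the downsampled resting lip height still meets the threshold."""
    return lip_height_after_downsampling(original_height_px, factor) >= min_px


def max_downsampling_factor(
    original_height_px: float,
    min_px: float = MIN_LIP_HEIGHT_PX,
) -> float:
    """Largest downsampling factor that keeps the lip height at or above `min_px`."""
    return original_height_px / min_px


if __name__ == "__main__":
    # Hypothetical speaker whose lips span 40 pixels at rest in the source video.
    height = 40.0
    for factor in (2, 8, 16):
        ok = is_resolution_sufficient(height, factor)
        print(f"factor {factor}: {height / factor:.1f} px, sufficient={ok}")
    print("max factor:", max_downsampling_factor(height))
```

The measurement that matters here is the resting lip height in pixels, not the frame size: the same frame resolution can be adequate for a close-up and inadequate for a distant speaker.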
Taxonomy
Topics: Speech and Audio Processing · Face recognition and analysis · Indoor and Outdoor Localization Technologies
