Some observations on computer lip-reading: moving from the dream to the reality

Helen L. Bear, Gari Owen, Richard Harvey, and Barry-John Theobald

arXiv:1710.01084 · cs.CV · April 26, 2018

TL;DR

This paper critically examines common assumptions in computer lip-reading. It finds that practical factors such as video resolution, pose, and lighting constrain performance less than commonly assumed, and it questions whether visemes are the best units for recognition.

Contribution

It challenges the prevailing reliance on visemes for lip-reading and suggests that alternative units may improve system performance.

Findings

01

Video resolution and lighting have limited impact on lip-reading accuracy.

02

Visemes may not be the optimal units for recognition in modern systems.

03

Practical factors such as pose constrain lip-reading performance less than is commonly assumed.

Abstract

In the quest for greater computer lip-reading performance there are a number of tacit assumptions which are either present in the datasets (high resolution for example) or in the methods (recognition of spoken visual units called visemes for example). Here we review these and other assumptions and show the surprising result that computer lip-reading is not heavily constrained by video resolution, pose, lighting and other practical factors. However, the working assumption that visemes, which are the visual equivalent of phonemes, are the best unit for recognition does need further examination. We conclude that visemes, which were defined over a century ago, are unlikely to be optimal for a modern computer lip-reading system.

Taxonomy

Topics: Speech and Audio Processing · Face recognition and analysis · Music and Audio Processing