Which phoneme-to-viseme maps best improve visual-only computer lip-reading?
Helen L. Bear, Richard W. Harvey, Barry-John Theobald, and Yuxuan Lan

TL;DR
This paper evaluates 120 phoneme-to-viseme mappings for visual-only lip-reading, asks whether any are stable across talkers, and introduces a method for deriving new maps from the phoneme confusions of an automated lip-reading system.
Contribution
It presents a novel approach for designing viseme maps from phoneme confusion data and demonstrates improved lip-reading performance for individual talkers.
Findings
Among the 120 mappings examined, some yield higher visual-only lip-reading accuracy than others.
New maps derived from phoneme confusions show improvements for individual talkers.
The method provides a systematic way to optimize viseme mappings for visual speech recognition.
Abstract
A critical assumption of all current visual speech recognition systems is that there are visual speech units called visemes which can be mapped to units of acoustic speech, the phonemes. Despite there being a number of published maps, their effectiveness is rarely tested, particularly on visual-only lip-reading (many works use audio-visual speech). Here we examine 120 mappings and consider whether any are stable across talkers. We show a method for devising maps based on phoneme confusions from an automated lip-reading system, and we present new mappings that show improvements for individual talkers.
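To make the idea of a confusion-derived map concrete, here is a minimal sketch in Python. It is not the authors' procedure, which the abstract does not specify. It assumes a simple rule: two phonemes share a viseme class when their total confusion count (in either direction) reaches a threshold, and classes are merged transitively. The function name `visemes_from_confusions`, the `threshold` parameter, and the toy counts are all hypothetical.

```python
from itertools import combinations


def visemes_from_confusions(phonemes, confusions, threshold=1):
    """Return a phoneme -> viseme-label dict (illustrative grouping rule only).

    `confusions` maps (spoken, recognised) phoneme pairs to counts.
    """
    parent = {p: p for p in phonemes}

    def find(p):
        # Union-find root lookup with path halving.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in combinations(phonemes, 2):
        # Total confusion between a and b, counting both directions.
        total = confusions.get((a, b), 0) + confusions.get((b, a), 0)
        if total >= threshold:
            parent[find(a)] = find(b)

    # Collect the merged classes and give them stable labels V1, V2, ...
    groups = {}
    for p in phonemes:
        groups.setdefault(find(p), []).append(p)
    classes = sorted(sorted(g) for g in groups.values())
    return {p: f"V{i}" for i, g in enumerate(classes, start=1) for p in g}


# Hypothetical toy confusion counts, not data from the paper.
toy_phonemes = ["p", "b", "m", "f", "v", "k"]
toy_confusions = {("p", "b"): 5, ("b", "m"): 3, ("f", "v"): 4, ("k", "p"): 1}
toy_map = visemes_from_confusions(toy_phonemes, toy_confusions, threshold=3)
```

With these toy counts, the bilabials `p`, `b`, `m` fall into one class, `f` and `v` into another, and `k` stays on its own because its single confusion with `p` is below the threshold. A real system would estimate the counts per talker from recogniser output, which is where talker-specific maps would come from.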
Taxonomy
Topics: Speech and Audio Processing · Face recognition and analysis · Music and Audio Processing
