Speaker-independent machine lip-reading with speaker-dependent viseme classifiers
Helen L. Bear, Stephen J. Cox, Richard W. Harvey

TL;DR
This paper investigates speaker-independent machine lip-reading using speaker-dependent viseme classifiers. It finds that speakers share a similar repertoire of mouth gestures but use those gestures differently, which affects lip-reading accuracy.
Contribution
The study introduces a phoneme-clustering method to form phoneme-to-viseme maps for individual and multiple speakers, advancing speaker-independent lip-reading techniques.
Findings
Speakers share similar mouth gestures but differ in their usage.
Speaker-dependent viseme classifiers improve lip-reading accuracy.
Visual speech is highly speaker-dependent, affecting model generalization.
Abstract
In machine lip-reading, which is the identification of speech from visual-only information, there is evidence that visual speech is highly dependent upon the speaker [1]. Here, we use a phoneme-clustering method to form new phoneme-to-viseme maps for both individual and multiple speakers. We use these maps to examine how similarly speakers talk visually. We conclude that, broadly speaking, speakers have the same repertoire of mouth gestures; where they differ is in the use of the gestures.
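To make the idea of a phoneme-to-viseme map concrete, here is a minimal Python sketch of one way such a map could be derived by clustering. It is an illustration under stated assumptions, not the authors' algorithm: the function name `cluster_phonemes`, the toy confusion counts, and the rule "merge two phonemes when each is confused with the other" are all hypothetical choices made for this example. Merging is transitive here (single linkage), which is a simplification.

```python
from itertools import combinations


def cluster_phonemes(phonemes, confusion, threshold=0):
    """Group phonemes into viseme classes (hypothetical sketch).

    confusion[(a, b)] is the number of times phoneme a was recognised as b.
    Two phonemes are merged when the confusion count exceeds `threshold`
    in both directions. Merging is transitive (single linkage).
    """
    parent = {p: p for p in phonemes}

    def find(p):
        # Union-find root lookup with path halving.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in combinations(phonemes, 2):
        mutual = (confusion.get((a, b), 0) > threshold
                  and confusion.get((b, a), 0) > threshold)
        if mutual:
            parent[find(a)] = find(b)

    # Collect the members of each cluster and label them V1, V2, ...
    groups = {}
    for p in phonemes:
        groups.setdefault(find(p), []).append(p)
    clusters = sorted(sorted(g) for g in groups.values())
    return {p: f"V{i + 1}" for i, g in enumerate(clusters) for p in g}


# Toy confusion counts for one imaginary speaker (invented for illustration).
phonemes = ["p", "b", "m", "f", "v", "t"]
confusion = {
    ("p", "b"): 5, ("b", "p"): 4,
    ("b", "m"): 3, ("m", "b"): 2,
    ("f", "v"): 6, ("v", "f"): 5,
    ("t", "p"): 1,  # one-directional, so it does not trigger a merge
}
mapping = cluster_phonemes(phonemes, confusion)
print(mapping)
```

Running the same function on confusion counts from a different speaker would yield a second map, and comparing the two maps is one simple way to see where speakers group phonemes differently.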
Taxonomy
Topics: Speech and Audio Processing · Face recognition and analysis · Multisensory perception and integration
