
TL;DR
This paper demonstrates that high-definition video is unnecessary for effective machine lipreading, introduces a new speaker-dependent phoneme-to-viseme mapping method, and improves lipreading accuracy by optimizing viseme sets.
Contribution
It proposes a novel speaker-dependent phoneme-to-viseme mapping approach and shows how optimizing viseme sets enhances lipreading accuracy across speakers.
Findings
HD video is not essential for lipreading accuracy
Optimal viseme set size varies by speaker, ranging from 11 to 35
Hierarchical training with optimized visemes significantly improves classification
Abstract
Machine lipreading (MLR) is speech recognition from visual cues and a niche research problem in speech processing and computer vision. Current challenges fall into two groups: the content of the video, such as the rate of speech; or the parameters of the video recording, e.g., video resolution. We show that HD video is not needed to successfully lipread with a computer. The term "viseme" is used in machine lipreading to represent a visual cue or gesture which corresponds to a subgroup of phonemes where the phonemes are visually indistinguishable. A phoneme is the smallest sound one can utter. Because there are more phonemes than visemes, maps between the units show a many-to-one relationship. Many maps have been presented; we compare these, and our results show Lee's is best. We propose a new method of speaker-dependent phoneme-to-viseme maps and compare these to Lee's. Our results show the…
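The many-to-one relationship between phonemes and visemes described in the abstract can be illustrated with a minimal Python sketch. The grouping below is a hypothetical toy map chosen for illustration; it is not Lee's map and not the speaker-dependent maps the paper proposes. The class names (`V_bilabial`, etc.) and the helpers `to_visemes` and `invert` are likewise assumptions.

```python
from collections import defaultdict

# Hypothetical toy phoneme-to-viseme map, for illustration only.
# It is NOT Lee's map or any map from the paper.
PHONEME_TO_VISEME = {
    "p": "V_bilabial", "b": "V_bilabial", "m": "V_bilabial",
    "f": "V_labiodental", "v": "V_labiodental",
    "t": "V_alveolar", "d": "V_alveolar",
    "ae": "V_open",
}


def to_visemes(phonemes):
    """Translate a phoneme sequence into its viseme sequence."""
    return [PHONEME_TO_VISEME[p] for p in phonemes]


def invert(mapping):
    """Group phonemes by viseme class to expose the many-to-one structure."""
    classes = defaultdict(list)
    for phoneme, viseme in mapping.items():
        classes[viseme].append(phoneme)
    return dict(classes)


# "pat", "bat", and "mat" differ acoustically but share one viseme sequence
# under this toy map, which is why visemes alone cannot separate them.
pat = to_visemes(["p", "ae", "t"])
bat = to_visemes(["b", "ae", "t"])
mat = to_visemes(["m", "ae", "t"])
print(pat == bat == mat)
print(invert(PHONEME_TO_VISEME))
```

Under this reading, a speaker-dependent map would simply be a different dictionary per speaker, and the number of distinct values in it would be that speaker's viseme set size.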
