Decoding visemes: improving machine lipreading

Helen L Bear

arXiv:1710.01288 · cs.CV · May 9, 2018

TL;DR

This paper demonstrates that high-definition video is unnecessary for effective machine lipreading, introduces a new speaker-dependent phoneme-to-viseme mapping method, and improves lipreading accuracy by optimizing viseme sets.

Contribution

It proposes a novel speaker-dependent phoneme-to-viseme mapping approach and shows how optimizing viseme sets enhances lipreading accuracy across speakers.

Findings

01

HD video is not essential for lipreading accuracy

02

Optimal viseme set size varies by speaker, ranging from 11 to 35

03

Hierarchical training with optimized visemes significantly improves classification

Abstract

Machine lipreading (MLR) is speech recognition from visual cues and a niche research problem in speech processing and computer vision. Current challenges fall into two groups: the content of the video, such as rate of speech, and the parameters of the video recording, e.g. video resolution. We show that HD video is not needed to successfully lipread with a computer. The term "viseme" is used in machine lipreading to represent a visual cue or gesture which corresponds to a subgroup of phonemes where the phonemes are visually indistinguishable. A phoneme is the smallest sound one can utter. Because several phonemes correspond to each viseme, maps between the units show a many-to-one relationship. Many maps have been presented; we compare these, and our results show Lee's is best. We propose a new method of speaker-dependent phoneme-to-viseme maps and compare these to Lee's. Our results show the…
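
The many-to-one relationship between phonemes and visemes described in the abstract can be pictured as a simple lookup table. The sketch below is illustrative only: the viseme labels, the phoneme groupings, and the helper names `invert_map` and `to_visemes` are assumptions made for this example. They are not Lee's map and not the speaker-dependent maps proposed in the paper.

```python
# Hypothetical viseme classes for illustration only. These groupings are
# assumptions, not Lee's map or the paper's speaker-dependent maps.
VISEME_TO_PHONEMES = {
    "V_bilabial": ["p", "b", "m"],
    "V_labiodental": ["f", "v"],
    "V_alveolar": ["t", "d", "s", "z"],
}


def invert_map(viseme_to_phonemes):
    """Build the many-to-one phoneme -> viseme lookup."""
    phoneme_to_viseme = {}
    for viseme, phonemes in viseme_to_phonemes.items():
        for phoneme in phonemes:
            # Each phoneme may belong to exactly one viseme class.
            if phoneme in phoneme_to_viseme:
                raise ValueError(f"phoneme {phoneme!r} assigned to two visemes")
            phoneme_to_viseme[phoneme] = viseme
    return phoneme_to_viseme


def to_visemes(phonemes, phoneme_to_viseme):
    """Translate a phoneme sequence into its viseme sequence."""
    return [phoneme_to_viseme[p] for p in phonemes]


PHONEME_TO_VISEME = invert_map(VISEME_TO_PHONEMES)

# Two different phoneme sequences collapse to the same viseme sequence,
# which is why visemes alone cannot always separate spoken units.
seq_a = to_visemes(["p", "s"], PHONEME_TO_VISEME)
seq_b = to_visemes(["b", "z"], PHONEME_TO_VISEME)
print(seq_a == seq_b)
```

Under these assumed groupings, nine phonemes map onto three visemes, so distinct phoneme sequences can become identical once translated. A coarser or finer grouping changes how much information is lost, which is one way to read the finding that the best viseme set size differs from speaker to speaker.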
