Alternative Visual Units for an Optimized Phoneme-Based Lipreading System
Helen Bear, Richard Harvey

TL;DR
This paper introduces a structured method for creating speaker-dependent visual units called visemes, explores their impact on lipreading accuracy, and proposes a two-pass training scheme that enhances phoneme recognition performance.
Contribution
The paper presents a structured approach for defining speaker-dependent viseme sets of a fixed size, investigates visual units intermediate between visemes and phonemes, and introduces a two-pass training scheme for improved lipreading accuracy.
Findings
Intermediate units outperform visemes and phonemes.
Two-pass training significantly improves recognition accuracy.
Viseme set size affects lipreading performance.
Abstract
Lipreading is understanding speech from observed lip movements. An observed series of lip motions is an ordered sequence of visual lip gestures. These gestures are commonly known, but as yet are not formally defined, as 'visemes'. In this article, we describe a structured approach which allows us to create speaker-dependent visemes with a fixed number of visemes within each set. We create sets of visemes for sizes two to 45. Each set of visemes is based upon clustering phonemes, thus each set has a unique phoneme-to-viseme mapping. We first present an experiment using these maps and the Resource Management Audio-Visual (RMAV) dataset which shows the effect of changing the viseme map size in speaker-dependent machine lipreading and demonstrate that word recognition with phoneme classifiers is possible. Furthermore, we show that there are intermediate units between visemes and phonemes…
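To make the idea of a fixed-size phoneme-to-viseme map concrete, here is a minimal Python sketch. The abstract says only that each viseme set comes from clustering phonemes, so the merge criterion used here (average pairwise confusion between phoneme classes) is an assumption for illustration and may differ from the authors' procedure. The function name `cluster_phonemes`, the `confusion` dictionary, and the toy counts are all hypothetical.

```python
from itertools import combinations


def cluster_phonemes(phonemes, confusion, k):
    """Greedily merge phoneme classes until exactly k viseme classes remain.

    confusion[(a, b)] is a hypothetical confusion count between phonemes
    a and b; missing pairs count as zero. This is an illustrative
    criterion, not necessarily the one used in the paper.
    """
    if not 1 <= k <= len(phonemes):
        raise ValueError("k must be between 1 and the number of phonemes")

    # Start with one singleton class per phoneme.
    clusters = [[p] for p in phonemes]

    def pair_score(a, b):
        # Treat confusion as symmetric by summing both directions.
        return confusion.get((a, b), 0) + confusion.get((b, a), 0)

    def linkage(c1, c2):
        # Average confusion over all cross-class phoneme pairs.
        total = sum(pair_score(a, b) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        # Pick the most mutually confused pair of classes (i < j always).
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]

    # Label classes V01, V02, ... and return the phoneme-to-viseme map.
    return {p: f"V{idx + 1:02d}" for idx, c in enumerate(clusters) for p in c}


# Toy data: six phonemes and made-up confusion counts.
toy_phonemes = ["p", "b", "m", "f", "v", "t"]
toy_confusion = {
    ("p", "b"): 9, ("p", "m"): 8, ("b", "m"): 7,
    ("f", "v"): 6, ("t", "f"): 1,
}
viseme_map = cluster_phonemes(toy_phonemes, toy_confusion, k=3)
```

Calling the function once per target size (for example, every k from two up to the number of phonemes) would yield one map per set size, in the spirit of the range of set sizes described above. With the toy data, the bilabials and the labiodentals end up in separate classes, and "t" stays on its own.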
