
TL;DR
This paper investigates how the choice of linguistic unit (visemes, phonemes, or words) affects the performance of visual speech language models in lipreading, taking visual co-articulation effects into account.
Contribution
It compares the effectiveness of visemes, phonemes, and words as units in visual speech language models to identify optimal choices for lipreading systems.
Findings
Visemes show limitations due to visual co-articulation effects.
Phonemes and words may offer better modeling options.
Trade-offs between model complexity and accuracy are discussed.
Abstract
Language models (LMs) are very powerful in lipreading systems. Language models built upon the ground-truth utterances of datasets learn the grammar and structure rules of words and sentences (the latter in the case of continuous speech). However, visual co-articulation effects in visual speech signals damage the performance of visual speech LMs because, visually, people do not utter what the language model expects. These models are commonplace, but while higher-order N-gram LMs may improve classification rates, the cost of such a model is disproportionate to the common goal of developing more accurate classifiers. We therefore compare which unit would best optimize a lipreading (visual speech) LM and observe the limitations of each. We compare three units: visemes (visual speech units) \cite{lan2010improving}, phonemes (audible speech units), and words.
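To make the comparison of units concrete, the following is a minimal sketch of how the same ground-truth utterances could be re-expressed as words, phonemes, or visemes before training a bigram LM. It is illustrative only and is not the paper's implementation: the toy pronunciation dictionary, the phoneme-to-viseme map (including the viseme class names), the helper names, and the add-one smoothing are all assumptions introduced here. The map is deliberately many-to-one, which is one reason a viseme-level LM has less information to work with than a phoneme-level one.

```python
from collections import Counter

# Hypothetical toy pronunciation dictionary (word -> phonemes); illustrative only.
PRONUNCIATIONS = {
    "the": ["dh", "ah"],
    "pat": ["p", "ae", "t"],
    "bat": ["b", "ae", "t"],
    "mat": ["m", "ae", "t"],
    "sat": ["s", "ae", "t"],
}

# Hypothetical many-to-one phoneme-to-viseme map; NOT the mapping used in the paper.
PHONEME_TO_VISEME = {
    "p": "V_bilabial", "b": "V_bilabial", "m": "V_bilabial",
    "t": "V_alveolar", "s": "V_alveolar",
    "dh": "V_dental",
    "ae": "V_open", "ah": "V_open",
}


def to_units(words, unit):
    """Re-express a word sequence as words, phonemes, or visemes."""
    if unit == "word":
        return list(words)
    phonemes = [p for w in words for p in PRONUNCIATIONS[w]]
    if unit == "phoneme":
        return phonemes
    if unit == "viseme":
        return [PHONEME_TO_VISEME[p] for p in phonemes]
    raise ValueError(f"unknown unit: {unit}")


def train_bigram(sequences):
    """Count contexts and bigrams over sequences padded with <s> and </s>."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for seq in sequences:
        padded = ["<s>"] + list(seq) + ["</s>"]
        vocab.update(padded)
        contexts.update(padded[:-1])               # every token that has a successor
        bigrams.update(zip(padded[:-1], padded[1:]))
    return contexts, bigrams, vocab


def bigram_prob(model, prev, cur):
    """Add-one (Laplace) smoothed estimate of P(cur | prev)."""
    contexts, bigrams, vocab = model
    return (bigrams[(prev, cur)] + 1) / (contexts[prev] + len(vocab))


# Toy "ground truth" utterances, trained once per unit type.
corpus = [["the", "bat", "sat"], ["the", "mat", "sat"], ["the", "pat", "sat"]]
models = {
    unit: train_bigram([to_units(s, unit) for s in corpus])
    for unit in ("word", "phoneme", "viseme")
}
```

In this toy setup, `to_units(["pat"], "viseme")` and `to_units(["bat"], "viseme")` produce the same sequence, whereas their phoneme sequences differ. A viseme-level LM therefore cannot separate these words on its own, while a word-level LM needs a much larger vocabulary to cover the same data.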
