Visual Speech Language Models

Helen L Bear

arXiv:1809.06800·eess.AS·September 19, 2018

TL;DR

This paper investigates how the choice of linguistic unit (viseme, phoneme, or word) affects the performance of visual speech language models in lipreading, taking visual co-articulation effects into account.

Contribution

It compares the effectiveness of visemes, phonemes, and words as units in visual speech language models to identify optimal choices for lipreading systems.

Findings

01

Visemes show limitations due to visual co-articulation effects.

02

Phonemes and words may offer better modeling options.

03

Trade-offs between model complexity and accuracy are discussed.

Abstract

Language models (LMs) are very powerful in lipreading systems. Language models built upon the ground-truth utterances of datasets learn the grammar and structure rules of words and sentences (the latter in the case of continuous speech). However, visual co-articulation effects in visual speech signals damage the performance of visual speech LMs because, visually, people do not utter what the language model expects. These models are commonplace, but while higher-order N-gram LMs may improve classification rates, the cost of such a model is disproportionate to the common goal of developing more accurate classifiers. We therefore compare which unit would best optimize a lipreading (visual speech) LM and observe the limitations of each. We compare three units: visemes (visual speech units) \cite{lan2010improving}, phonemes (audible speech units), and words.
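To make the three units concrete, the following is a minimal sketch, not the paper's method, of how one transcript can be expressed as words, phonemes, or visemes before N-gram counts are collected. The toy lexicon `PRONUNCIATIONS`, the mapping `PHONEME_TO_VISEME`, and the helper names are all hypothetical and chosen only for illustration. The sketch also shows why higher-order models become costly: the number of possible N-grams is bounded by the vocabulary size raised to the power N.

```python
from collections import Counter

# Hypothetical toy lexicon (word -> phonemes); illustrative, not from the paper.
PRONUNCIATIONS = {
    "bat": ["b", "ae", "t"],
    "pat": ["p", "ae", "t"],
    "mat": ["m", "ae", "t"],
    "sat": ["s", "ae", "t"],
}

# Hypothetical many-to-one phoneme-to-viseme map; real maps differ per study.
PHONEME_TO_VISEME = {
    "b": "V_bilabial", "p": "V_bilabial", "m": "V_bilabial",
    "s": "V_alveolar", "t": "V_alveolar",
    "ae": "V_open",
}


def to_units(words, unit):
    """Express a word transcript as a sequence of the chosen unit."""
    if unit == "word":
        return list(words)
    phonemes = [p for w in words for p in PRONUNCIATIONS[w]]
    if unit == "phoneme":
        return phonemes
    if unit == "viseme":
        return [PHONEME_TO_VISEME[p] for p in phonemes]
    raise ValueError(f"unknown unit: {unit}")


def ngram_counts(sequence, n):
    """Count N-grams in a sequence padded with start and end markers."""
    padded = ["<s>"] * (n - 1) + list(sequence) + ["</s>"]
    return Counter(tuple(padded[i:i + n]) for i in range(len(padded) - n + 1))


def max_ngrams(vocab_size, n):
    """Upper bound on the number of distinct N-grams over a vocabulary."""
    return vocab_size ** n


if __name__ == "__main__":
    transcript = ["bat", "sat", "pat", "mat"]
    for unit in ("word", "phoneme", "viseme"):
        seq = to_units(transcript, unit)
        vocab = set(seq)
        bigrams = ngram_counts(seq, 2)
        print(unit, len(vocab), len(bigrams), max_ngrams(len(vocab), 3))
```

In this toy setup "bat", "pat", and "mat" collapse to the same viseme sequence, so a viseme-level model has a small vocabulary but cannot separate words that look alike on the lips, while a word-level model keeps them distinct at the price of a larger vocabulary.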
