Towards Lipreading Sentences with Active Appearance Models
George Sterpu, Naomi Harte

TL;DR
This paper compares DCT and AAM lipreading features on the TCD-TIMIT dataset in an HMM framework. DCT outperforms AAM by more than 6% in viseme recognition, but DCT accuracy is still only 32-34%, suggesting that visual feature modeling may need a fundamental rethink.
Contribution
It compares DCT and AAM features for viseme recognition on TCD-TIMIT, a database designed for large vocabulary continuous audio-visual speech recognition, and highlights the limitations of these widely used visual features.
Findings
DCT outperforms AAM by more than 6% in a viseme recognition task with 56 speakers.
Viseme recognition accuracy with DCT features is only 32-34%.
A fundamental rethink of visual feature modeling may be needed.
Abstract
Automatic lipreading has major potential impact for speech recognition, supplementing and complementing the acoustic modality. Most attempts at lipreading have been performed on small vocabulary tasks, due to a shortfall of appropriate audio-visual datasets. In this work we use the publicly available TCD-TIMIT database, designed for large vocabulary continuous audio-visual speech recognition. We compare the viseme recognition performance of the most widely used features for lipreading, Discrete Cosine Transform (DCT) and Active Appearance Models (AAM), in a traditional Hidden Markov Model (HMM) framework. We also exploit recent advances in AAM fitting. We found the DCT to outperform AAM by more than 6% for a viseme recognition task with 56 speakers. The overall accuracy of the DCT is quite low (32-34%). We conclude that a fundamental rethink of the modelling of visual features may be…
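As a rough illustration of what a DCT-based visual feature looks like, the sketch below applies a 2-D DCT to a grayscale mouth region and keeps a small block of low-frequency coefficients as the feature vector. It is an assumption-laden example, not the authors' pipeline: the function names, the region size, and the choice of a square top-left coefficient block are all hypothetical, and the abstract does not state how many coefficients the paper keeps or how they are selected.

```python
import math

def dct_1d(values):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(values)
    out = []
    for k in range(n):
        total = sum(v * math.cos(math.pi * (i + 0.5) * k / n)
                    for i, v in enumerate(values))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * total)
    return out

def dct_2d(image):
    """Separable 2-D DCT-II: transform each row, then each column."""
    rows = [dct_1d(row) for row in image]
    cols = [dct_1d(list(col)) for col in zip(*rows)]
    # cols is indexed [column][row]; transpose back to [row][column].
    return [list(r) for r in zip(*cols)]

def low_frequency_features(image, block=3):
    """Keep the top-left block x block coefficients (hypothetical selection rule)."""
    coeffs = dct_2d(image)
    return [coeffs[u][v] for u in range(block) for v in range(block)]

# Hypothetical 4x6 grayscale mouth region (pixel intensities are made up).
roi = [[float(10 * r + c) for c in range(6)] for r in range(4)]
features = low_frequency_features(roi, block=2)
```

In a recognizer of the kind the abstract describes, one such vector per video frame would form the observation sequence passed to the HMMs. The pure-Python transform is written for readability; a real system would use an optimized DCT routine.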
