TL;DR
This paper introduces a deep learning architecture that extracts word embeddings from visual speech data. The embeddings capture the mouth-region information needed for word recognition while suppressing variability from speaker, pose, and illumination, and they support low-shot recognition of words unseen during training.
Contribution
The paper presents a novel deep architecture for visual speech recognition that produces effective word embeddings, and it evaluates them in low-shot learning experiments on words unseen during training.
Findings
Achieves an 11.92% error rate on closed-set recognition over a 500-word vocabulary
Embeddings, modelled with PLDA, are used for low-shot learning of words unseen during training
Surpasses the previous state of the art on closed-set word identification
Abstract
In this paper we present a deep learning architecture for extracting word embeddings for visual speech recognition. The embeddings summarize the information of the mouth region that is relevant to the problem of word recognition, while suppressing other types of variability such as speaker, pose and illumination. The system comprises a spatiotemporal convolutional layer, a Residual Network and bidirectional LSTMs and is trained on the Lipreading in-the-wild database. We first show that the proposed architecture goes beyond the state of the art on closed-set word identification, attaining an 11.92% error rate on a vocabulary of 500 words. We then examine the capacity of the embeddings in modelling words unseen during training. We deploy Probabilistic Linear Discriminant Analysis (PLDA) to model the embeddings and perform low-shot learning experiments on words unseen during training.…
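To make the low-shot setting concrete, here is a minimal sketch of how a few example embeddings per unseen word could be used to recognize new clips. It is not the paper's method: the paper models the embeddings with PLDA, whereas this sketch swaps in a much simpler nearest-prototype rule with cosine scoring. The function names, the toy 3-dimensional vectors, and the words "about" and "zebra" are all hypothetical, chosen only for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def enroll(examples_by_word):
    """Build one prototype per word by averaging its few normalized embeddings.

    This is a simplified stand-in for PLDA modelling, not PLDA itself.
    """
    prototypes = {}
    for word, examples in examples_by_word.items():
        normed = [l2_normalize(e) for e in examples]
        dim = len(normed[0])
        mean = [sum(e[i] for e in normed) / len(normed) for i in range(dim)]
        prototypes[word] = l2_normalize(mean)
    return prototypes


def classify(embedding, prototypes):
    """Return the word whose prototype has the highest cosine similarity."""
    query = l2_normalize(embedding)
    scores = {
        word: sum(q * p for q, p in zip(query, proto))
        for word, proto in prototypes.items()
    }
    best_word = max(scores, key=scores.get)
    return best_word, scores


# Hypothetical toy data: two words unseen in training, two examples each.
support = {
    "about": [[1.0, 0.1, 0.0], [0.9, 0.0, 0.1]],
    "zebra": [[0.0, 1.0, 0.1], [0.1, 0.9, 0.0]],
}
prototypes = enroll(support)
predicted, scores = classify([0.8, 0.2, 0.1], prototypes)
print(predicted)
```

In a real pipeline the vectors would come from the trained network rather than being written by hand, and a probabilistic model such as PLDA would replace the cosine score so that within-word and between-word variability are modelled explicitly.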
