End-to-End Multi-View Lipreading
Stavros Petridis, Yujiang Wang, Zuwei Li, Maja Pantic

TL;DR
This paper introduces an end-to-end multi-view lipreading system based on bidirectional LSTM (BLSTM) networks. It learns directly from pixel data and combines multiple views of the mouth to improve visual speech recognition accuracy.
Contribution
To the best of the authors' knowledge, it presents the first model that jointly learns feature extraction and visual speech classification from multiple views in an end-to-end manner while also achieving state-of-the-art results.
Findings
Achieves 96.9% classification accuracy on the OuluVS2 database.
Combining views improves accuracy by 3% to 3.8% (absolute) over the frontal view alone.
Outperforms previous multi-view lipreading methods.
Abstract
Non-frontal lip views contain useful information which can be used to enhance the performance of frontal view lipreading. However, the vast majority of recent lipreading works, including the deep learning approaches which significantly outperform traditional approaches, have focused on frontal mouth images. As a consequence, research on joint learning of visual features and speech classification from multiple views is limited. In this work, we present an end-to-end multi-view lipreading system based on Bidirectional Long Short-Term Memory (BLSTM) networks. To the best of our knowledge, this is the first model which simultaneously learns to extract features directly from the pixels and performs visual speech classification from multiple views and also achieves state-of-the-art performance. The model consists of multiple identical streams, one for each view, which extract features directly…
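The multi-stream idea described above can be illustrated with a minimal NumPy sketch of a forward pass. It is not the authors' implementation. The abstract is truncated here, so several details are assumptions made purely for illustration: the single dense encoder layer per stream, the second BLSTM used to fuse the concatenated stream outputs, the averaging over time before the softmax, and every layer size. All function and parameter names (`init_model`, `forward`, `feat`, `hid`, and so on) are hypothetical, and the weights are random and untrained.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_lstm(rng, in_dim, hid):
    # One weight matrix covering the four gates (input, forget, cell, output).
    return {"W": rng.normal(0, 0.1, (in_dim + hid, 4 * hid)),
            "b": np.zeros(4 * hid)}

def lstm(params, xs, reverse=False):
    # xs: (T, in_dim) -> hidden states of shape (T, hid), in original time order.
    hid = params["b"].size // 4
    h, c = np.zeros(hid), np.zeros(hid)
    outs = []
    for x in (xs[::-1] if reverse else xs):
        z = np.concatenate([x, h]) @ params["W"] + params["b"]
        i, f, g, o = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        outs.append(h)
    out = np.stack(outs)
    return out[::-1] if reverse else out

def init_blstm(rng, in_dim, hid):
    return {"fwd": init_lstm(rng, in_dim, hid), "bwd": init_lstm(rng, in_dim, hid)}

def blstm(params, xs):
    # Concatenate forward and backward hidden states: (T, 2 * hid).
    return np.concatenate([lstm(params["fwd"], xs),
                           lstm(params["bwd"], xs, reverse=True)], axis=1)

def init_model(rng, n_views, n_pixels, feat, hid, n_classes):
    # Identical stream architecture per view, each with its own weights.
    streams = [{"enc_W": rng.normal(0, 0.01, (n_pixels, feat)),
                "enc_b": np.zeros(feat),
                "blstm": init_blstm(rng, feat, hid)} for _ in range(n_views)]
    return {"streams": streams,
            "fusion": init_blstm(rng, 2 * hid * n_views, hid),  # assumed fusion step
            "out_W": rng.normal(0, 0.1, (2 * hid, n_classes)),
            "out_b": np.zeros(n_classes)}

def forward(model, views):
    # views: one (T, H, W) array of mouth frames per view, time-aligned.
    per_view = []
    for stream, frames in zip(model["streams"], views):
        pixels = frames.reshape(frames.shape[0], -1)        # raw pixels, no hand-crafted features
        feats = np.tanh(pixels @ stream["enc_W"] + stream["enc_b"])
        per_view.append(blstm(stream["blstm"], feats))      # temporal modelling per view
    fused = blstm(model["fusion"], np.concatenate(per_view, axis=1))
    logits = fused.mean(axis=0) @ model["out_W"] + model["out_b"]  # assumed time pooling
    e = np.exp(logits - logits.max())
    return e / e.sum()                                      # class probabilities

rng = np.random.default_rng(0)
model = init_model(rng, n_views=3, n_pixels=16 * 16, feat=32, hid=8, n_classes=10)
views = [rng.random((12, 16, 16)) for _ in range(3)]        # 3 views, 12 frames each
probs = forward(model, views)
```

Because every stream shares one architecture, adding or removing a view only changes the length of `views` and the input width of the fusion layer. An end-to-end system would train all of these weights jointly with a classification loss, which this sketch omits.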
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Face recognition and analysis
