Learning Spatio-Temporal Features with Two-Stream Deep 3D CNNs for Lipreading
Xinshuo Weng, Kris Kitani

TL;DR
This paper improves word-level lipreading accuracy on the LRW dataset by replacing the shallow 3D CNN + deep 2D CNN front-end with a deep two-stream I3D network that takes grayscale video and optical flow as inputs and is pre-trained on large datasets.
Contribution
The paper introduces a two-stream deep 3D CNN (I3D) front-end for lipreading and shows that, with pre-training and two-stream inputs, it is more accurate than the prior shallow 3D CNN + deep 2D CNN front-end.
Findings
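A pre-trained deep 3D CNN front-end outperforms the shallow 3D CNN + deep 2D CNN front-end.
Optical flow input alone performs comparably to grayscale video input.
Combining the grayscale and optical flow streams further improves accuracy.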
Abstract
We focus on word-level visual lipreading, which requires recognizing the word being spoken, given only the video but not the audio. State-of-the-art methods explore the use of end-to-end neural networks, including a shallow (up to three layers) 3D convolutional neural network (CNN) + a deep 2D CNN (e.g., ResNet) as the front-end to extract visual features, and a recurrent neural network (e.g., bidirectional LSTM) as the back-end for classification. In this work, we propose to replace the shallow 3D CNNs + deep 2D CNNs front-end with a recently successful deep 3D CNN: two-stream (i.e., grayscale video and optical flow streams) I3D. We evaluate different combinations of front-end and back-end modules with the grayscale video and optical flow inputs on the LRW dataset. The experiments show that, compared to the shallow 3D CNNs + deep 2D CNNs front-end, the deep 3D CNNs front-end with…
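The two-stream idea in the abstract can be illustrated with a minimal sketch. It assumes each stream (grayscale video and optical flow) has already produced one score per candidate word, and it combines them by a weighted average of softmax probabilities. This late-fusion rule, the function names, the `gray_weight` parameter, and the three-word toy vocabulary are all assumptions made for illustration; the summary above does not say how the paper merges its streams.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_two_streams(gray_logits, flow_logits, gray_weight=0.5):
    """Weighted average of per-stream word probabilities.

    Assumption: late fusion of softmax outputs. This is one common way to
    combine two streams, not necessarily the method used in the paper.
    """
    if len(gray_logits) != len(flow_logits):
        raise ValueError("both streams must score the same set of words")
    p_gray = softmax(gray_logits)
    p_flow = softmax(flow_logits)
    return [
        gray_weight * g + (1.0 - gray_weight) * f
        for g, f in zip(p_gray, p_flow)
    ]


def predict_word(gray_logits, flow_logits, vocabulary, gray_weight=0.5):
    """Return the most probable word and its fused probability."""
    probs = fuse_two_streams(gray_logits, flow_logits, gray_weight)
    best = max(range(len(probs)), key=probs.__getitem__)
    return vocabulary[best], probs[best]


# Hypothetical toy example: three candidate words and made-up stream scores.
toy_vocabulary = ["ABOUT", "ABSOLUTELY", "ABUSE"]
gray_scores = [2.0, 1.0, 0.1]   # grayscale stream favours "ABOUT"
flow_scores = [0.5, 2.5, 0.2]   # optical flow stream favours "ABSOLUTELY"

word, confidence = predict_word(gray_scores, flow_scores, toy_vocabulary)
```

In this toy case the two streams disagree, and the fused prediction follows the stream that is more confident. Setting `gray_weight` to 1.0 or 0.0 reduces the sketch to a single-stream classifier, which is a convenient way to compare the single-stream and two-stream settings listed under Findings.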
Taxonomy
Topics: Video Surveillance and Tracking Methods · Speech and Audio Processing · Human Pose and Action Recognition
