Learning Spatio-Temporal Features with Two-Stream Deep 3D CNNs for Lipreading

Xinshuo Weng; Kris Kitani

arXiv:1905.02540 · cs.CV · July 22, 2019 · 45 cites

Open Access

TL;DR

This paper improves word-level lipreading accuracy by replacing the shallow 3D CNN + deep 2D CNN front-end with a deep two-stream I3D network that takes grayscale video and optical flow as inputs and is pre-trained on large datasets.

Contribution

The paper introduces a two-stream deep 3D CNN (I3D) front-end for lipreading and shows that, with pre-training and two input streams, it improves accuracy over the prior shallow 3D CNN + deep 2D CNN front-end.

Findings

01. Deep 3D CNN front-ends outperform the shallow 3D CNN + deep 2D CNN front-end.

02. Optical flow input alone performs comparably to grayscale video input.

03. Combining the two streams further improves accuracy.

Abstract

We focus on word-level visual lipreading, which requires recognizing the word being spoken given only the video and not the audio. State-of-the-art methods explore the use of end-to-end neural networks, including a shallow (up to three layers) 3D convolutional neural network (CNN) + a deep 2D CNN (e.g., ResNet) as the front-end to extract visual features, and a recurrent neural network (e.g., bidirectional LSTM) as the back-end for classification. In this work, we propose to replace the shallow 3D CNNs + deep 2D CNNs front-end with a recently successful deep 3D CNN: two-stream (i.e., grayscale video and optical flow streams) I3D. We evaluate different combinations of front-end and back-end modules with the grayscale video and optical flow inputs on the LRW dataset. The experiments show that, compared to the shallow 3D CNNs + deep 2D CNNs front-end, the deep 3D CNNs front-end with…
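To make the two-stream idea concrete, here is a minimal sketch of how per-word scores from a grayscale stream and an optical flow stream might be combined into one prediction. The abstract excerpt above does not say how the two streams are merged, so this sketch assumes simple late fusion by a weighted average of class probabilities, which is one common choice for two-stream networks and not necessarily the authors' method. The function names, the `gray_weight` parameter, the toy logits, and the placeholder vocabulary are all hypothetical.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_two_stream(gray_logits, flow_logits, gray_weight=0.5):
    """Assumed late fusion: weighted average of per-stream class probabilities."""
    if len(gray_logits) != len(flow_logits):
        raise ValueError("both streams must score the same set of words")
    p_gray = softmax(gray_logits)
    p_flow = softmax(flow_logits)
    return [
        gray_weight * g + (1.0 - gray_weight) * f
        for g, f in zip(p_gray, p_flow)
    ]


def predict_word(gray_logits, flow_logits, vocabulary, gray_weight=0.5):
    """Return the vocabulary entry with the highest fused probability."""
    probs = fuse_two_stream(gray_logits, flow_logits, gray_weight)
    best = max(range(len(probs)), key=probs.__getitem__)
    return vocabulary[best]


# Toy inputs (hypothetical): three candidate words, one score per stream.
vocabulary = ["WORD_A", "WORD_B", "WORD_C"]
gray_logits = [2.0, 1.5, 0.1]  # grayscale stream slightly prefers WORD_A
flow_logits = [0.3, 1.9, 0.2]  # optical flow stream strongly prefers WORD_B

print(predict_word(gray_logits, flow_logits, vocabulary))  # WORD_B
```

With these toy numbers the grayscale stream alone would pick WORD_A, but the flow stream's stronger evidence tips the fused prediction to WORD_B, which illustrates why a second stream can change the outcome. Setting `gray_weight=1.0` reduces the sketch to a single-stream classifier.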

Taxonomy

Topics: Video Surveillance and Tracking Methods · Speech and Audio Processing · Human Pose and Action Recognition