Lip-reading with Densely Connected Temporal Convolutional Networks
Pingchuan Ma, Yujiang Wang, Jie Shen, Stavros Petridis, Maja Pantic

TL;DR
This paper introduces DC-TCN, a densely connected temporal convolutional network with Squeeze-and-Excitation attention for lip-reading of isolated words, achieving state-of-the-art accuracy on the LRW and LRW-1000 datasets.
Contribution
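The paper adds dense connections to a temporal convolutional network so that it captures more robust temporal features, and uses Squeeze-and-Excitation blocks as a light-weight attention mechanism to improve classification. The resulting model surpasses all baseline methods on LRW and LRW-1000.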
Findings
Achieved 88.36% accuracy on the Lip Reading in the Wild (LRW) dataset.
Achieved 43.65% accuracy on the LRW-1000 dataset.
Surpassed all baseline methods, setting a new state of the art on both datasets.
Abstract
In this work, we present the Densely Connected Temporal Convolutional Network (DC-TCN) for lip-reading of isolated words. Although Temporal Convolutional Networks (TCNs) have recently demonstrated great potential in many vision tasks, their receptive fields are not dense enough to model the complex temporal dynamics in lip-reading scenarios. To address this problem, we introduce dense connections into the network to capture more robust temporal features. Moreover, our approach utilises the Squeeze-and-Excitation block, a light-weight attention mechanism, to further enhance the model's classification power. Without bells and whistles, our DC-TCN method achieves 88.36% accuracy on the Lip Reading in the Wild (LRW) dataset and 43.65% on the LRW-1000 dataset, surpassing all the baseline methods and setting a new state of the art on both datasets.
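The two ingredients named in the abstract, dense connections over dilated temporal convolutions and Squeeze-and-Excitation (SE) channel attention, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names (`dilated_conv1d`, `squeeze_excite`, `dense_tc_block`), the random weights, the kernel size, the dilation schedule, the reduction ratio, and the placement of SE after each convolution are all assumptions chosen for illustration, and normalisation, dropout, and training are omitted.

```python
import math
import random


def random_weights(rng, *shape):
    """Nested lists of small random weights with the given shape (illustrative only)."""
    if len(shape) == 1:
        return [rng.uniform(-0.5, 0.5) for _ in range(shape[0])]
    return [random_weights(rng, *shape[1:]) for _ in range(shape[0])]


def dilated_conv1d(x, weights, dilation):
    """Zero-padded, length-preserving dilated 1D convolution followed by ReLU.

    x:       list of C_in channels, each a list of T floats
    weights: nested list of shape [C_out][C_in][K], with K odd
    """
    T = len(x[0])
    K = len(weights[0][0])
    pad = dilation * (K - 1) // 2
    out = []
    for w_out in weights:
        row = []
        for t in range(T):
            acc = 0.0
            for c, w_in in enumerate(w_out):
                for k, w in enumerate(w_in):
                    idx = t + k * dilation - pad
                    if 0 <= idx < T:  # positions outside the sequence count as zero
                        acc += w * x[c][idx]
            row.append(max(acc, 0.0))
        out.append(row)
    return out


def squeeze_excite(x, w1, w2):
    """SE attention: pool over time, pass through a two-layer gate, rescale channels."""
    z = [sum(ch) / len(ch) for ch in x]  # squeeze: one scalar per channel
    h = [max(sum(w * v for w, v in zip(row, z)), 0.0) for row in w1]  # FC + ReLU
    s = [1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(row, h)))) for row in w2]  # FC + sigmoid
    return [[s_c * v for v in ch] for s_c, ch in zip(s, x)]


def dense_tc_block(x, growth_rate, dilations, kernel_size=3, reduction=2, seed=0):
    """Each layer reads the concatenation of the input and all earlier layer outputs."""
    rng = random.Random(seed)
    hidden = max(growth_rate // reduction, 1)
    features = list(x)
    for d in dilations:
        w = random_weights(rng, growth_rate, len(features), kernel_size)
        new = dilated_conv1d(features, w, d)
        new = squeeze_excite(
            new,
            random_weights(rng, hidden, growth_rate),
            random_weights(rng, growth_rate, hidden),
        )
        features = features + new  # dense connection: concatenate along channels
    return features


# Toy input: 4 channels, 8 time steps.
x = [[math.sin(0.5 * t + c) for t in range(8)] for c in range(4)]
y = dense_tc_block(x, growth_rate=2, dilations=[1, 2, 4])
print(len(y), len(y[0]))  # 10 8
```

With a kernel size of 3 and dilations 1, 2, 4, the concatenated output in this sketch mixes features that see 1, 3, 7, and 15 time steps, which is one way to picture the "denser" receptive-field coverage the abstract describes. A purely sequential stack would expose only the final 15-step features to the classifier.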
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Face recognition and analysis
Methods: Dense Connections
