Training Strategies for Improved Lip-reading

Pingchuan Ma; Yujiang Wang; Stavros Petridis; Jie Shen; Maja Pantic

arXiv:2209.01383·cs.CV·September 30, 2022

TL;DR

This paper systematically evaluates various training strategies and models for isolated word lip-reading, demonstrating that combining optimal data augmentation, temporal models, and training techniques significantly improves accuracy on the LRW dataset.

Contribution

It provides a comprehensive analysis of how different training strategies and models affect performance, and identifies the combination that yields the highest lip-reading accuracy.

Findings

01

Time Masking is the most impactful augmentation.

02

Densely-Connected Temporal Convolutional Networks outperform other temporal models.

03

Combining all strategies achieves 93.4% accuracy, further improved to 94.1% with pre-training.
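To make Finding 01 concrete, here is a minimal sketch of Time Masking on a single clip. It assumes that one run of consecutive frames is replaced by the clip's mean frame, with the run length drawn uniformly between 0 and a cap. That replacement rule, the cap `max_len`, and the function name `time_mask` are illustrative assumptions, not details stated on this page.

```python
import random
from typing import List, Optional

Frame = List[float]  # one flattened mouth-ROI frame (illustrative representation)


def time_mask(frames: List[Frame], max_len: int,
              rng: Optional[random.Random] = None) -> List[Frame]:
    """Return a copy of `frames` with one run of consecutive frames masked.

    Assumption for this sketch: masked frames are replaced by the mean frame
    of the clip, and the run length is uniform in [0, max_len].
    """
    rng = rng or random.Random()
    t = len(frames)
    out = [list(f) for f in frames]  # copy so the input clip is never mutated
    # Sample the mask length, clipped so it never exceeds the clip length.
    n = min(rng.randint(0, max_len), t)
    if n == 0:
        return out
    # Choose where the masked run starts so that it fits inside the clip.
    start = rng.randint(0, t - n)
    dim = len(frames[0])
    mean_frame = [sum(f[d] for f in frames) / t for d in range(dim)]
    for i in range(start, start + n):
        out[i] = list(mean_frame)
    return out
```

For example, `time_mask(clip, max_len=10)` hides up to 10 consecutive frames of a clip, which forces the model to rely on temporal context rather than on any single stretch of frames.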

Abstract

Several training strategies and temporal models have recently been proposed for isolated word lip-reading in a series of independent works. However, the potential of combining the best strategies, and the impact of each of them, has not been explored. In this paper, we systematically investigate the performance of state-of-the-art data augmentation approaches, temporal models and other training strategies, such as self-distillation and word boundary indicators. Our results show that Time Masking (TM) is the most important augmentation, followed by mixup, and that Densely-Connected Temporal Convolutional Networks (DC-TCN) are the best temporal model for lip-reading of isolated words. Using self-distillation and word boundary indicators is also beneficial, but to a lesser extent. A combination of all the above methods results in a classification accuracy of 93.4%, which is an…
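The abstract names word boundary indicators without describing their form. The sketch below shows one plausible construction under stated assumptions: each LRW clip has its target word centred in the clip (29 frames at 25 fps), and the word duration is known. It builds a per-frame binary flag and appends it as an extra channel to per-frame features. Centring, the midpoint rule, and concatenation before the temporal model are assumptions, and `boundary_indicator` / `append_boundary` are hypothetical helpers.

```python
from typing import List


def boundary_indicator(num_frames: int, duration_s: float,
                       fps: float = 25.0) -> List[float]:
    """Per-frame 1.0/0.0 flags for a target word assumed centred in the clip."""
    center = num_frames / 2.0            # clip centre, in frame units
    half_span = duration_s * fps / 2.0   # half the word length, in frames
    start, end = center - half_span, center + half_span
    # A frame counts as "inside" the word if its midpoint lies in [start, end).
    return [1.0 if start <= i + 0.5 < end else 0.0 for i in range(num_frames)]


def append_boundary(features: List[List[float]],
                    indicator: List[float]) -> List[List[float]]:
    """Concatenate the indicator as one extra channel on each frame's features."""
    if len(features) != len(indicator):
        raise ValueError("need exactly one indicator value per frame")
    return [feat + [flag] for feat, flag in zip(features, indicator)]
```

With these assumptions, a 0.4-second word in a 29-frame clip marks frames 9 to 18 (10 frames) as inside the word. Words longer than the clip mark every frame.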

Code & Models

Repositories

mpc001/Lipreading_using_Temporal_Convolutional_Networks
PyTorch · Official

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Subtitles and Audiovisual Media

Methods: Mixup
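Mixup, the second most important augmentation in the abstract, is sketched below in its standard form. A weight λ is drawn from a Beta(α, α) distribution, and both the inputs and the one-hot labels of two training examples are blended with that same λ. Treating clips as equal-length lists of frames fits LRW's fixed-length clips. The value `alpha=0.4` and the helper names are hypothetical choices for illustration, not settings taken from the paper.

```python
import random
from typing import List, Optional, Tuple

Frame = List[float]  # one flattened mouth-ROI frame (illustrative representation)


def one_hot(label: int, num_classes: int) -> List[float]:
    vec = [0.0] * num_classes
    vec[label] = 1.0
    return vec


def mixup_pair(x1: List[Frame], y1: int, x2: List[Frame], y2: int,
               num_classes: int, alpha: float = 0.4,  # hypothetical value
               rng: Optional[random.Random] = None
               ) -> Tuple[List[Frame], List[float], float]:
    """Blend two equal-length clips and their labels with a single weight lam."""
    if len(x1) != len(x2):
        raise ValueError("this sketch assumes clips with the same number of frames")
    rng = rng or random.Random()
    lam = rng.betavariate(alpha, alpha)
    # Frame-wise, element-wise convex combination of the two input clips.
    x = [[lam * a + (1.0 - lam) * b for a, b in zip(f1, f2)]
         for f1, f2 in zip(x1, x2)]
    # The same convex combination of the one-hot targets gives a soft label.
    y = [lam * p + (1.0 - lam) * q
         for p, q in zip(one_hot(y1, num_classes), one_hot(y2, num_classes))]
    return x, y, lam
```

Cross-entropy is linear in the target distribution. Training on the soft label `y` is therefore equivalent to weighting the two original cross-entropy losses by λ and 1 − λ.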