Visual Speech Recognition for Multiple Languages in the Wild

Pingchuan Ma; Stavros Petridis; Maja Pantic

arXiv:2202.13084 · cs.CV · November 1, 2022

TL;DR

This paper introduces a visual speech recognition (VSR) model that combines prediction-based auxiliary tasks, hyperparameter optimization, and appropriate data augmentation. It reports state-of-the-art results across multiple languages and outperforms models trained on much larger, non-public datasets.

Contribution

It demonstrates that improving model design can matter as much as enlarging the training set in visual speech recognition, and it highlights the role of prediction-based auxiliary tasks, hyperparameter optimization, and data augmentation.

Findings

01. The proposed model outperforms all previous VSR methods trained on publicly available datasets.

02. It also surpasses models trained on much larger, non-public datasets.

03. Additional multilingual and automatically transcribed training data further improve performance.

Abstract

Visual speech recognition (VSR) aims to recognize the content of speech based on lip movements, without relying on the audio stream. Advances in deep learning and the availability of large audio-visual datasets have led to the development of much more accurate and robust VSR models than ever before. However, these advances are usually due to the larger training sets rather than the model design. Here we demonstrate that designing better models is equally as important as using larger training sets. We propose the addition of prediction-based auxiliary tasks to a VSR model, and highlight the importance of hyperparameter optimization and appropriate data augmentations. We show that such a model works for different languages and outperforms all previous methods trained on publicly available datasets by a large margin. It even outperforms models that were trained on non-publicly available…

Taxonomy

Topics: Speech and Audio Processing