Can We Read Speech Beyond the Lips? Rethinking RoI Selection for Deep Visual Speech Recognition

Yuanhang Zhang; Shuang Yang; Jingyun Xiao; Shiguang Shan; Xilin Chen

arXiv:2003.03206 · cs.CV · March 10, 2020 · 6 cites

TL;DR

This paper investigates whether visual speech recognition (VSR) models benefit from facial regions beyond the lips, such as the whole face, the upper face, or the cheeks, and introduces a simple Cutout-based training method that learns more discriminative features and improves performance.

Contribution

The study demonstrates that incorporating extraoral facial regions improves VSR accuracy and proposes a Cutout-based technique to learn more discriminative features from various facial areas.

Findings

01. Extraoral facial regions enhance VSR performance.

02. Using the upper face or cheeks benefits recognition accuracy.

03. Cutout-based training improves feature discrimination.
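
The Cutout idea named in the findings can be illustrated with a minimal sketch. This is not the authors' implementation: it follows the general Cutout recipe (zero out one randomly placed square patch during training) and assumes, for illustration only, that the same patch is shared by every frame of a clip. The function name `cutout_clip`, the clip shape, and the patch size are all hypothetical.

```python
import numpy as np


def cutout_clip(clip, patch_size, rng):
    """Zero out one random square patch, shared by all frames of a (T, H, W) clip.

    Assumption: one patch per clip, applied at the same location in every frame.
    """
    _, height, width = clip.shape
    # Sample the patch centre anywhere in the frame; the patch may be
    # clipped where it crosses the image border.
    cy = int(rng.integers(0, height))
    cx = int(rng.integers(0, width))
    half = patch_size // 2
    y0, y1 = max(cy - half, 0), min(cy + half, height)
    x0, x1 = max(cx - half, 0), min(cx + half, width)
    out = clip.copy()  # leave the input untouched
    out[:, y0:y1, x0:x1] = 0.0
    return out


rng = np.random.default_rng(0)
# Hypothetical input: 29 grayscale frames of a 112x112 face crop.
clip = np.ones((29, 112, 112), dtype=np.float32)
masked = cutout_clip(clip, patch_size=56, rng=rng)
```

Because the occluded region changes from sample to sample, a model trained this way cannot rely on any single facial area, which is one plausible reading of why the technique encourages features from the whole face rather than the mouth alone.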

Abstract

Recent advances in deep learning have heightened interest among researchers in the field of visual speech recognition (VSR). Currently, most existing methods equate VSR with automatic lip reading, which attempts to recognise speech by analysing lip motion. However, human experience and psychological studies suggest that we do not always fix our gaze at each other's lips during a face-to-face conversation, but rather scan the whole face repetitively. This inspires us to revisit a fundamental yet somehow overlooked problem: can VSR models benefit from reading extraoral facial regions, i.e. beyond the lips? In this paper, we perform a comprehensive study to evaluate the effects of different facial regions with state-of-the-art VSR models, including the mouth, the whole face, the upper face, and even the cheeks. Experiments are conducted on both word-level and sentence-level benchmarks with…

Code & Models

Repositories

sailordiary/deep-face-vsr
PyTorch · Official

Taxonomy

Topics: Speech and Audio Processing · Face recognition and analysis · Indoor and Outdoor Localization Technologies

Methods: Cutout