Conformers are All You Need for Visual Speech Recognition

Oscar Chang; Hank Liao; Dmitriy Serdyuk; Ankit Shah; Olivier Siohan

arXiv:2302.10915·cs.LG·December 14, 2023·1 cites

TL;DR

This paper demonstrates that a simple linear visual front-end combined with a large Conformer encoder achieves state-of-the-art results in visual speech recognition, challenging the need for complex front-end features.

Contribution

The paper shows that complex visual front-ends are unnecessary: a linear front-end paired with a larger Conformer encoder lowers latency, uses memory more efficiently, and improves WER in visual speech recognition.

Findings

01. Achieves 12.8% WER on the TED LRS3 dataset.

02. A linear front-end with a larger Conformer encoder outperforms complex front-ends.

03. The state-of-the-art result rivals audio-only models from four years ago.
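
The findings above are stated in terms of word error rate (WER). As a reminder of what that number measures, here is a minimal sketch of the standard definition: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the plain whitespace tokenization are assumptions made for illustration; this is not the paper's scoring pipeline, which may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()  # assumption: whitespace tokenization, no normalization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

For example, scoring the hypothesis "the cat sit on mat" against the reference "the cat sat on the mat" counts one substitution and one deletion over six reference words, so the WER is 2/6. A 12.8% WER therefore corresponds to roughly one word error in every eight reference words.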

Abstract

Visual speech recognition models extract visual features in a hierarchical manner. At the lower level, there is a visual front-end with a limited temporal receptive field that processes the raw pixels depicting the lips or faces. At the higher level, there is an encoder that attends to the embeddings produced by the front-end over a large temporal receptive field. Previous work has focused on improving the visual front-end of the model to extract more useful features for speech recognition. Surprisingly, our work shows that complex visual front-ends are not necessary. Instead of allocating resources to a sophisticated visual front-end, we find that a linear visual front-end paired with a larger Conformer encoder results in lower latency, more efficient memory usage, and improved WER performance. We achieve a new state-of-the-art of 12.8% WER for visual speech recognition on the TED LRS3…
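
To make the "linear visual front-end" idea concrete, the sketch below shows one way such a front-end could look: each frame is flattened into a pixel vector and mapped to an embedding by a single affine projection, with no temporal context at all. This is an illustration under stated assumptions, not the authors' implementation: the per-frame flattening, the function name `linear_front_end`, and the toy shapes are all hypothetical, and the Conformer encoder that would consume the embeddings is not shown. Plain Python lists are used for readability; a real model would use an array library.

```python
def linear_front_end(frames, weights, bias):
    """Project each frame independently: e_t = W x_t + b.

    frames:  list of T frames, each a list of rows of pixel values (H x W).
    weights: list of D rows, each of length H * W (hypothetical projection matrix).
    bias:    list of D values.
    Returns a list of T embeddings, each of length D.
    """
    embeddings = []
    for frame in frames:
        # Flatten the H x W frame into a single pixel vector.
        pixels = [value for row in frame for value in row]
        # One affine map per frame; no convolution, no temporal receptive field.
        embeddings.append([
            sum(w * p for w, p in zip(weight_row, pixels)) + b
            for weight_row, b in zip(weights, bias)
        ])
    return embeddings
```

Calling `linear_front_end(frames, weights, bias)` on T frames returns T embeddings, one per frame. Under this reading, all temporal modeling is left to the encoder, which is consistent with the abstract's description of the encoder attending over a large temporal receptive field.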

Taxonomy

Topics: Speech and Audio Processing · Indoor and Outdoor Localization Technologies · Face recognition and analysis