Audio-Visual Speech Recognition is Worth 32$\times$32$\times$8 Voxels

Dmitriy Serdyuk; Otavio Braga; Olivier Siohan

arXiv:2109.09536 · cs.CV · September 21, 2021

TL;DR

This paper demonstrates that replacing traditional 3D convolutional visual feature extractors with video transformers in audio-visual speech recognition systems can achieve superior or comparable performance, indicating the viability of convolution-free models.

Contribution

The work introduces a video transformer front-end for AV-ASR, replacing convolutional networks, and shows its effectiveness on large-scale datasets, matching or surpassing state-of-the-art results.

Findings

01. The transformer front-end outperforms the convolutional baseline on lip-reading.

02. Transformer-based AV-ASR matches or exceeds convolutional models.

03. Fine-tuning achieves state-of-the-art performance on LRS3-TED.

Abstract

Audio-visual automatic speech recognition (AV-ASR) introduces the video modality into the speech recognition process, often by relying on information conveyed by the motion of the speaker's mouth. The use of the video signal requires extracting visual features, which are then combined with the acoustic features to build an AV-ASR system [1]. This is traditionally done with some form of 3D convolutional network (e.g. VGG) as widely used in the computer vision community. Recently, image transformers [2] have been introduced to extract visual features useful for image classification tasks. In this work, we propose to replace the 3D convolutional visual front-end with a video transformer front-end. We train our systems on a large-scale dataset composed of YouTube videos and evaluate performance on the publicly available LRS3-TED set, as well as on a large set of YouTube videos. On a…
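To make the "32$\times$32$\times$8 voxels" of the title concrete, here is a minimal sketch of how a video clip might be cut into voxels before a transformer front-end. It assumes each voxel covers 32$\times$32 pixels over 8 consecutive frames, that voxels do not overlap, and that the video is grayscale; none of these details are confirmed by the text above. The names `tubelet_grid` and `extract_voxels` are hypothetical, and the linear projection and transformer layers that would follow are omitted. This is an illustration, not the authors' implementation.

```python
from typing import List, Tuple

Video = List[List[List[float]]]  # indexed as video[t][y][x], grayscale (assumption)


def tubelet_grid(num_frames: int, height: int, width: int,
                 patch_t: int = 8, patch_h: int = 32,
                 patch_w: int = 32) -> Tuple[int, int, int]:
    """Return how many non-overlapping voxels fit along (time, height, width)."""
    if num_frames % patch_t or height % patch_h or width % patch_w:
        raise ValueError("video dimensions must be multiples of the voxel size")
    return num_frames // patch_t, height // patch_h, width // patch_w


def extract_voxels(video: Video, patch_t: int = 8, patch_h: int = 32,
                   patch_w: int = 32) -> List[List[float]]:
    """Cut the video into voxels and flatten each one into a vector.

    Voxels are ordered time-major, then by row, then by column.
    Each returned vector has patch_t * patch_h * patch_w entries.
    """
    n_t, n_h, n_w = tubelet_grid(len(video), len(video[0]), len(video[0][0]),
                                 patch_t, patch_h, patch_w)
    voxels = []
    for it in range(n_t):
        for iy in range(n_h):
            for ix in range(n_w):
                flat = [video[t][y][x]
                        for t in range(it * patch_t, (it + 1) * patch_t)
                        for y in range(iy * patch_h, (iy + 1) * patch_h)
                        for x in range(ix * patch_w, (ix + 1) * patch_w)]
                voxels.append(flat)
    return voxels


# Toy clip: 16 frames of 64x64 pixels (sizes chosen only for illustration).
clip = [[[t * 10000 + y * 100 + x for x in range(64)]
         for y in range(64)] for t in range(16)]
tokens = extract_voxels(clip)
```

With this toy clip the grid is 2$\times$2$\times$2, so `tokens` holds 8 vectors of 8·32·32 = 8192 values each. In a transformer front-end, each such vector would typically be linearly projected to an embedding and combined with a positional encoding, but those steps are outside this sketch.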

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Advanced Data Compression Techniques