Transformer-Based Video Front-Ends for Audio-Visual Speech Recognition for Single and Multi-Person Video

Dmitriy Serdyuk; Otavio Braga; Olivier Siohan

arXiv:2201.10439·cs.CV·November 2, 2022·6 cites

Open Access

TL;DR

This paper introduces a transformer-based video front-end for audio-visual speech recognition, replacing traditional 3D CNNs, leading to improved accuracy on large-scale YouTube and LRS3-TED datasets, including multi-person scenarios.

Contribution

The work proposes using video transformer networks instead of 3D CNNs for visual feature extraction in AV-ASR, achieving state-of-the-art results.

Findings

01. Achieved 17.0% WER on LRS3-TED with the proposed model.

02. Improved recognition accuracy by 10-15% over convolutional baselines.

03. Reduced error rates in multi-person AV-ASR scenarios.
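The findings above are stated in terms of word error rate (WER). As a reminder of what that metric measures, here is a minimal sketch of the standard word-level edit-distance definition. It is not the authors' evaluation code; the helper names `word_error_rate` and `relative_improvement` and the numbers in the usage lines are hypothetical, and reading the 10-15% figure as a relative reduction is an assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer is lower than baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical usage: one deleted word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
# Hypothetical WER values, not taken from the paper.
gain = relative_improvement(50.0, 45.0)
```

Under this definition, a relative improvement compares two WER values as a ratio, so a drop from 50% to 45% WER is a 10% relative gain, not a 5% one.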

Abstract

Audio-visual automatic speech recognition (AV-ASR) extends speech recognition by introducing the video modality as an additional source of information. In this work, the information contained in the motion of the speaker's mouth is used to augment the audio features. The video modality is traditionally processed with a 3D convolutional neural network (e.g., a 3D version of VGG). Recently, image transformer networks (arXiv:2010.11929) demonstrated the ability to extract rich visual features for image classification tasks. Here, we propose to replace the 3D convolution with a video transformer to extract visual features. We train our baselines and the proposed model on a large-scale corpus of YouTube videos. The performance of our approach is evaluated on a labeled subset of YouTube videos as well as on the LRS3-TED public corpus. Our best video-only model obtains 31.4% WER on YTDEV18 and…
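To illustrate how a transformer front-end can consume video in place of a 3D convolution, the sketch below cuts a clip into non-overlapping spatio-temporal patches and linearly projects each one into a token, in the spirit of the image transformer cited in the abstract. The abstract does not describe the tokenization, so everything here is an assumption for illustration: the function name `video_to_patch_tokens`, the patch size (8 frames by 32 by 32 pixels), the embedding width, and the random projection matrix standing in for a learned one.

```python
import numpy as np


def video_to_patch_tokens(video: np.ndarray, t: int = 8, p: int = 32) -> np.ndarray:
    """Split a (T, H, W, C) clip into flattened t x p x p x C patches.

    Returns an array of shape (num_patches, t * p * p * C).
    The patch sizes are illustrative assumptions, not values from the paper.
    """
    T, H, W, C = video.shape
    if T % t or H % p or W % p:
        raise ValueError("clip dimensions must be divisible by the patch size")
    # Separate each axis into (number of patches, size within a patch).
    x = video.reshape(T // t, t, H // p, p, W // p, p, C)
    # Bring the three patch-index axes to the front.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    # Flatten every patch into a single vector.
    return x.reshape(-1, t * p * p * C)


rng = np.random.default_rng(0)
clip = rng.random((16, 64, 64, 3)).astype(np.float32)  # synthetic mouth-region clip
tokens = video_to_patch_tokens(clip)                   # shape (8, 24576)

# Random stand-in for a learned linear embedding (hypothetical width of 128).
d_model = 128
w_embed = rng.normal(scale=0.02, size=(tokens.shape[1], d_model)).astype(np.float32)
embedded = tokens @ w_embed                            # shape (8, 128)
```

The resulting token sequence is what a standard transformer encoder would take as input, typically after adding positional information, which this sketch omits.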

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis

Methods: 3D Convolution · Convolution