Transformer-Based Video Front-Ends for Audio-Visual Speech Recognition for Single and Multi-Person Video
Dmitriy Serdyuk, Otavio Braga, Olivier Siohan

TL;DR
This paper introduces a transformer-based video front-end for audio-visual speech recognition that replaces the traditional 3D CNN and improves accuracy on a large-scale YouTube corpus and on LRS3-TED, including in multi-person scenarios.
Contribution
The work proposes using video transformer networks instead of 3D CNNs for visual feature extraction in AV-ASR, achieving state-of-the-art results.
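To make the idea of a transformer video front-end more concrete, the following is a minimal sketch of the tokenization step that such a model typically starts with: the clip is cut into non-overlapping 3D patches ("tubelets"), and each one is linearly projected to a token vector, in the style of the image transformer cited in the abstract (arXiv:2010.11929). This summary does not state the paper's actual patch sizes, embedding width, or input resolution, so the tubelet size (4, 16, 16), the 64-dimensional embedding, the random weights, and the function names `extract_tubelets` and `embed_tubelets` are all assumptions made for illustration.

```python
import numpy as np


def extract_tubelets(video, tubelet=(4, 16, 16)):
    """Split a (T, H, W, C) clip into flattened, non-overlapping 3D patches."""
    T, H, W, C = video.shape
    t, h, w = tubelet  # hypothetical sizes, not taken from the paper
    if T % t or H % h or W % w:
        raise ValueError("clip dimensions must be divisible by the tubelet size")
    # Expose the block structure along time, height, and width.
    blocks = video.reshape(T // t, t, H // h, h, W // w, w, C)
    # Move the three block indices to the front, then the voxels of each block.
    blocks = blocks.transpose(0, 2, 4, 1, 3, 5, 6)
    # One row per tubelet, each row holding t * h * w * C values.
    return blocks.reshape(-1, t * h * w * C)


def embed_tubelets(patches, weight, bias):
    """Linearly project each flattened tubelet to a token vector."""
    return patches @ weight + bias


rng = np.random.default_rng(0)
# Stand-in for 8 frames of a 32x32 RGB crop around the speaker's mouth.
clip = rng.normal(size=(8, 32, 32, 3))
patches = extract_tubelets(clip)  # shape (8, 3072)
# Random stand-ins for what would be learned projection parameters.
weight = rng.normal(size=(patches.shape[1], 64)) * 0.02
bias = np.zeros(64)
tokens = embed_tubelets(patches, weight, bias)  # shape (8, 64)
```

In a full model, the resulting token sequence would receive positional information and pass through transformer encoder layers before being combined with the audio features. Those stages are omitted here.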
Findings
Achieved 17.0% WER on LRS3-TED with the proposed model.
Improved recognition accuracy by 10-15% over convolutional baselines.
Reduced error rates in multi-person AV-ASR scenarios.
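The findings above are reported as word error rate (WER). As a reminder of what that number measures, here is a minimal sketch of the standard definition: the word-level edit distance (substitutions, deletions, insertions) between the reference and the hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions for illustration. The paper's own scoring pipeline and text normalization are not described in this summary.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Under this definition, a WER of 17.0% corresponds to roughly 17 word-level edits per 100 reference words. Because insertions count as errors, WER can exceed 100%.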
Abstract
Audio-visual automatic speech recognition (AV-ASR) extends speech recognition by introducing the video modality as an additional source of information. In this work, the information contained in the motion of the speaker's mouth is used to augment the audio features. The video modality is traditionally processed with a 3D convolutional neural network (e.g., a 3D version of VGG). Recently, image transformer networks (arXiv:2010.11929) demonstrated the ability to extract rich visual features for image classification tasks. Here, we propose to replace the 3D convolution with a video transformer to extract visual features. We train our baselines and the proposed model on a large-scale corpus of YouTube videos. The performance of our approach is evaluated on a labeled subset of YouTube videos as well as on the LRS3-TED public corpus. Our best video-only model obtains 31.4% WER on YTDEV18 and…
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
Methods: 3D Convolution · Convolution
