Learning Contextually Fused Audio-visual Representations for Audio-visual Speech Recognition

Zi-Qiang Zhang; Jie Zhang; Jian-Shu Zhang; Ming-Hui Wu; Xin Fang; Li-Rong Dai

arXiv:2202.07428 · eess.IV · July 12, 2022

Open Access

TL;DR

This paper introduces a transformer-based approach for learning robust audio-visual speech representations using self-supervised methods, enhancing speech recognition and lipreading by exploiting multi-modal complementarity and long-term context.

Contribution

It proposes a novel audio-visual representation learning method that leverages a transformer fusion module and flexible masking, enabling effective multi-modal and single-modal speech recognition.

Findings

01. Improved performance on speech recognition tasks.

02. Effective fusion of audio and visual modalities.

03. Versatility in single-modal applications.

Abstract

With advances in self-supervised learning for the audio and visual modalities, it has become possible to learn a robust audio-visual speech representation. This would benefit audio-visual speech recognition (AVSR) performance, since multi-modal inputs in principle carry richer information. In this paper, building on existing self-supervised representation learning methods for the audio modality, we propose an audio-visual representation learning approach. The proposed approach exploits both the complementarity of the audio and visual modalities and long-term context dependency, using a transformer-based fusion module and a flexible masking strategy. After pre-training, the model is able to extract the fused representations required by AVSR. Without loss of generality, it can be applied to single-modal tasks, e.g. audio/visual speech recognition by simply masking…
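The masking idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract is cut off after "simply masking", so which input is masked and how is an assumption here, as are zero-filling, span-based masking, and per-frame concatenation as the input to the fusion module. The names `mask_modality`, `mask_spans`, and `fuse_inputs` are hypothetical, and the transformer itself is not shown.

```python
import random
from typing import List, Optional

Frames = List[List[float]]  # one feature vector per time step


def mask_modality(frames: Frames) -> Frames:
    """Replace every frame with zeros, i.e. drop the modality entirely (assumed scheme)."""
    return [[0.0] * len(f) for f in frames]


def mask_spans(frames: Frames, span: int, prob: float, rng: random.Random) -> Frames:
    """Zero out contiguous spans of frames.

    Each unmasked position starts a span of up to `span` frames with
    probability `prob`. This is one possible "flexible" masking scheme,
    chosen for illustration only.
    """
    out = [list(f) for f in frames]
    t = 0
    while t < len(out):
        if rng.random() < prob:
            for k in range(t, min(t + span, len(out))):
                out[k] = [0.0] * len(out[k])
            t += span
        else:
            t += 1
    return out


def fuse_inputs(audio: Frames, video: Frames, drop: Optional[str] = None) -> Frames:
    """Concatenate time-aligned audio and visual frames into one sequence.

    `drop` may be "audio" or "video" to simulate a single-modal input.
    The result is what a fusion module (not shown) would consume.
    """
    if len(audio) != len(video):
        raise ValueError("audio and video must be time-aligned")
    if drop == "audio":
        audio = mask_modality(audio)
    elif drop == "video":
        video = mask_modality(video)
    elif drop is not None:
        raise ValueError("drop must be 'audio', 'video', or None")
    return [a + v for a, v in zip(audio, video)]


if __name__ == "__main__":
    audio = [[1.0, 2.0], [3.0, 4.0]]
    video = [[5.0], [6.0]]
    # Audio-only input: the visual stream is fully masked.
    print(fuse_inputs(audio, video, drop="video"))
```

With this layout the fused sequence keeps the same shape whether both streams or only one is present, which is one way a single pre-trained model could serve AVSR, audio-only recognition, and lipreading.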

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Hearing Loss and Rehabilitation