Cross-Modal Global Interaction and Local Alignment for Audio-Visual Speech Recognition

Yuchen Hu; Ruizhe Li; Chen Chen; Heqing Zou; Qiushi Zhu; Eng Siong Chng

arXiv:2305.09212·eess.AS·May 17, 2023·2 cites

TL;DR

This paper introduces a novel cross-modal global interaction and local alignment approach for audio-visual speech recognition, capturing deep correlations between modalities to improve noise robustness and overall performance.

Contribution

The paper proposes a method that models audio-visual correlations both globally (at the modality level) and locally (at the frame level), in contrast to earlier AVSR systems that fuse audio and visual features by simple concatenation.

Findings

01. Outperforms state-of-the-art on the LRS3 and LRS2 benchmarks.
02. Enhances noise robustness in speech recognition.
03. Provides a holistic view of cross-modal correlations.

Abstract

Audio-visual speech recognition (AVSR) research has recently achieved great success by improving the noise robustness of audio-only automatic speech recognition (ASR) with noise-invariant visual information. However, most existing AVSR approaches simply fuse the audio and visual features by concatenation, without explicit interactions to capture the deep correlations between them, which results in sub-optimal multimodal representations for the downstream speech recognition task. In this paper, we propose a cross-modal global interaction and local alignment (GILA) approach for AVSR, which captures the deep audio-visual (A-V) correlations from both global and local perspectives. Specifically, we design a global interaction model to capture the A-V complementary relationship at the modality level, as well as a local alignment approach to model the A-V temporal consistency at the frame level. Such a…
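
The abstract does not say how frame-level alignment is computed, so the following is only an illustrative sketch of one common way to encourage temporal consistency between two modalities: a contrastive (InfoNCE-style) loss in which the audio frame at time t should be most similar to the video frame at the same time t. The function names, the cosine similarity, and the `temperature` value are all assumptions made for illustration; this is not the authors' implementation, which lives in the official repository.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def frame_alignment_loss(audio, video, temperature=0.1):
    """Hypothetical frame-level contrastive loss (an assumption, not GILA itself).

    audio, video: lists of T feature vectors, one per frame.
    For each audio frame t, the video frame at the same index is the
    positive and every other video frame is a negative.
    """
    if len(audio) != len(video):
        raise ValueError("audio and video must have the same number of frames")
    num_frames = len(audio)
    total = 0.0
    for t in range(num_frames):
        logits = [cosine(audio[t], video[s]) / temperature
                  for s in range(num_frames)]
        # Numerically stable log-sum-exp over all candidate video frames.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Cross-entropy with the same-index frame as the target.
        total += log_denominator - logits[t]
    return total / num_frames


# Toy features: three frames, two dimensions each.
audio_frames = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned_loss = frame_alignment_loss(audio_frames, audio_frames)
reversed_loss = frame_alignment_loss(audio_frames, audio_frames[::-1])
print(aligned_loss < reversed_loss)
```

In this toy setup the loss is lower when the two streams are in temporal order than when one of them is reversed, which is the behaviour a frame-level consistency objective is meant to reward.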

Code & Models

Repositories

yuchen005/gila (PyTorch, official)

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Hearing Loss and Rehabilitation