Uncovering the Visual Contribution in Audio-Visual Speech Recognition

Zhaofeng Lin; Naomi Harte

arXiv:2412.17129 · eess.AS · April 29, 2025


Open Access

TL;DR

This paper evaluates how much visual cues contribute to audio-visual speech recognition, revealing that current systems do not fully utilize visual information despite performance improvements in noisy environments.

Contribution

The paper introduces a new perspective by quantifying the visual contribution as effective SNR gains, and it highlights the gap between how much visual information AVSR systems actually use and the recognition accuracy they report.

Findings

01. Visual contribution varies independently of WER.

02. Current AVSR systems underutilize visual cues.

03. Low WER does not guarantee high effective SNR gains.

Abstract

Audio-Visual Speech Recognition (AVSR) combines auditory and visual speech cues to enhance the accuracy and robustness of speech recognition systems. Recent advancements in AVSR have improved performance in noisy environments compared to audio-only counterparts. However, the true extent of the visual contribution, and whether AVSR systems fully exploit the available cues in the visual domain, remains unclear. This paper assesses AVSR systems from a different perspective, by considering human speech perception. We use three systems: Auto-AVSR, AVEC and AV-RelScore. We first quantify the visual contribution using effective SNR gains at 0 dB and then investigate the use of visual information in terms of its temporal distribution and word-level informativeness. We show that low WER does not guarantee high SNR gains. Our results suggest that current methods do not fully exploit visual…
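
The abstract does not spell out how an effective SNR gain is computed. One common reading, borrowed from speech perception studies, is: take the audio-visual WER at the reference SNR (0 dB here), find the SNR at which the audio-only system reaches the same WER, and report the difference in dB. The sketch below follows that reading under stated assumptions; the function name, the linear interpolation between measured points, and all numbers are hypothetical and are not taken from the paper.

```python
def effective_snr_gain(snrs_db, audio_only_wer, av_wer_at_ref, ref_snr_db=0.0):
    """Estimate an effective SNR gain in dB (illustrative definition, an assumption).

    snrs_db        -- SNRs (dB) at which the audio-only system was evaluated
    audio_only_wer -- audio-only WER (%) at each of those SNRs
    av_wer_at_ref  -- audio-visual WER (%) measured at ref_snr_db
    Returns the SNR at which audio-only matches the audio-visual WER, minus
    ref_snr_db, or None if that WER lies outside the measured audio-only range.
    """
    points = sorted(zip(snrs_db, audio_only_wer))
    # Walk adjacent measurement pairs and interpolate linearly within the
    # first segment whose WER range brackets the audio-visual WER.
    for (s0, w0), (s1, w1) in zip(points, points[1:]):
        if min(w0, w1) <= av_wer_at_ref <= max(w0, w1):
            if w0 == w1:
                matched_snr = s0
            else:
                matched_snr = s0 + (av_wer_at_ref - w0) * (s1 - s0) / (w1 - w0)
            return matched_snr - ref_snr_db
    return None


# Made-up numbers for illustration only (not results from the paper).
snrs = [-5, 0, 5, 10]
audio_wer = [60.0, 30.0, 12.0, 6.0]
gain = effective_snr_gain(snrs, audio_wer, av_wer_at_ref=21.0)
print(f"effective SNR gain: {gain:.1f} dB")
```

With these made-up values the audio-only system needs 2.5 dB to match what the audio-visual system achieves at 0 dB, so the gain is 2.5 dB. Under this definition two systems with similar WER can still differ in gain, because the gain depends on each system's own audio-only curve.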

Taxonomy

Topics: Speech and Audio Processing