Uncovering the Visual Contribution in Audio-Visual Speech Recognition
Zhaofeng Lin, Naomi Harte

TL;DR
This paper evaluates how much visual cues contribute to audio-visual speech recognition, revealing that current systems do not fully utilize visual information despite performance improvements in noisy environments.
Contribution
The paper quantifies visual contribution through effective signal-to-noise ratio (SNR) gains and highlights the gap between how much visual information AVSR systems use and how accurately they recognize speech.
Findings
Visual contribution varies independently of word error rate (WER).
Current AVSR systems underutilize visual cues.
Low WER does not guarantee high effective SNR gains.
Abstract
Audio-Visual Speech Recognition (AVSR) combines auditory and visual speech cues to enhance the accuracy and robustness of speech recognition systems. Recent advancements in AVSR have improved performance in noisy environments compared to audio-only counterparts. However, the true extent of the visual contribution, and whether AVSR systems fully exploit the available cues in the visual domain, remains unclear. This paper assesses AVSR systems from a different perspective, by considering human speech perception. We use three systems: Auto-AVSR, AVEC and AV-RelScore. We first quantify the visual contribution using effective SNR gains at 0 dB and then investigate the use of visual information in terms of its temporal distribution and word-level informativeness. We show that low WER does not guarantee high SNR gains. Our results suggest that current methods do not fully exploit visual…
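The abstract does not spell out how an effective SNR gain is computed. One common reading, assumed here rather than taken from the paper, is: find the SNR at which the audio-only system reaches the WER that the audio-visual system achieves at the reference SNR (0 dB), then report the difference in dB. The sketch below implements that reading with linear interpolation between measured points. The function name `effective_snr_gain` and all numbers in the usage example are hypothetical and are not results from the paper.

```python
def effective_snr_gain(snr_db, wer_audio, wer_av_at_ref, ref_snr_db=0.0):
    """Estimate the effective SNR gain (in dB) from adding the visual stream.

    Assumed definition (not confirmed by the source): the gain is the SNR at
    which the audio-only system matches the audio-visual WER measured at
    ref_snr_db, minus ref_snr_db.
    """
    # Pair each SNR with its audio-only WER and order by increasing SNR.
    points = sorted(zip(snr_db, wer_audio))

    # Interpolation is only well defined if WER falls as SNR rises.
    for (_, wer_lo), (_, wer_hi) in zip(points, points[1:]):
        if wer_hi >= wer_lo:
            raise ValueError("audio-only WER must strictly decrease as SNR rises")

    # Find the segment that brackets the target WER and interpolate linearly.
    for (snr_lo, wer_lo), (snr_hi, wer_hi) in zip(points, points[1:]):
        if wer_hi <= wer_av_at_ref <= wer_lo:
            t = (wer_lo - wer_av_at_ref) / (wer_lo - wer_hi)
            matched_snr = snr_lo + t * (snr_hi - snr_lo)
            return matched_snr - ref_snr_db

    raise ValueError("audio-visual WER lies outside the measured audio-only range")


# Hypothetical measurements, for illustration only.
snr_levels = [-5, 0, 5, 10, 15]          # dB
audio_only_wer = [60, 30, 15, 8, 5]      # percent
av_wer_at_0db = 11.5                     # percent

print(effective_snr_gain(snr_levels, audio_only_wer, av_wer_at_0db))  # 7.5
```

Under this reading, two systems with similar WER at 0 dB can still show different gains if their audio-only baselines differ, which is consistent with the abstract's point that low WER does not guarantee a high SNR gain.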
Taxonomy
Topics: Speech and Audio Processing
