Listening and Seeing Again: Generative Error Correction for Audio-Visual Speech Recognition
Rui Liu, Hongyu Yuan, Haizhou Li

TL;DR
This paper introduces AVGER, a novel generative error correction method for audio-visual speech recognition that leverages multimodal representations and LLMs to significantly reduce word error rates.
Contribution
It proposes a new AVSR error correction paradigm using multimodal representations and a multi-level training constraint, improving accuracy over existing systems.
Findings
Achieves a 24% reduction in Word Error Rate (WER) on the LRS3 dataset.
Outperforms current mainstream AVSR systems.
Introduces a multimodal correction framework with enhanced interpretability.
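To make the WER figure above concrete, here is a minimal sketch of how word error rate and a relative reduction between two systems are commonly computed. It is not the paper's evaluation code: the function names `word_error_rate` and `relative_reduction` are hypothetical, the whitespace tokenization is a simplifying assumption, and the summary does not state whether the 24% is a relative or absolute reduction (the sketch assumes relative).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words.

    Assumption: words are separated by whitespace, with no text
    normalization (casing, punctuation) applied beforehand.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, corrected_wer: float) -> float:
    """Fraction by which corrected_wer improves on baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - corrected_wer) / baseline_wer


# Illustrative values only; these are not results from the paper.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
example_gain = relative_reduction(0.05, 0.038)
```

In this toy case the hypothesis has one substitution and one deletion against six reference words, and the made-up WER pair 0.05 and 0.038 corresponds to a relative reduction of about 24%.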
Abstract
Unlike traditional Automatic Speech Recognition (ASR), Audio-Visual Speech Recognition (AVSR) takes audio and visual signals simultaneously to infer the transcription. Recent studies have shown that Large Language Models (LLMs) can be effectively used for Generative Error Correction (GER) in ASR by predicting the best transcription from ASR-generated N-best hypotheses. However, these LLMs lack the ability to understand audio and visual signals simultaneously, making the GER approach challenging to apply in AVSR. In this work, we propose a novel GER paradigm for AVSR, termed AVGER, that follows the concept of "listening and seeing again". Specifically, we first use a powerful AVSR system to read the audio and visual signals to get the N-best hypotheses, and then use the Q-former-based Multimodal Synchronous Encoder to read the audio and visual information again and convert them into an audio…
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
Methods: Graph Convolutional Network · Gait Emotion Recognition
