LipGER: Visually-Conditioned Generative Error Correction for Robust Automatic Speech Recognition
Sreyan Ghosh, Sonal Kumar, Ashish Seth, Purva Chiniya, Utkarsh Tyagi, Ramani Duraiswami, Dinesh Manocha

TL;DR
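LipGER is a visually-conditioned generative error correction framework: a large language model predicts the transcription from the N-best hypotheses of an ASR beam search, conditioned on lip motion. The goal is noise-robust automatic speech recognition that addresses the lack of large-scale paired audio-visual datasets.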
Contribution
The paper proposes an LLM-based approach to visually-conditioned ASR error correction that addresses two challenges of traditional AVSR learning: the lack of large-scale paired datasets and the difficulty of adapting to new domains.
Findings
LipGER improves Word Error Rate by up to 49.2%.
It demonstrates effectiveness across four diverse datasets.
The release of the LipHyp dataset supports further research.
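
For readers unfamiliar with the metric behind these findings, Word Error Rate is conventionally the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the paper's evaluation code; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: (substitutions + deletions + insertions) / reference length.

    Illustrative helper only; tokenization by whitespace is an assumption.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Lower is better, so an improvement in WER means the corrected transcript needs fewer word edits to match the reference.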
Abstract
Visual cues, like lip motion, have been shown to improve the performance of Automatic Speech Recognition (ASR) systems in noisy environments. We propose LipGER (Lip Motion aided Generative Error Correction), a novel framework for leveraging visual cues for noise-robust ASR. Instead of learning the cross-modal correlation between the audio and visual modalities, we make an LLM learn the task of visually-conditioned (generative) ASR error correction. Specifically, we instruct an LLM to predict the transcription from the N-best hypotheses generated using ASR beam-search. This is further conditioned on lip motions. This approach addresses key challenges in traditional AVSR learning, such as the lack of large-scale paired datasets and difficulties in adapting to new domains. We experiment on 4 datasets in various settings and show that LipGER improves the Word Error Rate in the range of…
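
The text-only part of the idea described in the abstract, turning an N-best list from beam search into an instruction for an LLM, can be sketched as follows. This is a hedged illustration under stated assumptions: the function name `build_ger_prompt` and the prompt wording are hypothetical and are not the paper's template, and the lip-motion conditioning is deliberately left out because the abstract does not say how it is injected into the model.

```python
from typing import Sequence


def build_ger_prompt(hypotheses: Sequence[str]) -> str:
    """Format ASR N-best hypotheses as an error-correction instruction.

    Hypothetical template for illustration; the wording is an assumption,
    not the prompt used by LipGER. Visual conditioning is not modeled here.
    """
    # Keep the beam-search ranking order and drop empty entries.
    cleaned = [h.strip() for h in hypotheses if h.strip()]
    if not cleaned:
        raise ValueError("at least one non-empty hypothesis is required")

    numbered = "\n".join(f"{rank}. {text}" for rank, text in enumerate(cleaned, start=1))
    return (
        "Below are the N-best hypotheses from a speech recognition system.\n"
        "Predict the true transcription.\n\n"
        f"{numbered}\n\n"
        "Transcription:"
    )


if __name__ == "__main__":
    n_best = [
        "i scream for ice cream",
        "eye scream for ice cream",
        "i scream four ice cream",
    ]
    print(build_ger_prompt(n_best))
```

The resulting string would be passed to whatever LLM is being fine-tuned or queried; in the approach the abstract describes, that prediction is additionally conditioned on lip motion.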
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
