LipGER: Visually-Conditioned Generative Error Correction for Robust Automatic Speech Recognition

Sreyan Ghosh; Sonal Kumar; Ashish Seth; Purva Chiniya; Utkarsh Tyagi; Ramani Duraiswami; Dinesh Manocha

arXiv:2406.04432·eess.AS·June 10, 2024·1 cites

Open Access · 1 Repo

TL;DR

LipGER is a framework for noise-robust automatic speech recognition in which a large language model corrects ASR output from the N-best hypotheses, conditioned on lip motion, without requiring large-scale paired audio-visual training data.

Contribution

The paper proposes an LLM-based approach to visually-conditioned ASR error correction that addresses two challenges of traditional AVSR learning: the lack of large-scale paired datasets and the difficulty of adapting to new domains.

Findings

01

LipGER improves Word Error Rate by up to 49.2%.

02

It demonstrates effectiveness across four diverse datasets.

03

The release of the LipHyp dataset supports further research.
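For readers unfamiliar with the metric in the findings above, Word Error Rate (WER) is conventionally the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the paper's evaluation code; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper (hypothetical name); splits on whitespace and
    applies no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words to reach an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

Lower is better. The page does not state whether the 49.2% figure is a relative or an absolute change in WER, so this sketch computes only the raw metric.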

Abstract

Visual cues, like lip motion, have been shown to improve the performance of Automatic Speech Recognition (ASR) systems in noisy environments. We propose LipGER (Lip Motion aided Generative Error Correction), a novel framework for leveraging visual cues for noise-robust ASR. Instead of learning the cross-modal correlation between the audio and visual modalities, we make an LLM learn the task of visually-conditioned (generative) ASR error correction. Specifically, we instruct an LLM to predict the transcription from the N-best hypotheses generated using ASR beam-search. This is further conditioned on lip motions. This approach addresses key challenges in traditional AVSR learning, such as the lack of large-scale paired datasets and difficulties in adapting to new domains. We experiment on 4 datasets in various settings and show that LipGER improves the Word Error Rate in the range of…
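The text-only part of the idea in the abstract, instructing an LLM to predict the transcription from the N-best hypotheses, can be illustrated with a small prompt-building sketch. This is an assumption-laden illustration, not the authors' implementation: the function name `build_correction_prompt`, the prompt wording, and the example hypotheses are all hypothetical, and the lip-motion conditioning described in the abstract is not modeled here.

```python
from typing import List


def build_correction_prompt(hypotheses: List[str]) -> str:
    """Format an ASR N-best list as an instruction prompt for an LLM.

    Hypothetical template for illustration only; the paper's actual
    prompt and its visual conditioning are not reproduced.
    """
    if not hypotheses:
        raise ValueError("need at least one hypothesis")
    # Number the hypotheses in beam-search rank order, best first.
    numbered = "\n".join(
        f"{rank}. {text}" for rank, text in enumerate(hypotheses, start=1)
    )
    return (
        "Below are the N-best hypotheses from an ASR system for one utterance.\n"
        "Predict the most likely true transcription.\n\n"
        f"{numbered}\n\n"
        "Transcription:"
    )


# Made-up hypotheses standing in for beam-search output on noisy audio.
n_best = [
    "i scream for ice cream",
    "eye scream for ice cream",
    "i scream four ice cream",
]
prompt = build_correction_prompt(n_best)
```

The resulting string would be passed to whichever LLM is being fine-tuned or queried. Presenting the hypotheses as a ranked list lets the model compare alternatives that differ in only a few words.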

Code & Models

Repositories

sreyan88/lipger
PyTorch · Official

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing