Listening and Seeing Again: Generative Error Correction for Audio-Visual Speech Recognition

Rui Liu; Hongyu Yuan; Haizhou Li

arXiv:2501.04038·cs.MM·January 9, 2025

TL;DR

This paper introduces AVGER, a novel generative error correction method for audio-visual speech recognition that leverages multimodal representations and LLMs to significantly reduce word error rates.

Contribution

It proposes a new AVSR error correction paradigm using multimodal representations and a multi-level training constraint, improving accuracy over existing systems.

Findings

01 Achieves a 24% reduction in Word Error Rate (WER) on the LRS3 dataset.

02 Outperforms current mainstream AVSR systems.

03 Introduces a multimodal correction framework with enhanced interpretability.
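To make the WER figure in the first finding concrete, here is a minimal sketch of how word error rate and a reduction between two systems are commonly computed. It assumes the 24% is a relative reduction, which this page does not state explicitly. The function names `word_error_rate` and `relative_reduction` and the sample sentences are hypothetical illustrations, not taken from the paper or its repository.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Illustrative strings only, not data from the paper.
    wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    print(f"WER: {wer:.3f}")
    # Made-up WER values chosen only to show the arithmetic.
    print(f"Relative reduction: {relative_reduction(2.0, 1.5):.0%}")
```

Under this reading, a drop from a baseline WER of 2.0% to 1.5% would count as a 25% reduction, even though the absolute difference is only 0.5 percentage points.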

Abstract

Unlike traditional Automatic Speech Recognition (ASR), Audio-Visual Speech Recognition (AVSR) takes audio and visual signals simultaneously to infer the transcription. Recent studies have shown that Large Language Models (LLMs) can be used effectively for Generative Error Correction (GER) in ASR by predicting the best transcription from ASR-generated N-best hypotheses. However, these LLMs cannot understand audio and visual signals simultaneously, which makes the GER approach difficult to apply to AVSR. In this work, we propose a novel GER paradigm for AVSR, termed AVGER, that follows the concept of "listening and seeing again". Specifically, we first use a powerful AVSR system to read the audio and visual signals and obtain the N-best hypotheses, and then use the Q-former-based Multimodal Synchronous Encoder to read the audio and visual information again and convert them into an audio…
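The text-only part of GER described in the abstract, where an LLM predicts the best transcription from N-best hypotheses, can be sketched roughly as follows. This is an illustration under stated assumptions: the function name `build_ger_prompt`, the prompt wording, and the sample hypotheses are hypothetical and are not taken from the paper or its repository. It covers only the hypothesis list and does not model the audio and visual information that AVGER reads again.

```python
from typing import Sequence


def build_ger_prompt(hypotheses: Sequence[str]) -> str:
    """Format ranked N-best hypotheses into a single correction prompt.

    Hypothetical template: the wording is an assumption, not the
    prompt used by AVGER.
    """
    if not hypotheses:
        raise ValueError("at least one hypothesis is required")
    # Number the hypotheses in rank order, best first.
    ranked = [
        f"{rank}. {text.strip()}"
        for rank, text in enumerate(hypotheses, start=1)
    ]
    return (
        "Below are the N-best hypotheses from a speech recognition system.\n"
        + "\n".join(ranked)
        + "\nCorrected transcription:"
    )


if __name__ == "__main__":
    # Illustrative hypotheses only.
    n_best = [
        "i scream for ice cream",
        "eye scream for ice cream",
        "i scream four ice cream",
    ]
    print(build_ger_prompt(n_best))
```

The resulting string would be passed to an LLM, whose completion is taken as the corrected transcription.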

Code & Models

Repositories

circleredrain/avger (Official)

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing

Methods: Graph Convolutional Network · Gait Emotion Recognition