Whispering LLaMA: A Cross-Modal Generative Error Correction Framework for Speech Recognition

Srijith Radhakrishnan; Chao-Han Huck Yang; Sumeer Ahmad Khan; Rohit Kumar; Narsis A. Kiani; David Gomez-Cabrero; Jesper N. Tegner

arXiv:2310.06434 · cs.CL · December 2, 2025

TL;DR

This paper presents a cross-modal fusion framework for generative error correction in automatic speech recognition (ASR). It combines acoustic and linguistic information and reports a 37.66% relative word error rate improvement over the n-best hypotheses.

Contribution

It introduces a new generative error correction approach leveraging cross-modal fusion with pre-trained models, surpassing existing ranking-based rescoring techniques.

Findings

01. Achieved a 37.66% relative word error rate improvement (WERR) over the n-best hypotheses.

02. Evaluated stability and reproducibility across diverse ASR datasets.

03. Open-sourced code and models for future research.

Abstract

We introduce a new cross-modal fusion technique designed for generative error correction in automatic speech recognition (ASR). Our methodology leverages both acoustic information and external linguistic representations to generate accurate speech transcription contexts. This marks a step towards a fresh paradigm in generative error correction within the realm of n-best hypotheses. Unlike existing ranking-based rescoring methods, our approach uses distinct initialization techniques and parameter-efficient algorithms to boost the ASR performance derived from pre-trained speech and text models. Through evaluation across diverse ASR datasets, we assess the stability and reproducibility of our fusion technique, demonstrating a relative word error rate improvement (WERR) of 37.66% over the n-best hypotheses. To encourage future research, we have made…
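
The abstract reports its headline result as a relative word error rate improvement (WERR) but does not spell out the formula. The sketch below assumes the conventional definitions: WER as word-level edit distance divided by reference length, and WERR as the baseline WER minus the corrected WER, divided by the baseline WER. The function names and example sentences are illustrative only and are not taken from the paper or its repository.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level Levenshtein distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_improvement(baseline_wer: float, corrected_wer: float) -> float:
    """WERR in percent, assuming WERR = (baseline - corrected) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - corrected_wer) / baseline_wer


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    first_hypothesis = "the cat sat on a mat"  # hypothetical ASR output
    print(word_error_rate(reference, first_hypothesis))
    print(relative_wer_improvement(baseline_wer=0.20, corrected_wer=0.15))
```

Under this definition, a correction model that lowers WER from 20% to 15% yields a WERR of about 25%, so the reported 37.66% describes a proportional reduction relative to the baseline rather than an absolute drop in WER.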

Code & Models

Repositories

srijith-rkr/whispering-llama
PyTorch · Official

Models

🤗 Srijith-rkr/Whispering-LLaMA
model · ♡ 8

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing