Revisiting ASR Error Correction with Specialized Models

Zijin Gu; Tatiana Likhomanenko; He Bai; Erik McDermott; Ronan Collobert; Navdeep Jaitly

arXiv:2405.15216·cs.LG·March 18, 2026

TL;DR

This paper introduces a compact seq2seq model for ASR error correction that outperforms large language models in accuracy, generalizes across architectures and domains, and reduces latency and hallucination issues.

Contribution

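The paper presents a compact seq2seq correction model trained on ASR errors from real and synthetic audio, together with correction-first decoding, in which the correction model's candidates are rescored using ASR acoustic scores. The authors report that it outperforms large language models on ASR error correction.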

Findings

01 Achieves 1.5/3.3% WER on LibriSpeech test-clean/other.

02 Outperforms large language models in correction accuracy.

03 Generalizes across ASR architectures (CTC, Seq2seq, Transducer) and diverse domains.
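
For readers unfamiliar with the metric behind the first finding, word error rate (WER) is the word-level edit distance between the reference transcript and the hypothesis, divided by the number of reference words. The sketch below uses the standard definition; the function name `word_error_rate` is illustrative, and any text normalization the paper applies before scoring is not stated on this page and is not modeled here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)
```

With this definition, one wrong word in a six-word reference gives a WER of 1/6, so a figure such as 1.5% corresponds to roughly one or two word errors per hundred reference words.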

Abstract

Language models play a central role in automatic speech recognition (ASR), yet most methods rely on text-only models unaware of ASR error patterns. Recently, large language models (LLMs) have been applied to ASR correction, but introduce latency and hallucination concerns. We revisit ASR error correction with compact seq2seq models, trained on ASR errors from real and synthetic audio. To scale training, we construct synthetic corpora via cascaded TTS and ASR, finding that matching the diversity of realistic error distributions is key. We propose correction-first decoding, where the correction model generates candidates rescored using ASR acoustic scores. With 15x fewer parameters than LLMs, our model achieves 1.5/3.3% WER on LibriSpeech test-clean/other, outperforms LLMs, generalizes across ASR architectures (CTC, Seq2seq, Transducer) and diverse domains, and provides precise…
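
The correction-first decoding idea in the abstract can be illustrated with a minimal sketch: the correction model proposes candidate transcripts, and each candidate is then rescored with an acoustic score from the ASR model. The abstract does not give the scoring formula, so the log-linear combination, the `acoustic_weight` parameter, and the names `rescore_candidates` and `acoustic_score` are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Callable, Sequence, Tuple


def rescore_candidates(
    candidates: Sequence[Tuple[str, float]],
    acoustic_score: Callable[[str], float],
    acoustic_weight: float = 1.0,
) -> str:
    """Pick the candidate with the best combined score.

    candidates: (text, correction-model log-probability) pairs.
    acoustic_score: hypothetical callable returning the ASR model's
        log-score of the audio given the candidate text.
    acoustic_weight: assumed interpolation weight (not from the paper).
    """
    if not candidates:
        raise ValueError("at least one candidate is required")

    def combined(candidate: Tuple[str, float]) -> float:
        text, correction_logprob = candidate
        return correction_logprob + acoustic_weight * acoustic_score(text)

    best_text, _ = max(candidates, key=combined)
    return best_text


# Toy usage with made-up scores standing in for real model outputs.
toy_acoustic = {
    "i scream for ice cream": -2.0,
    "ice cream for ice cream": -5.0,
}
toy_candidates = [
    ("i scream for ice cream", -1.0),
    ("ice cream for ice cream", -0.8),
]
chosen = rescore_candidates(toy_candidates, toy_acoustic.__getitem__)
```

In the toy example the correction model alone slightly prefers the second candidate, while the acoustic score favors the first; combining the two selects the candidate that is better supported by the audio, which is one way grounding in acoustic evidence could limit hallucinated corrections.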

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing