Listen Again and Choose the Right Answer: A New Paradigm for Automatic Speech Recognition with Large Language Models

Yuchen Hu; Chen Chen; Chengwei Qin; Qiushi Zhu; Eng Siong Chng; Ruizhe Li

arXiv:2405.10025·cs.CL·May 17, 2024

TL;DR

This paper introduces ClozeGER, a novel approach for automatic speech recognition error correction that incorporates source speech into large language models and reformulates the task as a cloze test, significantly improving accuracy.

Contribution

The paper proposes ClozeGER, combining multimodal LLMs with a reformulated cloze test paradigm to address limitations of existing generative error correction methods in ASR.

Findings

01 ClozeGER outperforms vanilla GER on 9 ASR datasets.

02 Incorporating source speech improves correction fidelity.

03 Reformulating GER as a cloze test simplifies the correction process.

Abstract

Recent advances in large language models (LLMs) have promoted generative error correction (GER) for automatic speech recognition (ASR), which aims to predict the ground-truth transcription from the decoded N-best hypotheses. Thanks to the strong language generation ability of LLMs and the rich information in the N-best list, GER shows great effectiveness in enhancing ASR results. However, it still suffers from two limitations: 1) LLMs are unaware of the source speech during GER, which may lead to results that are grammatically correct but violate the source speech content; 2) N-best hypotheses usually only vary in a few tokens, making it redundant to send all of them for GER, which could confuse the LLM about which tokens to focus on and thus lead to increased miscorrection. In this paper, we propose ClozeGER, a new paradigm for ASR generative error correction. First, we introduce a multimodal…
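The second limitation above (N-best hypotheses differing in only a few tokens) is what motivates a cloze-style formulation. The following is a minimal illustrative sketch, not the paper's actual procedure: it assumes all hypotheses have the same number of whitespace-separated tokens, so they differ only by substitutions. The function name `build_cloze`, the `<blank_k>` placeholder format, and the example sentences are all hypothetical.

```python
from typing import List, Tuple


def build_cloze(hypotheses: List[str]) -> Tuple[str, List[List[str]]]:
    """Turn an N-best list into a cloze template plus per-blank options.

    Assumption (not from the paper): every hypothesis has the same number
    of whitespace-separated tokens, so they differ only by substitutions.
    """
    token_rows = [h.split() for h in hypotheses]
    if not token_rows:
        raise ValueError("need at least one hypothesis")
    length = len(token_rows[0])
    if any(len(row) != length for row in token_rows):
        raise ValueError("this sketch only handles equal-length hypotheses")

    template: List[str] = []
    options: List[List[str]] = []
    for position in zip(*token_rows):
        # Keep first-seen order while dropping duplicate candidates.
        unique = list(dict.fromkeys(position))
        if len(unique) == 1:
            # All hypotheses agree here: copy the token into the template.
            template.append(unique[0])
        else:
            # Hypotheses disagree: emit a blank and record the candidates.
            options.append(unique)
            template.append(f"<blank_{len(options)}>")
    return " ".join(template), options


# Hypothetical 3-best list from an ASR decoder.
n_best = [
    "turn on the light",
    "turn on the night",
    "turn in the light",
]
template, options = build_cloze(n_best)
print(template)
print(options)
```

In this toy form, a model only has to pick one candidate per blank instead of regenerating the whole sentence. Handling insertions and deletions would require a proper alignment step, which this sketch deliberately leaves out.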

Taxonomy

Topics: Speech Recognition and Synthesis

Methods: Graph Convolutional Network · Gait Emotion Recognition · Focus