Harnessing the Zero-Shot Power of Instruction-Tuned Large Language Model in End-to-End Speech Recognition
Yosuke Higuchi, Tetsuji Ogawa, Tetsunori Kobayashi

TL;DR
This paper introduces an end-to-end speech recognition model that uses an instruction-tuned large language model (LLM) to correct grammatical errors in ASR hypotheses and to guide text generation, yielding an approximately 13% relative reduction in word error rate (WER).
Contribution
It demonstrates how instruction-tuned LLMs can be integrated into ASR systems to enhance linguistic accuracy and overall performance.
Findings
Achieved an approximately 13% relative reduction in word error rate (WER) on major benchmarks.
Integrated the LLM as a front-end feature extractor for the decoder.
Showed that the zero-shot capability of an instruction-tuned LLM can be used to correct ASR hypotheses.
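For reference, a relative WER reduction such as the one reported above is the drop in WER divided by the baseline WER. The sketch below computes WER with a plain word-level edit distance and then the relative reduction. The function names are illustrative and the numbers in the usage note are hypothetical, not results from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer
```

With a hypothetical baseline WER of 10.0% and a hypothetical improved WER of 8.7%, `relative_reduction(10.0, 8.7)` is about 0.13, i.e. a 13% relative reduction.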
Abstract
We propose to utilize an instruction-tuned large language model (LLM) for guiding the text generation process in automatic speech recognition (ASR). Modern large language models (LLMs) are adept at performing various text generation tasks through zero-shot learning, prompted with instructions designed for specific objectives. This paper explores the potential of LLMs to derive linguistic information that can facilitate text generation in end-to-end ASR models. Specifically, we instruct an LLM to correct grammatical errors in an ASR hypothesis and use the LLM-derived representations to refine the output further. The proposed model is built on the joint CTC and attention architecture, with the LLM serving as a front-end feature extractor for the decoder. The ASR hypothesis, subject to correction, is obtained from the encoder via CTC decoding and fed into the LLM along with a specific…
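The first two steps the abstract describes, CTC decoding of the encoder output and passing the resulting hypothesis to the LLM with an instruction, can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes greedy (best-path) CTC decoding, a blank index of 0, and a toy vocabulary, and the prompt wording is hypothetical because the instruction used in the paper is not given in the excerpt above.

```python
from typing import List, Sequence

BLANK_ID = 0  # assumed index of the CTC blank symbol


def ctc_greedy_decode(frame_scores: Sequence[Sequence[float]],
                      blank_id: int = BLANK_ID) -> List[int]:
    """Best-path CTC decoding: take the argmax per frame, merge repeats, drop blanks."""
    tokens: List[int] = []
    previous = None
    for scores in frame_scores:
        best = max(range(len(scores)), key=scores.__getitem__)
        # Emit a token only when it differs from the previous frame and is not blank.
        if best != previous and best != blank_id:
            tokens.append(best)
        previous = best
    return tokens


def build_correction_prompt(hypothesis: str) -> str:
    """Wrap an ASR hypothesis in a grammar-correction instruction (hypothetical wording)."""
    return (
        "Correct the grammatical errors in the following sentence.\n"
        f"Input: {hypothesis}\n"
        "Output:"
    )


def ids_to_text(token_ids: Sequence[int], vocab: Sequence[str]) -> str:
    """Map token ids to a space-joined string using a word-level toy vocabulary."""
    return " ".join(vocab[i] for i in token_ids)
```

In the proposed model the LLM's internal representations for such a prompt, rather than its generated text, are what the decoder consumes; that part is not shown here.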
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
