Harnessing the Zero-Shot Power of Instruction-Tuned Large Language Model in End-to-End Speech Recognition

Yosuke Higuchi; Tetsuji Ogawa; Tetsunori Kobayashi

arXiv:2309.10524·eess.AS·January 8, 2025

Open Access · 1 Repo

TL;DR

This paper introduces an end-to-end speech recognition model that uses an instruction-tuned large language model to correct ASR hypotheses and guide text generation, yielding an approximately 13% relative reduction in word error rate.

Contribution

Demonstrates how an instruction-tuned LLM can be integrated into an end-to-end ASR system, as a front-end feature extractor for the decoder, to improve linguistic accuracy and overall recognition performance.

Findings

01

Achieved approximately 13% relative WER reduction on major benchmarks.

02

Integrated the LLM as a front-end feature extractor for the decoder of a joint CTC and attention model.

03

Showed that the zero-shot capability of an instruction-tuned LLM, prompted to correct grammatical errors, helps refine ASR hypotheses.
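
For readers unfamiliar with the term, a "relative WER reduction" compares the drop in word error rate against the baseline's own error rate, not the absolute difference in percentage points. The sketch below shows the arithmetic. The two WER values are made-up placeholders chosen only to illustrate a 13% relative gain; they are not results from the paper, and the function name is hypothetical.

```python
def relative_wer_reduction(baseline_wer: float, proposed_wer: float) -> float:
    """Return the relative WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - proposed_wer) / baseline_wer


# Hypothetical numbers for illustration only (NOT taken from the paper).
baseline_wer = 10.0   # percent
proposed_wer = 8.7    # percent

reduction = relative_wer_reduction(baseline_wer, proposed_wer)
print(f"relative WER reduction: {reduction:.1%}")
```

With these placeholder values the absolute gain is 1.3 percentage points, while the relative gain is about 13%.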

Abstract

We propose to utilize an instruction-tuned large language model (LLM) for guiding the text generation process in automatic speech recognition (ASR). Modern large language models (LLMs) are adept at performing various text generation tasks through zero-shot learning, prompted with instructions designed for specific objectives. This paper explores the potential of LLMs to derive linguistic information that can facilitate text generation in end-to-end ASR models. Specifically, we instruct an LLM to correct grammatical errors in an ASR hypothesis and use the LLM-derived representations to refine the output further. The proposed model is built on the joint CTC and attention architecture, with the LLM serving as a front-end feature extractor for the decoder. The ASR hypothesis, subject to correction, is obtained from the encoder via CTC decoding and fed into the LLM along with a specific…
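
The first stage described in the abstract (obtain a hypothesis from the encoder via CTC decoding, then pass it to the LLM with an instruction) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses greedy (best-path) CTC decoding, a toy character vocabulary, and a placeholder instruction string. The paper's actual prompt wording, the blank index, and the function names `ctc_greedy_decode` and `build_prompt` are all assumptions made for this example. The later step, in which LLM-derived representations are fed to the decoder, is not shown.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol


def ctc_greedy_decode(frame_scores, vocab, blank=BLANK):
    """Best-path CTC decoding: argmax per frame, merge repeats, drop blanks."""
    best_path = [
        max(range(len(frame)), key=frame.__getitem__) for frame in frame_scores
    ]
    collapsed = [idx for idx, _ in groupby(best_path)]  # merge consecutive repeats
    return "".join(vocab[idx] for idx in collapsed if idx != blank)


def build_prompt(hypothesis):
    """Wrap the hypothesis in an instruction (placeholder wording, not the paper's)."""
    return (
        "Correct the grammatical errors in the following speech recognition "
        f"hypothesis.\nHypothesis: {hypothesis}\nCorrected:"
    )


# Toy per-frame scores over a 4-symbol vocabulary (illustrative values only).
vocab = ["<blank>", "h", "i", " "]
frame_scores = [
    [0.10, 0.80, 0.05, 0.05],  # "h"
    [0.10, 0.80, 0.05, 0.05],  # "h" again, merged with the previous frame
    [0.90, 0.05, 0.03, 0.02],  # blank
    [0.10, 0.10, 0.70, 0.10],  # "i"
    [0.10, 0.10, 0.70, 0.10],  # "i" again, merged
]

hypothesis = ctc_greedy_decode(frame_scores, vocab)
prompt = build_prompt(hypothesis)
print(prompt)
```

In a full system the prompt would be tokenized and run through the LLM, and the resulting hidden representations, not generated text, would serve as the decoder's input features, in line with the "front-end feature extractor" role described above.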

Code & Models

Repositories

yosukehiguchi/espnet (PyTorch · Official)

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling