Adapting Large Language Model with Speech for Fully Formatted End-to-End Speech Recognition

Shaoshi Ling; Yuxuan Hu; Shuangbei Qian; Guoli Ye; Yao Qian; Yifan Gong; Ed Lin; Michael Zeng

arXiv:2307.08234 · eess.AS · August 4, 2023 · ICASSP · 1 citation

TL;DR

This paper proposes an approach to adapt pretrained large language models for end-to-end speech recognition, enhancing transcription readability and outperforming existing models like Whisper across multiple domains.

Contribution

It introduces a method for adapting pretrained LLMs to speech, addressing the mismatch between text-based LLMs and the language modeling used in E2E ASR, and improving fully formatted E2E ASR performance.

Findings

01 Outperforms Whisper in recognition error rate

02 Produces more readable transcriptions with punctuation and capitalization

03 Effective across various domains

Abstract

Most end-to-end (E2E) speech recognition models are composed of encoder and decoder blocks that perform acoustic and language modeling functions. Pretrained large language models (LLMs) have the potential to improve the performance of E2E ASR. However, integrating a pretrained language model into an E2E speech recognition model has shown limited benefits due to the mismatches between text-based LLMs and those used in E2E ASR. In this paper, we explore an alternative approach by adapting pretrained LLMs to speech. Our experiments on fully formatted E2E ASR transcription tasks across various domains demonstrate that our approach can effectively leverage the strengths of pretrained LLMs to produce more readable ASR transcriptions. Our model, which is based on pretrained large language models with either an encoder-decoder or decoder-only structure, surpasses strong ASR models such as…
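
The abstract evaluates transcripts that keep punctuation and capitalization ("fully formatted"). As a rough illustration of why this is a stricter target than plain ASR, the sketch below scores the same hypothesis against a reference twice: once on the formatted text, once after stripping case and punctuation. It assumes a plain word-level Levenshtein error rate. The function names, the normalization rule, and the example sentences are hypothetical, and the paper's actual scoring procedure is not given in this excerpt.

```python
import string


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(ref, hyp):
    """Edit distance divided by the number of reference words."""
    ref_tokens = ref.split()
    hyp_tokens = hyp.split()
    if not ref_tokens:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def normalize(text):
    """Hypothetical normalization: lowercase and drop ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


# Made-up example: the words are all correct, but the formatting is missing.
reference = "Hello, Mr. Smith. How are you?"
hypothesis = "hello mr smith how are you"

formatted_wer = word_error_rate(reference, hypothesis)
normalized_wer = word_error_rate(normalize(reference), normalize(hypothesis))

print(f"formatted WER:  {formatted_wer:.3f}")   # 5 of 6 tokens differ -> 0.833
print(f"normalized WER: {normalized_wer:.3f}")  # identical after normalization -> 0.000
```

In this toy case every word is recognized correctly, yet five of the six reference tokens count as errors once casing and punctuation are scored, which is the gap a fully formatted model has to close.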

Code & Models

Repositories

openai/whisper (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques