End-to-End Speech Recognition Contextualization with Large Language Models

Egor Lakomkin; Chunyang Wu; Yassir Fathullah; Ozlem Kalinli; Michael L. Seltzer; Christian Fuegen

arXiv:2309.10917 · eess.AS · September 21, 2023

Open Access

TL;DR

This paper casts speech recognition as mixed-modal language modeling on top of a pretrained large language model, so that optional textual context can guide transcription. The reported gains are a 6% WER reduction with textual context and a 17% WER improvement on rare words.

Contribution

The paper introduces a mixed-modal language modeling approach that uses pretrained LLMs with minimal additional parameters to contextualize speech recognition systems effectively.

Findings

1. 6% WER reduction with textual context
2. 7.5% overall WER improvement over baseline
3. 17% WER improvement on rare words
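
For readers unfamiliar with the metric, here is a minimal sketch of how word error rate (WER) is usually computed: the word-level edit distance between reference and hypothesis, divided by the reference length. The function names `word_error_rate` and `relative_reduction` are illustrative and do not come from the paper. This page does not say whether the percentages above are relative or absolute, so the second helper only shows what a relative reduction would look like.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer (assumed relative)."""
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    print(word_error_rate("the cat sat", "the bat sat"))
    print(relative_reduction(10.0, 9.4))
```

Under the relative reading, moving from a hypothetical baseline WER of 10.0 to 9.4 would count as a 6% reduction; these two numbers are made up for illustration and are not results from the paper.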

Abstract

In recent years, Large Language Models (LLMs) have garnered significant attention from the research community due to their exceptional performance and generalization capabilities. In this paper, we introduce a novel method for contextualizing speech recognition models by incorporating LLMs. Our approach casts speech recognition as a mixed-modal language modeling task based on a pretrained LLM. We provide audio features, along with optional text tokens for context, to train the system to complete transcriptions in a decoder-only fashion. As a result, the system is implicitly incentivized to learn how to leverage unstructured contextual information during training. Our empirical results demonstrate a significant improvement in performance, with a 6% WER reduction when additional textual context is provided. Moreover, we find that our method performs competitively and improves by 7.5% WER…
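
The decoder-only setup described in the abstract can be pictured as one flat sequence that mixes modalities. The sketch below is only an illustration under stated assumptions. The segment order (context, then audio, then transcript) and the choice to compute the loss only on transcript positions are assumptions, not details confirmed by this page. The name `build_example` and the string placeholders standing in for audio features and tokens are hypothetical.

```python
from typing import List, Optional, Tuple

Segment = Tuple[str, str]  # (modality, placeholder for an embedding or token)


def build_example(
    audio_frames: List[str],
    transcript: List[str],
    context: Optional[List[str]] = None,
) -> Tuple[List[Segment], List[int]]:
    """Lay out one training sequence and a mask marking the positions to predict.

    Assumed layout: [optional context text] [audio features] [transcript text].
    """
    context = context or []  # context is optional, as in the abstract
    sequence: List[Segment] = (
        [("text", tok) for tok in context]
        + [("audio", frame) for frame in audio_frames]
        + [("text", tok) for tok in transcript]
    )
    # Assumption: only transcript positions contribute to the training loss.
    prefix_len = len(context) + len(audio_frames)
    loss_mask = [0] * prefix_len + [1] * len(transcript)
    return sequence, loss_mask


if __name__ == "__main__":
    seq, mask = build_example(
        audio_frames=["a0", "a1"],
        transcript=["call", "Ozlem"],
        context=["contacts:", "Ozlem"],
    )
    for (modality, item), m in zip(seq, mask):
        print(modality, item, m)
```

Because the context is simply prepended as ordinary text tokens, the same function covers both the contextualized and the context-free case, which matches the abstract's description of context as optional.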

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling