End-to-End Speech Recognition Contextualization with Large Language Models
Egor Lakomkin, Chunyang Wu, Yassir Fathullah, Ozlem Kalinli, Michael L. Seltzer, Christian Fuegen

TL;DR
This paper presents a method that uses a pretrained large language model to contextualize speech recognition, yielding a 6% WER reduction when textual context is provided and a 17% WER improvement on rare words.
Contribution
The paper introduces a mixed-modal language modeling approach that uses pretrained LLMs with minimal additional parameters to contextualize speech recognition systems effectively.
Findings
6% WER reduction with textual context
7.5% overall WER improvement over baseline
17% WER improvement on rare words
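For readers unfamiliar with the metric, the sketch below shows how word error rate (WER) is conventionally computed from a word-level edit distance, and how a relative reduction between two systems is derived. The function names and example numbers are hypothetical, and this summary does not state whether the percentages above are relative or absolute; the relative form is shown only as an illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous DP row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, improved_wer: float) -> float:
    """Fraction by which improved_wer is lower than baseline_wer."""
    return (baseline_wer - improved_wer) / baseline_wer


# Hypothetical example: one substitution and one deletion over six words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Under this definition, a hypothetical system going from 10.0 to 9.4 WER would show a relative reduction of 0.06.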
Abstract
In recent years, Large Language Models (LLMs) have garnered significant attention from the research community due to their exceptional performance and generalization capabilities. In this paper, we introduce a novel method for contextualizing speech recognition models incorporating LLMs. Our approach casts speech recognition as a mixed-modal language modeling task based on a pretrained LLM. We provide audio features, along with optional text tokens for context, to train the system to complete transcriptions in a decoder-only fashion. As a result, the system is implicitly incentivized to learn how to leverage unstructured contextual information during training. Our empirical results demonstrate a significant improvement in performance, with a 6% WER reduction when additional textual context is provided. Moreover, we find that our method performs competitively and improves by 7.5% WER…
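The mixed-modal setup described in the abstract can be pictured as building one input sequence for a decoder-only model. The following is a minimal sketch under stated assumptions, not the authors' implementation: the segment order (context, then audio, then transcript), the single linear projection of audio frames, and every name (`project_frame`, `build_mixed_modal_sequence`, `embedding_table`, `audio_weights`) are hypothetical choices made for illustration.

```python
from typing import List, Optional, Sequence, Tuple

Vector = List[float]


def project_frame(frame: Sequence[float], weights: Sequence[Sequence[float]]) -> Vector:
    """Map one audio frame (d_audio) into the model space (d_model).

    `weights` is a d_audio x d_model matrix; assumed here to be a plain
    linear layer, which the source does not specify.
    """
    return [sum(f * w for f, w in zip(frame, column)) for column in zip(*weights)]


def build_mixed_modal_sequence(
    audio_frames: Sequence[Sequence[float]],
    transcript_ids: Sequence[int],
    embedding_table: Sequence[Vector],
    audio_weights: Sequence[Sequence[float]],
    context_ids: Optional[Sequence[int]] = None,
) -> Tuple[List[Vector], List[bool]]:
    """Concatenate optional context, audio, and transcript into one sequence.

    Returns the input vectors and a mask that is True only on transcript
    positions, so a training loss would be applied to the transcription
    and not to the context or audio prefix (a simplifying assumption).
    """
    # Optional text context is embedded like any other token.
    prefix = [list(embedding_table[t]) for t in (context_ids or [])]
    # Audio frames are projected into the same space as token embeddings.
    prefix += [project_frame(frame, audio_weights) for frame in audio_frames]
    # The transcript is what the decoder is trained to complete.
    target = [list(embedding_table[t]) for t in transcript_ids]

    sequence = prefix + target
    loss_mask = [False] * len(prefix) + [True] * len(target)
    return sequence, loss_mask


# Toy example: 3-token vocabulary, 3-dim audio frames, 2-dim model space.
toy_embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
toy_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_sequence, toy_mask = build_mixed_modal_sequence(
    audio_frames=[[1.0, 2.0, 3.0]],
    transcript_ids=[2],
    embedding_table=toy_embeddings,
    audio_weights=toy_weights,
    context_ids=[1],
)
```

Because the context is optional, the same function covers both the contextualized and the plain case; omitting `context_ids` simply shortens the prefix.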
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
