Transducer-Llama: Integrating LLMs into Streamable Transducer-based Speech Recognition

Keqi Deng; Jinxi Guo; Yingyi Ma; Niko Moritz; Philip C. Woodland; Ozlem Kalinli; Mike Seltzer

arXiv:2412.16464 · cs.CL · December 24, 2024

Open Access

TL;DR

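This paper introduces Transducer-Llama, a streaming speech recognition model that integrates a large language model into a Factorized Transducer. Combined with a vocabulary adaptation technique and a weak-to-strong LM swap strategy, it achieves a 17% relative WER reduction over the baseline and a 32% relative WER reduction over RNN-T.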

Contribution

The paper presents an architecture that combines LLMs with transducer models for streaming ASR, and proposes an efficient vocabulary adaptation technique and a weak-to-strong LM swap strategy.

Findings

01. 17% relative WER reduction over baseline
02. 32% relative WER reduction over RNN-T
03. Effective integration of LLMs into streaming ASR
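
The relative WER reduction figures above follow the usual definition: the drop in WER divided by the baseline WER. A minimal sketch of that arithmetic is given here. The function name and the example WER values are hypothetical, chosen only to illustrate the calculation; they are not results reported by the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical WER values (in percent), NOT taken from the paper.
example_baseline_wer = 10.0
example_new_wer = 8.3

werr = relative_wer_reduction(example_baseline_wer, example_new_wer)
print(f"Relative WER reduction: {werr:.0%}")
```

Note that a relative reduction is not an absolute one: in this made-up case the absolute WER falls by 1.7 points, which is 17% of the baseline.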

Abstract

While large language models (LLMs) have been applied to automatic speech recognition (ASR), the task of making the model streamable remains a challenge. This paper proposes a novel model architecture, Transducer-Llama, that integrates LLMs into a Factorized Transducer (FT) model, naturally enabling streaming capabilities. Furthermore, given that the large vocabulary of LLMs can cause data sparsity issues and increased training costs for spoken language systems, this paper introduces an efficient vocabulary adaptation technique to align LLMs with speech system vocabularies. The results show that directly optimizing the FT model with a strong pre-trained LLM-based predictor using the RNN-T loss yields some but limited improvements over a smaller pre-trained LM predictor. Therefore, this paper proposes a weak-to-strong LM swap strategy, using a weak LM predictor during RNN-T loss training…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing

Methods: ALIGN