Transducer-Llama: Integrating LLMs into Streamable Transducer-based Speech Recognition
Keqi Deng, Jinxi Guo, Yingyi Ma, Niko Moritz, Philip C. Woodland, Ozlem Kalinli, Mike Seltzer

TL;DR
This paper introduces Transducer-Llama, a streaming speech recognition model that integrates large language models into a transducer, using a vocabulary adaptation technique and a weak-to-strong LM swap strategy. It reports a 17% relative WER reduction over the baseline and 32% over RNN-T.
Contribution
It presents a new architecture combining LLMs with transducer models for streaming ASR and proposes an efficient vocabulary adaptation and LM swapping technique.
Findings
17% relative WER reduction over baseline
32% relative WER reduction over RNN-T
Effective integration of LLMs into streaming ASR
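
The reductions above are relative, meaning each is a fraction of the baseline's WER rather than an absolute percentage-point drop. A minimal sketch of that arithmetic follows. The function name and the WER values are hypothetical, chosen only to illustrate the formula, and are not results from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction: (baseline - new) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers: a baseline at 10.0% WER and a new system at 8.3% WER.
werr = relative_wer_reduction(10.0, 8.3)
print(f"relative WER reduction: {werr:.0%}")
```

With these made-up values the absolute drop is 1.7 percentage points, while the relative reduction is about 17%.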
Abstract
While large language models (LLMs) have been applied to automatic speech recognition (ASR), making the model streamable remains a challenge. This paper proposes a novel model architecture, Transducer-Llama, that integrates LLMs into a Factorized Transducer (FT) model, naturally enabling streaming capabilities. Furthermore, given that the large vocabulary of LLMs can cause data sparsity issues and increased training costs for spoken language systems, this paper introduces an efficient vocabulary adaptation technique to align LLMs with speech system vocabularies. The results show that directly optimizing the FT model with a strong pre-trained LLM-based predictor using the RNN-T loss yields some but limited improvements over a smaller pre-trained LM predictor. Therefore, this paper proposes a weak-to-strong LM swap strategy, using a weak LM predictor during RNN-T loss training…
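
To make the Factorized Transducer idea more concrete, here is a rough sketch of how one output step could be scored when the vocabulary predictor is a standalone language model. It follows a commonly described factorized-transducer layout: a separate blank score, plus vocabulary scores formed by adding acoustic and LM log-probabilities. The function names, the inputs, and the exact combination rule are assumptions for illustration, not details taken from this abstract.

```python
import math


def log_softmax(xs):
    """Numerically stable log-softmax over a list of floats."""
    m = max(xs)
    lse = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - lse for x in xs]


def ft_step(enc_vocab_logits, lm_logits, blank_logit):
    """Score one (frame, label-history) pair; returns log-probs over [blank] + vocab.

    enc_vocab_logits: per-token logits projected from the acoustic encoder (assumed).
    lm_logits: per-token logits from the LM predictor, which sees only past labels.
    blank_logit: scalar from a separate blank branch (assumed to be precomputed).
    """
    acoustic = log_softmax(enc_vocab_logits)
    lm = log_softmax(lm_logits)
    # Vocabulary scores add acoustic and LM log-probabilities token by token.
    vocab_scores = [a + l for a, l in zip(acoustic, lm)]
    # Blank is scored separately, then everything is normalized jointly.
    return log_softmax([blank_logit] + vocab_scores)


# Toy example: flat acoustic evidence, LM prefers token index 1 of a 3-token vocabulary.
log_probs = ft_step([0.0, 0.0, 0.0], [0.0, 3.0, 0.0], blank_logit=-1.0)
best_non_blank = max(range(1, len(log_probs)), key=lambda i: log_probs[i]) - 1
```

Because the LM branch only sees previous labels, it behaves like an ordinary language model. In this sketch that separation is what would allow a different LM, such as a stronger LLM over the same vocabulary, to be substituted for `lm_logits` without changing the rest of the scoring.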
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing
Methods: ALIGN
