Faster Speech-LLaMA Inference with Multi-token Prediction
Desh Raj, Gil Keren, Junteng Jia, Jay Mahadeokar, Ozlem Kalinli

TL;DR
This paper speeds up Speech-LLaMA inference by predicting multiple tokens in each decoding step, cutting the number of decoder calls by roughly 3.2x while maintaining or improving speech recognition accuracy (WER).
Contribution
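It explores several model architectures that enable multi-token prediction in Speech-LLaMA, evaluates them with threshold-based and verification-based inference strategies, and proposes a prefix-based beam search decoding method that allows efficient minimum word error rate (MWER) training.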
Findings
Decoder calls reduced by approximately 3.2x
Maintains or improves word error rate (WER) performance
Multi-token prediction evaluated with threshold-based and verification-based inference strategies
Abstract
Large language models (LLMs) have become proficient at solving a wide variety of tasks, including those involving multi-modal inputs. In particular, instantiating an LLM (such as LLaMA) with a speech encoder and training it on paired data imparts speech recognition (ASR) abilities to the decoder-only model, hence called Speech-LLaMA. Nevertheless, due to the sequential nature of auto-regressive inference and the relatively large decoder, Speech-LLaMA models require relatively high inference time. In this work, we propose to speed up Speech-LLaMA inference by predicting multiple tokens in the same decoding step. We explore several model architectures that enable this, and investigate their performance using threshold-based and verification-based inference strategies. We also propose a prefix-based beam search decoding method that allows efficient minimum word error rate (MWER) training…
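To make the idea of threshold-based multi-token inference concrete, here is a minimal toy sketch. It is not the paper's implementation: the `predict_heads` interface, the `make_toy_heads` helper, the probabilities, and the acceptance rule (always keep the first predicted token, keep each later one only while its confidence stays at or above a threshold) are all assumptions chosen for illustration. The sketch only shows how accepting several tokens per decoder call reduces the number of calls.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: one "decoder call" takes the current prefix and
# returns (token, confidence) guesses for the next few positions.
Heads = Callable[[List[int]], List[Tuple[int, float]]]


def threshold_decode(
    predict_heads: Heads, eos: int, threshold: float = 0.9, max_len: int = 50
) -> Tuple[List[int], int]:
    """Greedy decoding that may accept several tokens per decoder call.

    The first guess is always accepted; each later guess is accepted only
    while its confidence is >= threshold (an assumed rule for this sketch).
    Returns the decoded tokens and the number of decoder calls used.
    """
    tokens: List[int] = []
    calls = 0
    while len(tokens) < max_len:
        guesses = predict_heads(tokens)
        calls += 1
        if not guesses:  # nothing left to predict
            break
        for i, (tok, prob) in enumerate(guesses):
            if i > 0 and prob < threshold:
                break  # stop at the first low-confidence extra token
            tokens.append(tok)
            if tok == eos or len(tokens) >= max_len:
                return tokens, calls
    return tokens, calls


def make_toy_heads(target: List[int], probs: List[float]) -> Heads:
    """Toy stand-in for a model: always guesses the true next tokens,
    with a fixed confidence per look-ahead position."""
    def heads(prefix: List[int]) -> List[Tuple[int, float]]:
        start = len(prefix)
        return list(zip(target[start:start + len(probs)], probs))
    return heads


TARGET = [5, 6, 7, 8, 9, 0]  # 0 plays the role of end-of-sequence
toy_heads = make_toy_heads(TARGET, probs=[1.0, 0.95, 0.5])

multi_tokens, multi_calls = threshold_decode(toy_heads, eos=0, threshold=0.9)
# A threshold of 1.0 rejects every extra token here, which mimics
# ordinary one-token-per-call auto-regressive decoding.
base_tokens, base_calls = threshold_decode(toy_heads, eos=0, threshold=1.0)

print(multi_calls, base_calls)
```

In this toy setup both runs produce the same token sequence, but the thresholded run needs fewer decoder calls because it accepts two tokens per call. A verification-based strategy would instead check the extra tokens against the model's own next-step predictions before keeping them; that variant is not shown here.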
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems
