Faster Speech-LLaMA Inference with Multi-token Prediction

Desh Raj; Gil Keren; Junteng Jia; Jay Mahadeokar; Ozlem Kalinli

arXiv:2409.08148 · eess.AS · September 13, 2024

TL;DR

This paper speeds up Speech-LLaMA inference by predicting multiple tokens in each decoding step, cutting the number of decoder calls by approximately 3.2x while maintaining or improving speech recognition accuracy.

Contribution

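The paper explores several model architectures that enable multi-token prediction in Speech-LLaMA, evaluates them with threshold-based and verification-based inference strategies, and proposes a prefix-based beam search decoding method that allows efficient minimum word error rate (MWER) training.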

Findings

01 Decoder calls reduced by approximately 3.2x

02 Maintains or improves word error rate (WER) performance

03 Effective multi-token prediction strategies demonstrated

Abstract

Large language models (LLMs) have become proficient at solving a wide variety of tasks, including those involving multi-modal inputs. In particular, instantiating an LLM (such as LLaMA) with a speech encoder and training it on paired data imparts speech recognition (ASR) abilities to the decoder-only model, hence called Speech-LLaMA. Nevertheless, due to the sequential nature of auto-regressive inference and the relatively large decoder, Speech-LLaMA models require relatively high inference time. In this work, we propose to speed up Speech-LLaMA inference by predicting multiple tokens in the same decoding step. We explore several model architectures that enable this, and investigate their performance using threshold-based and verification-based inference strategies. We also propose a prefix-based beam search decoding method that allows efficient minimum word error rate (MWER) training…
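The threshold-based strategy mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the details, so everything here is an assumption for illustration: each decoder call is assumed to return one (token, confidence) pair per prediction head, the first head's token is always kept, and later heads are kept only while their confidence stays at or above a threshold. The names `accept_by_threshold`, `decode`, and `toy_step` are hypothetical, and `toy_step` stands in for a real model.

```python
from typing import Callable, List, Tuple

# Assumption: one decoder call yields a (token, confidence) pair per head.
HeadOutputs = List[Tuple[int, float]]


def accept_by_threshold(head_outputs: HeadOutputs, threshold: float) -> List[int]:
    """Keep the first head's token, then later heads while confidence >= threshold."""
    if not head_outputs:
        return []
    accepted = [head_outputs[0][0]]
    for token, confidence in head_outputs[1:]:
        if confidence < threshold:
            break  # stop at the first head that is not confident enough
        accepted.append(token)
    return accepted


def decode(
    step_fn: Callable[[List[int]], HeadOutputs],
    eos: int,
    threshold: float,
    max_calls: int = 100,
) -> Tuple[List[int], int]:
    """Greedy multi-token decoding; returns the tokens and the number of decoder calls."""
    tokens: List[int] = []
    calls = 0
    while calls < max_calls:
        outputs = step_fn(tokens)
        calls += 1
        for token in accept_by_threshold(outputs, threshold):
            tokens.append(token)
            if token == eos:
                return tokens, calls
    return tokens, calls


# Hypothetical stand-in for a model with three heads and fixed confidences.
TARGET = [5, 6, 7, 8, 9, 0]  # 0 plays the role of end-of-sequence
CONFS = [1.0, 0.9, 0.5]


def toy_step(prefix: List[int]) -> HeadOutputs:
    start = len(prefix)
    return list(zip(TARGET[start:start + 3], CONFS))


if __name__ == "__main__":
    for thr in (1.1, 0.8, 0.4):
        out, n_calls = decode(toy_step, eos=0, threshold=thr)
        print(f"threshold={thr}: tokens={out}, decoder calls={n_calls}")
```

In this toy setup a threshold above every confidence falls back to one token per call, while lower thresholds accept more tokens per call and so need fewer decoder calls. A verification-based strategy would instead check the extra tokens against the model's own next-token predictions before keeping them; that variant is not sketched here.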

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems