EMS-SD: Efficient Multi-sample Speculative Decoding for Accelerating Large Language Models

Yunsheng Ni; Chuanjian Liu; Yehui Tang; Kai Han; Yunhe Wang

arXiv:2405.07542 · cs.CL · October 15, 2024

TL;DR

This paper introduces EMS-SD, a multi-sample speculative decoding method that speeds up large language model inference by handling the differing numbers of tokens accepted by each sample in a batch, without padding tokens or extra compute and memory overhead.

Contribution

The paper presents a new approach to multi-sample speculative decoding that eliminates the need for padding tokens, reducing computational and memory overhead.

Findings

01. Significant speedup in LLM inference demonstrated

02. Reduced computational overhead compared to vanilla methods

03. Effective handling of token acceptance inconsistencies

Abstract

Speculative decoding has emerged as a pivotal technique for enhancing the inference speed of Large Language Models (LLMs). Despite recent research aiming to improve prediction efficiency, multi-sample speculative decoding has been overlooked because the number of accepted tokens varies across samples within a batch during the verification phase. The vanilla method adds padding tokens to ensure that the number of new tokens remains consistent across samples. However, this increases the computational and memory access overhead, thereby reducing the speedup ratio. We propose a novel method that resolves the issue of different samples accepting different numbers of tokens without increasing memory or computing overhead. Furthermore, our proposed method can handle the situation where the numbers of prediction tokens differ across samples without the need to add padding tokens. Sufficient…
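
To make the padding cost described in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical batch in which each sample accepts a different number of draft tokens in one verification step, and compares how many token positions a pad-to-longest scheme would process against the number of tokens actually accepted. The function names and the example counts are illustrative assumptions.

```python
from typing import List


def padded_token_count(accepted: List[int]) -> int:
    """Positions processed when every sample is padded to the longest acceptance.

    `accepted` must be non-empty; accepted[i] is the number of tokens
    accepted by sample i in one verification step (hypothetical input).
    """
    return max(accepted) * len(accepted)


def unpadded_token_count(accepted: List[int]) -> int:
    """Positions that carry real tokens, i.e. the work a padding-free scheme needs."""
    return sum(accepted)


def padding_overhead(accepted: List[int]) -> float:
    """Fraction of padded positions that hold padding rather than real tokens."""
    padded = padded_token_count(accepted)
    return (padded - unpadded_token_count(accepted)) / padded


# Hypothetical acceptance counts for a batch of four samples.
accepted_per_sample = [1, 4, 2, 3]
print(padded_token_count(accepted_per_sample))    # 16
print(unpadded_token_count(accepted_per_sample))  # 10
print(padding_overhead(accepted_per_sample))      # 0.375
```

In this toy batch, 6 of the 16 padded positions are padding. The wider the spread of acceptance counts within a batch, the larger this fraction becomes, which is the overhead the abstract attributes to the vanilla method.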

Code & Models

Repositories

niyunsheng/ems-sd (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings