Speculative Decoding via Early-exiting for Faster LLM Inference with Thompson Sampling Control Mechanism

Jiahao Liu; Qifan Wang; Jingang Wang; Xunliang Cai

arXiv:2406.03853 · cs.CL · June 7, 2024

TL;DR

This paper introduces Early-exiting Speculative Decoding (EESD), which drafts tokens with the first N layers of the LLM and uses Thompson Sampling to decide how many tokens to draft in each round, accelerating inference without loss in output quality.

Contribution

The paper presents a new early-exiting speculative decoding method combined with Thompson Sampling for dynamic draft token regulation, improving inference speed and efficiency.

Findings

01 Decodes tokens significantly faster than previous methods.

02 Maintains output quality comparable to standard decoding.

03 Effective on 13B and 70B LLMs.

Abstract

The recent advancements in large language models (LLMs) have been extraordinary, yet the escalating inference costs associated with them present challenges in real-world applications. To address these challenges, we propose a novel approach called Early-exiting Speculative Decoding (EESD) with lossless acceleration. Specifically, EESD utilizes a segment of the LLM to generate draft tokens, incorporating Early-exiting structures after the first N layers. To enhance the quality of draft tokens, a self-distillation method is integrated. This early-exiting design not only reduces deployment and training costs but also significantly accelerates the token generation speed. Moreover, we introduce a novel sampling mechanism that leverages Thompson Sampling to regulate the generation processes, automatically determining the quantity of draft tokens in each round. The original LLM is then…
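
As a rough illustration of how Thompson Sampling could regulate the number of draft tokens per round, the sketch below keeps a Beta posterior over the chance that a draft token is accepted and samples from it to decide when to stop drafting. It is not the paper's implementation: the Beta-Bernoulli model, the class `BetaBernoulliDraftController`, the `max_draft` cap, and the simulated verifier in `simulate` are all assumptions made for this example.

```python
import random


class BetaBernoulliDraftController:
    """Hypothetical controller: Thompson Sampling over one acceptance rate."""

    def __init__(self, alpha=1.0, beta=1.0, max_draft=8, seed=None):
        self.alpha = alpha          # pseudo-count of accepted draft tokens
        self.beta = beta            # pseudo-count of rejected draft tokens
        self.max_draft = max_draft  # assumed hard cap on draft length
        self.rng = random.Random(seed)

    def posterior_mean(self):
        return self.alpha / (self.alpha + self.beta)

    def choose_draft_length(self):
        # Sample one plausible acceptance rate from the current posterior.
        theta = self.rng.betavariate(self.alpha, self.beta)
        # Always draft at least one token, then keep drafting while a
        # Bernoulli(theta) draw says the next token is likely to be accepted.
        k = 1
        while k < self.max_draft and self.rng.random() < theta:
            k += 1
        return k

    def update(self, accepted, rejected):
        # Conjugate Beta update from the verification outcome of one round.
        self.alpha += accepted
        self.beta += rejected


def simulate(true_accept_rate, rounds, seed=0):
    """Run the controller against a simulated verifier (assumption: each
    draft token is accepted independently with probability true_accept_rate)."""
    ctrl = BetaBernoulliDraftController(seed=seed)
    env = random.Random(seed + 1)
    lengths = []
    for _ in range(rounds):
        k = ctrl.choose_draft_length()
        accepted = 0
        for _ in range(k):
            if env.random() < true_accept_rate:
                accepted += 1
            else:
                break  # verification stops at the first rejected token
        rejected = 1 if accepted < k else 0
        ctrl.update(accepted, rejected)
        lengths.append(k)
    return ctrl, lengths


if __name__ == "__main__":
    for rate in (0.2, 0.9):
        ctrl, lengths = simulate(rate, rounds=500)
        avg = sum(lengths) / len(lengths)
        print(f"accept rate {rate}: posterior mean {ctrl.posterior_mean():.2f}, "
              f"average draft length {avg:.2f}")
```

In this toy setup the controller drafts longer sequences when draft tokens are usually accepted and shorter ones when they are usually rejected, which is the kind of automatic adjustment the abstract attributes to the Thompson Sampling control mechanism.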

Taxonomy

Topics: Natural Language Processing Techniques · Algorithms and Data Compression