SpecHub: Provable Acceleration to Multi-Draft Speculative Decoding

Ryan Sun; Tianyi Zhou; Xun Chen; Lichao Sun

arXiv:2411.05289·cs.CL·November 11, 2024

TL;DR

SpecHub introduces an efficient, provably accelerated sampling method for Multi-Draft Speculative Decoding in large language models, significantly improving inference speed with minimal computational overhead.

Contribution

The paper presents a linear programming-based approach that raises acceptance rates in Multi-Draft Speculative Decoding (MDSD). It outperforms heuristic methods such as Recursive Rejection Sampling (RRS) while adding only linear computational overhead.

Findings

01. SpecHub generates more tokens per step than RRS.

02. SpecHub maintains high acceptance rates with linear overhead.

03. SpecHub accelerates inference in large language models.

Abstract

Large Language Models (LLMs) have become essential in advancing natural language processing (NLP) tasks, but their sequential token generation limits inference speed. Multi-Draft Speculative Decoding (MDSD) offers a promising solution by using a smaller draft model to generate multiple token sequences, which the target LLM verifies in parallel. However, current heuristic approaches, such as Recursive Rejection Sampling (RRS), suffer from low acceptance rates in subsequent drafts, limiting the advantages of using multiple drafts. Meanwhile, Optimal Transport with Membership Cost (OTM) can theoretically improve acceptance rates, but its computational cost is too high for real-time use. We present SpecHub, a novel, efficient sampling-verification method for MDSD that improves acceptance rates with only linear computational overhead. By simplifying the OTM problem into a compact Linear…
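
The abstract names Recursive Rejection Sampling (RRS) as the heuristic baseline. As a rough illustration only, and not the paper's or the repository's code, a minimal Python sketch of the with-replacement variant might look like the following. The function names `residual` and `rrs_verify`, the list-based distributions, and the assumption that all drafts are drawn i.i.d. from a single draft distribution `q` are hypothetical choices made for this example.

```python
import random


def residual(p, q):
    """Normalized positive part of p - q: the target to use after a rejection."""
    diff = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    total = sum(diff)
    # total == 0 only when p == q, in which case a draft is always accepted
    # and this fallback is never actually sampled from.
    return [d / total for d in diff] if total > 0 else list(p)


def rrs_verify(p, q, drafts, rng=random):
    """Verify draft tokens one by one against target p (illustrative sketch).

    p, q   : target and draft distributions over the same vocabulary (lists).
    drafts : token ids assumed to be sampled i.i.d. from q.
    Returns (token, index of the accepted draft, or None if all were rejected).
    """
    p = list(p)
    for i, x in enumerate(drafts):
        # Accept draft x with probability min(1, p[x] / q[x]).
        if q[x] > 0 and rng.random() < min(1.0, p[x] / q[x]):
            return x, i
        # On rejection, continue with the residual distribution as the target.
        p = residual(p, q)
    # Every draft was rejected: sample from the final residual.
    token = rng.choices(range(len(p)), weights=p, k=1)[0]
    return token, None


if __name__ == "__main__":
    p = [0.6, 0.3, 0.1]   # toy target distribution (made-up numbers)
    q = [0.2, 0.5, 0.3]   # toy draft distribution (made-up numbers)
    drafts = random.choices(range(3), weights=q, k=2)
    print(rrs_verify(p, q, drafts))
```

In this sketch each rejection moves the target toward tokens that the draft distribution under-weights, while later drafts are still drawn from `q`. That is one way to see why acceptance can drop for subsequent drafts, which is the limitation the abstract attributes to RRS.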

Code & Models

Repositories

mastergodzilla/speculative_decoding_ot (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Optimal Transport Modeling