Accelerating Production LLMs with Combined Token/Embedding Speculators

Davis Wertheimer; Joshua Rosenkranz; Thomas Parnell; Sahil Suneja; Pavithra Ranganathan; Raghu Ganti; Mudhakar Srivatsa

arXiv:2404.19124·cs.CL·June 10, 2024

TL;DR

This paper introduces speculative decoding draft models that let the base model produce several tokens per forward pass, speeding up wall-clock inference of large language models in production settings by a factor of 2-3.

Contribution

The draft models condition their predictions on both context vectors and sampled tokens, which lets them efficiently predict high-quality n-grams that the base model then accepts or rejects.

Findings

01. Achieved a 2-3x wall-clock inference speedup on large language models.

02. Demonstrated effective training of speculative draft models for production.

03. Outlined future directions for further improvements.

Abstract

This technical report describes the design and training of novel speculative decoding draft models for accelerating the inference speeds of large language models in a production environment. By conditioning draft predictions on both context vectors and sampled tokens, we can train our speculators to efficiently predict high-quality n-grams, which the base model then accepts or rejects. This allows us to effectively predict multiple tokens per inference forward pass, accelerating wall-clock inference speeds of highly optimized base model implementations by a factor of 2-3. We explore these initial results and describe next steps for further improvements.
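
The accept-or-reject loop described in the abstract can be illustrated with a small toy. This is a minimal sketch under stated assumptions, not the authors' implementation: `toy_base_next` and `toy_draft` are hypothetical stand-ins for the base model and the speculator, the acceptance rule is plain greedy matching (the abstract only says the base model "accepts or rejects"), and the toy draft sees tokens only, whereas the paper's speculators also condition on context vectors. A real system would verify the whole draft in one batched forward pass; the loop here simulates that check token by token.

```python
from typing import Callable, List, Sequence, Tuple

Token = int
NextTokenFn = Callable[[Sequence[Token]], Token]
DraftFn = Callable[[Sequence[Token], int], List[Token]]


def verify_draft(base_next: NextTokenFn, context: List[Token],
                 draft: List[Token]) -> List[Token]:
    """Return the tokens kept from one speculative step (greedy matching)."""
    accepted: List[Token] = []
    for proposed in draft:
        expected = base_next(context + accepted)
        if proposed != expected:
            # First mismatch: keep the base model's token and drop the rest.
            accepted.append(expected)
            return accepted
        accepted.append(proposed)
    # Whole draft accepted: the base model's next prediction is a bonus token.
    accepted.append(base_next(context + accepted))
    return accepted


def speculative_generate(base_next: NextTokenFn, draft_fn: DraftFn,
                         prompt: List[Token], max_new_tokens: int,
                         n: int = 3) -> Tuple[List[Token], int]:
    """Generate tokens; also return the number of verification steps used."""
    tokens = list(prompt)
    steps = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        draft = draft_fn(tokens, n)
        tokens.extend(verify_draft(base_next, tokens, draft))
        steps += 1
    return tokens[:len(prompt) + max_new_tokens], steps


def toy_base_next(context: Sequence[Token]) -> Token:
    """Hypothetical base model: counts upward modulo 10."""
    return (context[-1] + 1) % 10


def toy_draft(context: Sequence[Token], n: int) -> List[Token]:
    """Hypothetical speculator: agrees with the base except at multiples of 5."""
    out: List[Token] = []
    last = context[-1]
    for _ in range(n):
        nxt = (last + 1) % 10
        guess = nxt if nxt % 5 != 0 else (nxt + 1) % 10  # deliberate error
        out.append(guess)
        last = guess
    return out


def greedy_generate(base_next: NextTokenFn, prompt: List[Token],
                    max_new_tokens: int) -> List[Token]:
    """Baseline: one base-model call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(base_next(tokens))
    return tokens
```

Under this greedy rule the speculative output is identical to ordinary greedy decoding; the draft only changes how many verification steps are needed. Calling `speculative_generate(toy_base_next, toy_draft, [0], 12)` and comparing against `greedy_generate(toy_base_next, [0], 12)` shows the same token sequence reached in fewer steps than tokens generated. The speedup in a real deployment depends on the draft's acceptance rate and its cost relative to the base model, neither of which this toy measures.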

Code & Models

Repositories

foundation-model-stack/fms-fsdp
pytorch · Official

Taxonomy

Topics: Scheduling and Optimization Algorithms · Advanced Manufacturing and Logistics Optimization · Digital Rights Management and Security

Methods: Balanced Selection