AMUSD: Asynchronous Multi-Device Speculative Decoding for LLM Acceleration

Bradley McDanel

arXiv:2410.17375 · cs.CL · October 24, 2024

TL;DR

AMUSD is an asynchronous, multi-device speculative decoding system that accelerates large language model generation by letting the draft and verify models run independently on separate devices. It reports a 29% average speedup over conventional speculative decoding and up to 1.96× over autoregressive decoding, with identical output quality.

Contribution

This work presents AMUSD, a novel asynchronous multi-device speculative decoding system that decouples draft and verify phases for faster LLM inference.
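To make the decoupling concrete, here is a minimal sketch of the control flow, not the author's implementation. It assumes greedy decoding and uses two Python threads in place of two devices. `draft_next`, `verify_next`, `SharedState`, and `max_ahead` are hypothetical names invented for this illustration, and the toy models are simple arithmetic functions. The draft thread keeps appending speculative tokens while the verify thread checks them one position at a time and rolls back the speculative tail on a mismatch.

```python
import threading
import time
from typing import List


def verify_next(ctx: List[int]) -> int:
    """Toy stand-in for the large verify model (greedy next token)."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_next(ctx: List[int]) -> int:
    """Toy stand-in for the small draft model: wrong on every 4th position."""
    token = verify_next(ctx)
    return token if len(ctx) % 4 else (token + 1) % 50


def greedy_reference(prompt: List[int], n: int) -> List[int]:
    """Plain autoregressive decoding with the verify model."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(verify_next(seq))
    return seq


class SharedState:
    def __init__(self, prompt: List[int]) -> None:
        self.seq = list(prompt)        # validated prefix + speculative tail
        self.validated = len(prompt)   # number of tokens confirmed so far
        self.version = 0               # bumped on every rollback
        self.lock = threading.Lock()
        self.done = threading.Event()


def draft_worker(state: SharedState, max_ahead: int) -> None:
    while not state.done.is_set():
        with state.lock:
            ahead = len(state.seq) - state.validated
            snapshot, version = list(state.seq), state.version
        if ahead >= max_ahead:         # do not speculate too far ahead
            time.sleep(0.0005)
            continue
        token = draft_next(snapshot)   # computed without holding the lock
        with state.lock:
            if state.version == version:   # discard work invalidated by a rollback
                state.seq.append(token)


def verify_worker(state: SharedState, target: int) -> None:
    while state.validated < target:
        with state.lock:
            prefix = state.seq[:state.validated]
        expected = verify_next(prefix)  # computed without holding the lock
        with state.lock:
            pos = state.validated
            if not (len(state.seq) > pos and state.seq[pos] == expected):
                # Missing or wrong draft token: drop the tail and correct it.
                del state.seq[pos:]
                state.seq.append(expected)
                state.version += 1
            state.validated = pos + 1
    state.done.set()


def async_speculative(prompt: List[int], n: int, max_ahead: int = 8) -> List[int]:
    state = SharedState(prompt)
    target = len(prompt) + n
    workers = [
        threading.Thread(target=draft_worker, args=(state, max_ahead)),
        threading.Thread(target=verify_worker, args=(state, target)),
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return state.seq[:target]
```

Under these assumptions `async_speculative(prompt, n)` returns the same tokens as `greedy_reference(prompt, n)`, because every position is confirmed by the verify function before it counts as validated. Python threads share one interpreter lock, so this sketch shows the coordination pattern only and says nothing about real speedups. A real system would also verify several draft tokens per forward pass rather than one.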

Findings

01. 29% average speedup over speculative decoding

02. Up to 1.96× faster than autoregressive decoding

03. Maintains identical output quality

Abstract

Large language models typically generate tokens autoregressively, using each token as input for the next. Recent work on Speculative Decoding has sought to accelerate this process by employing a smaller, faster draft model to more quickly generate candidate tokens. These candidates are then verified in parallel by the larger (original) verify model, resulting in overall speedup compared to using the larger model by itself in an autoregressive fashion. In this work, we introduce AMUSD (Asynchronous Multi-device Speculative Decoding), a system that further accelerates generation by decoupling the draft and verify phases into a continuous, asynchronous approach. Unlike conventional speculative decoding, where only one model (draft or verify) performs token generation at a time, AMUSD enables both models to perform predictions independently on separate devices (e.g., GPUs). We evaluate our…
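The conventional draft-then-verify loop that the abstract contrasts AMUSD with can be sketched as follows. This is an illustrative toy under stated assumptions, not code from the paper or its repository: it uses greedy decoding, `draft_next` and `verify_next` are hypothetical arithmetic stand-ins for the two models, and `k` (the number of tokens drafted per round) is an assumed parameter. Only one model works at a time here, which is the idle time AMUSD aims to remove.

```python
from typing import Callable, List

Model = Callable[[List[int]], int]  # maps a context to its greedy next token


def verify_next(ctx: List[int]) -> int:
    """Toy stand-in for the large verify model."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_next(ctx: List[int]) -> int:
    """Toy stand-in for the small draft model: wrong on every 4th position."""
    token = verify_next(ctx)
    return token if len(ctx) % 4 else (token + 1) % 50


def autoregressive(model: Model, prompt: List[int], n: int) -> List[int]:
    """Baseline: one token per call, each token fed back as input."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(model(seq))
    return seq


def speculative(draft: Model, verify: Model, prompt: List[int],
                n: int, k: int = 4) -> List[int]:
    """Synchronous speculative decoding with greedy verification."""
    seq = list(prompt)
    target = len(prompt) + n
    while len(seq) < target:
        # Draft phase: the small model proposes k candidate tokens.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(seq + proposal))

        # Verify phase: a real system scores all k positions in one
        # batched forward pass; this toy loops over them instead.
        accepted: List[int] = []
        for candidate in proposal:
            expected = verify(seq + accepted)
            accepted.append(expected)   # always keep the verify model's token
            if expected != candidate:   # first mismatch ends the round
                break
        seq.extend(accepted)
    return seq[:target]
```

Because every kept token is the verify model's own greedy choice, `speculative(draft_next, verify_next, prompt, n)` matches `autoregressive(verify_next, prompt, n)` in this toy setting. Any speedup in practice comes from the verify model checking several positions in a single pass.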

Code & Models

Repositories

bradmcdanel/amusd (PyTorch, official)
