AMUSD: Asynchronous Multi-Device Speculative Decoding for LLM Acceleration
Bradley McDanel

TL;DR
AMUSD is an asynchronous, multi-device speculative decoding system that accelerates large language model generation by running the draft and verify models independently on separate devices. It reports a 29% average speedup over conventional speculative decoding and up to 1.96× over autoregressive decoding, with identical output quality.
Contribution
This work presents AMUSD, a novel asynchronous multi-device speculative decoding system that decouples draft and verify phases for faster LLM inference.
Findings
29% average speedup over speculative decoding
Up to 1.96× faster than autoregressive decoding
Maintains identical output quality
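To put the two headline figures on a common footing, the snippet below converts a speedup factor into the fraction of wall-clock time saved. It assumes that "29% speedup" means a 1.29× throughput ratio, which this summary does not spell out; `time_saved` is a hypothetical helper name used only for illustration. Note that the two figures have different baselines and one is an average while the other is a maximum, so they should not be multiplied together.

```python
def time_saved(speedup: float) -> float:
    """Fraction of wall-clock time saved for a given speedup factor.

    A speedup of s means the same work takes 1/s of the baseline time,
    so the saving is 1 - 1/s.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


# Assumption: "29% average speedup" is read as a 1.29x factor
# relative to conventional speculative decoding.
vs_speculative = time_saved(1.29)

# "Up to 1.96x faster" relative to autoregressive decoding.
vs_autoregressive = time_saved(1.96)

print(f"vs. speculative decoding:    {vs_speculative:.1%} less time")
print(f"vs. autoregressive decoding: {vs_autoregressive:.1%} less time")
```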
Abstract
Large language models typically generate tokens autoregressively, using each token as input for the next. Recent work on Speculative Decoding has sought to accelerate this process by employing a smaller, faster draft model to more quickly generate candidate tokens. These candidates are then verified in parallel by the larger (original) verify model, resulting in overall speedup compared to using the larger model by itself in an autoregressive fashion. In this work, we introduce AMUSD (Asynchronous Multi-device Speculative Decoding), a system that further accelerates generation by decoupling the draft and verify phases into a continuous, asynchronous approach. Unlike conventional speculative decoding, where only one model (draft or verify) performs token generation at a time, AMUSD enables both models to perform predictions independently on separate devices (e.g., GPUs). We evaluate our…
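The contrast the abstract draws can be sketched with toy models. This is a minimal illustration under stated assumptions, not the author's implementation: `verify_next` and `draft_next` are hypothetical stand-ins for the large and small models, verification is greedy (exact token match), and Python threads stand in for separate devices. `speculative_decode` alternates draft and verify phases, while `async_decode` lets a draft thread keep proposing tokens as the verify loop confirms them or rolls them back.

```python
import threading


def verify_next(seq):
    """Toy stand-in for the large verify model (greedy next token)."""
    return (31 * sum(seq) + len(seq)) % 50


def draft_next(seq):
    """Toy stand-in for the small draft model; it disagrees with the
    verify model whenever len(seq) % 4 == 3."""
    tok = verify_next(seq)
    return (tok + 1) % 50 if len(seq) % 4 == 3 else tok


def autoregressive(prompt, n_new):
    """Baseline: the verify model generates every token itself."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(verify_next(seq))
    return seq


def speculative_decode(prompt, n_new, k=4):
    """Conventional scheme: draft k tokens, then verify; one model idles."""
    seq, target = list(prompt), len(prompt) + n_new
    while len(seq) < target:
        draft = []
        for _ in range(k):
            draft.append(draft_next(seq + draft))
        # On real hardware the k checks are one batched forward pass.
        for proposed in draft:
            expected = verify_next(seq)
            seq.append(expected)  # keep the verify model's token
            if proposed != expected:
                break  # later draft tokens were built on a wrong prefix
    return seq[:target]


def async_decode(prompt, n_new):
    """Asynchronous sketch: drafting and verifying run concurrently."""
    target = len(prompt) + n_new
    tokens = list(prompt)  # verified prefix followed by unverified drafts
    state = {"verified": len(prompt), "version": 0}
    lock, done = threading.Lock(), threading.Event()

    def draft_worker():
        while not done.is_set():
            with lock:
                snapshot, version = list(tokens), state["version"]
            if len(snapshot) >= target:
                done.wait(0.001)  # nothing left to draft; avoid spinning
                continue
            tok = draft_next(snapshot)
            with lock:
                # Discard the proposal if a rollback happened meanwhile.
                if state["version"] == version:
                    tokens.append(tok)

    worker = threading.Thread(target=draft_worker)
    worker.start()
    while state["verified"] < target:
        with lock:
            v = state["verified"]
            prefix = tokens[:v]
            candidate = tokens[v] if len(tokens) > v else None
        expected = verify_next(prefix)
        with lock:
            if candidate != expected:
                # Roll back all unverified drafts and insert the correct token.
                del tokens[v:]
                tokens.append(expected)
                state["version"] += 1
            state["verified"] = v + 1
    done.set()
    worker.join()
    return tokens[:target]


if __name__ == "__main__":
    reference = autoregressive([1, 2, 3], 20)
    print(speculative_decode([1, 2, 3], 20) == reference)
    print(async_decode([1, 2, 3], 20) == reference)
```

Both variants only ever keep tokens the verify model would have produced itself, which is why the output matches plain autoregressive decoding in this sketch. Because of Python's global interpreter lock, the threads here show the control flow only; they do not reproduce the timing behavior of two GPUs.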
