SwiftSpec: Ultra-Low Latency LLM Decoding by Scaling Asynchronous Speculative Decoding

Ziyi Zhang; Ziheng Jiang; Chengquan Jiang; Menghan Yu; Size Zheng; Haibin Lin; Henry Hoffmann; Xin Liu

arXiv:2506.11309·cs.DC·June 16, 2025

Open Access

TL;DR

SwiftSpec is an asynchronous, scalable speculative decoding system that significantly reduces the latency of large language model decoding, enabling faster and more efficient real-time applications.

Contribution

SwiftSpec redesigns speculative decoding around asynchronous, disaggregated components and proposes new techniques for parallel tree generation and cache management to achieve ultra-low latency.

Findings

01. Achieves 1.75x speedup over previous systems
02. Serves Llama3-70B at 348 tokens/sec on 8 GPUs
03. Fastest known low-latency LLM serving at this scale
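
To relate the throughput figure to latency: assuming the 348 tokens/sec figure is the decoding rate for a single query (the setting the abstract targets), the average time per generated token is simply its reciprocal. This is a back-of-the-envelope conversion, not a number reported by the authors.

```latex
t_{\text{token}} = \frac{1}{348\ \text{tokens/s}} \approx 2.87\ \text{ms per token}
```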

Abstract

Low-latency decoding for large language models (LLMs) is crucial for applications like chatbots and code assistants, yet generating long outputs remains slow in single-query settings. Prior work on speculative decoding (which combines a small draft model with a larger target model) and tensor parallelism has each accelerated decoding. However, conventional approaches fail to apply both simultaneously due to imbalanced compute requirements (between draft and target models), KV-cache inconsistencies, and communication overheads under small-batch tensor parallelism. This paper introduces SwiftSpec, a system that targets ultra-low latency for LLM decoding. SwiftSpec redesigns the speculative decoding pipeline in an asynchronous and disaggregated manner, so that each component can be scaled flexibly and draft overhead is removed from the critical path. To realize this design, SwiftSpec proposes…
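
For readers unfamiliar with the draft-and-verify idea the abstract builds on, here is a minimal Python sketch of conventional, synchronous speculative decoding with greedy verification. It is not SwiftSpec's asynchronous pipeline; it shows the baseline in which drafting sits on the critical path. The two toy "models" (`draft_next`, `target_next`) and all other names are hypothetical stand-ins invented for illustration.

```python
from typing import Callable, List, Tuple

# A "model" here is any function mapping a token prefix to the next token (greedy).
Model = Callable[[List[int]], int]


def target_next(prefix: List[int]) -> int:
    """Hypothetical 'large' model: next token is the prefix sum modulo 7."""
    return sum(prefix) % 7


def draft_next(prefix: List[int]) -> int:
    """Hypothetical 'small' model: matches the target except when the
    prefix length is a multiple of 4, where it guesses wrong."""
    guess = sum(prefix) % 7
    return (guess + 1) % 7 if len(prefix) % 4 == 0 else guess


def greedy_decode(prefix: List[int], target: Model, num_new: int) -> List[int]:
    """Plain autoregressive decoding: one target call per generated token."""
    tokens = list(prefix)
    for _ in range(num_new):
        tokens.append(target(tokens))
    return tokens


def speculative_decode(
    prefix: List[int], draft: Model, target: Model, num_new: int, k: int = 3
) -> Tuple[List[int], int]:
    """Synchronous speculative decoding with greedy verification.

    Returns the generated sequence and the number of verification rounds.
    In a real system each round is one batched target forward pass; here
    the per-position target calls inside a round only simulate that pass.
    """
    tokens = list(prefix)
    goal = len(prefix) + num_new
    rounds = 0
    while len(tokens) < goal:
        # 1. The draft model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))

        # 2. The target model checks the proposal position by position.
        rounds += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(tokens + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                # First mismatch: keep the target's token, drop the rest.
                accepted.append(expected)
                break
        else:
            # Every draft token matched, so the same pass yields a bonus token.
            accepted.append(target(tokens + accepted))

        tokens.extend(accepted)
    return tokens[:goal], rounds


if __name__ == "__main__":
    out, rounds = speculative_decode([1, 2], draft_next, target_next, num_new=12)
    print(out == greedy_decode([1, 2], target_next, 12), rounds)
```

Because every accepted token is checked against the target's own greedy choice, the output is identical to plain greedy decoding with the target model; the gain is that several tokens can be committed per target pass. In this synchronous form the target still waits for the draft in every round, which is the overhead the abstract says SwiftSpec removes from the critical path.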

Taxonomy

Topics: Algorithms and Data Compression · Coding theory and cryptography · Handwritten Text Recognition Techniques