BASS: Batched Attention-optimized Speculative Sampling
Haifeng Qian, Sujan Kumar Gonugondla, Sungsoo Ha, Mingyue Shang, Sanjay Krishna Gouda, Ramesh Nallapati, Sudipta Sengupta, Xiaofei Ma, Anoop Deoras

TL;DR
This paper introduces BASS, a batched speculative decoding system that reduces multi-sequence generation latency and raises GPU utilization when serving large language models.
Contribution
BASS is a system for batched speculative decoding that the authors report sets a new state of the art in multi-sequence generation latency, with better GPU utilization and better generation quality within a time budget.
Findings
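Achieves an average of 5.8 ms per token per sequence for a 7.8B model on a single A100 GPU at batch size 8 (overall throughput of 1.1K tokens per second)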
Provides 2.15X speed-up over optimized regular decoding
Reaches 15.8% GPU utilization during decoding
Abstract
Speculative decoding has emerged as a powerful method to improve latency and throughput in hosting large language models. However, most existing implementations focus on generating a single sequence. Real-world generative AI applications often require multiple responses and how to perform speculative decoding in a batched setting while preserving its latency benefits poses non-trivial challenges. This paper describes a system of batched speculative decoding that sets a new state of the art in multi-sequence generation latency and that demonstrates superior GPU utilization as well as quality of generations within a time budget. For example, for a 7.8B-size model on a single A100 GPU and with a batch size of 8, each sequence is generated at an average speed of 5.8ms per token, the overall throughput being 1.1K tokens per second. These results represent state-of-the-art latency and a 2.15X…
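For orientation, the sketch below shows the textbook speculative-sampling acceptance rule applied independently to each sequence in a batch. It is not the BASS implementation, which the abstract does not detail; the function names `verify_draft` and `verify_batch` and the list-based probability layout are assumptions made for illustration. It illustrates one likely source of the batching difficulty the abstract alludes to: each sequence can accept a different number of draft tokens, so the per-step outputs have ragged lengths.

```python
import random

def verify_draft(draft_tokens, draft_probs, target_probs, rng):
    """Textbook speculative-sampling check for ONE sequence (illustrative only).

    draft_tokens: k token ids proposed by the draft model
    draft_probs:  k distributions over the vocabulary from the draft model
    target_probs: k + 1 distributions from the main model
    Returns the tokens to append: accepted drafts plus one extra token.
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_probs[i][tok], draft_probs[i][tok]
        # Accept the draft token with probability min(1, p / q).
        if rng.random() < min(1.0, p / q):
            out.append(tok)
            continue
        # Rejected: resample from the residual max(0, p - q) and stop.
        residual = [max(0.0, pt - qd)
                    for pt, qd in zip(target_probs[i], draft_probs[i])]
        out.append(rng.choices(range(len(residual)), weights=residual)[0])
        return out
    # Every draft accepted: sample a bonus token from the last target distribution.
    last = target_probs[len(draft_tokens)]
    out.append(rng.choices(range(len(last)), weights=last)[0])
    return out

def verify_batch(batch, rng):
    """Apply the check per sequence; the results generally differ in length."""
    return [verify_draft(d, q, p, rng) for d, q, p in batch]
```

With a draft length of k, each sequence contributes between 1 and k + 1 tokens per step, so a batched system has to cope with sequences that advance at different rates.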
Taxonomy
Topics: Image and Video Quality Assessment · Visual Attention and Saliency Detection · Generative Adversarial Networks and Image Synthesis
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Focus
