BASS: Batched Attention-optimized Speculative Sampling

Haifeng Qian; Sujan Kumar Gonugondla; Sungsoo Ha; Mingyue Shang; Sanjay Krishna Gouda; Ramesh Nallapati; Sudipta Sengupta; Xiaofei Ma; Anoop Deoras

arXiv:2404.15778 · cs.LG · June 27, 2024

TL;DR

This paper introduces BASS, a batched speculative decoding system that lowers multi-sequence generation latency and raises GPU utilization when serving large language models.

Contribution

BASS is a system for batched speculative decoding that sets a new state of the art in multi-sequence generation latency and demonstrates superior GPU utilization.

Findings

01 Generates each sequence at an average of 5.8 ms per token for a 7.8B model on a single A100 GPU with a batch size of 8

02 Provides a 2.15X speed-up over optimized regular decoding

03 Reaches a peak GPU utilization of 15.8% during decoding

Abstract

Speculative decoding has emerged as a powerful method to improve latency and throughput in hosting large language models. However, most existing implementations focus on generating a single sequence. Real-world generative AI applications often require multiple responses, and performing speculative decoding in a batched setting while preserving its latency benefits poses non-trivial challenges. This paper describes a system of batched speculative decoding that sets a new state of the art in multi-sequence generation latency and that demonstrates superior GPU utilization as well as quality of generations within a time budget. For example, for a 7.8B-size model on a single A100 GPU and with a batch size of 8, each sequence is generated at an average speed of 5.8 ms per token, with an overall throughput of 1.1K tokens per second. These results represent state-of-the-art latency and a 2.15X…
