SparseAccelerate: Efficient Long-Context Inference for Mid-Range GPUs

James Vo

arXiv:2412.06198 · cs.CL · December 10, 2024

TL;DR

SparseAccelerate introduces a dynamic sparse attention method that adapts to input characteristics, significantly reducing latency and memory usage for long-context LLM inference on mid-range GPUs, enabling more practical real-time applications.

Contribution

The paper presents a dynamic sparse attention technique that scales efficiently up to 128K tokens and outperforms existing methods in latency and memory savings for long-context inference.

Findings

01. Achieves up to a 1.04x reduction in time-to-first-token (TTFT) latency at 32K tokens

02. Scales efficiently up to 128K tokens on dual NVIDIA A5000 GPUs

03. Demonstrates the smallest TTFT growth gradient among competing methods

Abstract

As Large Language Models (LLMs) scale to longer context windows, the computational cost of attention mechanisms, which traditionally grows quadratically with input length, presents a critical challenge for real-time and memory-constrained deployments. Existing sparse attention techniques have sought to reduce this complexity, but they often incur significant overhead or compromise accuracy, making them less practical for large contexts on mid-range hardware. In this paper, we introduce SparseAccelerate, a dynamic sparse attention method that adapts its sparsity patterns based on input characteristics, effectively flattening the attention complexity curve. Our approach is effective for input lengths starting at 16K tokens and scales efficiently up to 128K tokens on dual NVIDIA A5000 GPUs (24GB each). Experimental results show that SparseAccelerate achieves up to a 1.04x reduction in…
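To make the quadratic cost and the idea of sparse attention concrete, here is a minimal NumPy sketch. It is not the SparseAccelerate algorithm: the abstract says the method adapts its sparsity pattern to the input, whereas this sketch assumes a fixed per-query top-k rule as a simple stand-in. The function names (`dense_score_bytes`, `dense_attention`, `topk_sparse_attention`) and the `keep` parameter are hypothetical, chosen only for illustration. Attention here is single-head and non-causal.

```python
import numpy as np


def dense_score_bytes(n_tokens, bytes_per_elem=2):
    """Bytes needed to hold one head's full n x n score matrix (fp16 by default)."""
    return n_tokens * n_tokens * bytes_per_elem


def _softmax(x):
    # Row-wise softmax; entries set to -inf receive exactly zero weight.
    x = x - x.max(axis=-1, keepdims=True)
    w = np.exp(x)
    return w / w.sum(axis=-1, keepdims=True)


def dense_attention(q, k, v):
    """Standard scaled dot-product attention over all keys."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return _softmax(scores) @ v


def topk_sparse_attention(q, k, v, keep):
    """Each query attends only to its `keep` highest-scoring keys.

    Assumption: a fixed top-k rule stands in for a learned or adaptive
    sparsity pattern. For clarity this still computes the full score
    matrix, so it shows the pattern, not the speedup.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Column indices of the `keep` largest scores in each row.
    idx = np.argpartition(scores, -keep, axis=-1)[:, -keep:]
    masked = np.full_like(scores, -np.inf)
    kept = np.take_along_axis(scores, idx, axis=-1)
    np.put_along_axis(masked, idx, kept, axis=-1)
    return _softmax(masked) @ v


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.standard_normal((6, 4))
    k = rng.standard_normal((8, 4))
    v = rng.standard_normal((8, 4))
    print(topk_sparse_attention(q, k, v, keep=3).shape)
    print(dense_score_bytes(128 * 1024) / 2**30, "GiB")
```

By this arithmetic, a naively materialized fp16 score matrix for one head at 128K tokens would take 32 GiB, more than the 24GB of a single A5000. A practical sparse kernel would avoid computing the masked-out scores at all; the sketch above only shows which entries such a kernel would keep.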

Taxonomy

Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Medical Image Segmentation Techniques

Methods: Softmax · Attention Is All You Need