SparQ Attention: Bandwidth-Efficient LLM Inference

Luka Ribar; Ivan Chelombiev; Luke Hudlass-Galley; Charlie Blake; Carlo Luschi; Douglas Orr

arXiv:2312.04985·cs.LG·September 5, 2024·2 cites

TL;DR

SparQ Attention raises large language model inference throughput by fetching only selected parts of the cached history in the attention layers, which eases the data-transfer bottleneck without retraining or fine-tuning.

Contribution

The paper introduces an attention technique that improves inference throughput by using memory bandwidth more efficiently within the attention layers. It can be applied to existing LLMs without retraining.

Findings

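01. Up to 8x reduction in attention data transfers

02. No substantial drop in accuracy across Llama 2 and 3, Mistral, Gemma and Pythia models

03. Applicable to off-the-shelf LLMs at inference time, with no change to the pre-training setup and no additional fine-tuning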

Abstract

The computational difficulties of large language model (LLM) inference remain a significant obstacle to their widespread deployment. The need for many applications to support long input sequences and process them in large batches typically causes token-generation to be bottlenecked by data transfer. For this reason, we introduce SparQ Attention, a technique for increasing the inference throughput of LLMs by utilising memory bandwidth more efficiently within the attention layers, through selective fetching of the cached history. Our proposed technique can be applied directly to off-the-shelf LLMs during inference, without requiring any modification to the pre-training setup or additional fine-tuning. We show that SparQ Attention brings up to 8x savings in attention data transfers without substantial drops in accuracy, by evaluating Llama 2 and 3, Mistral, Gemma and Pythia models on a…
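The idea of "selective fetching of the cached history" can be illustrated with a small NumPy sketch. This is a simplified, hypothetical single-query, single-head version and not the authors' implementation: the function names, the parameters `r` (query components used for scoring) and `k` (cached positions fetched in full), and the element-count bookkeeping are all assumptions made for illustration, and any further corrections the paper applies are not reproduced here.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def dense_attention(q, K, V):
    """Standard attention for one query: reads all of K and V."""
    d = q.shape[0]
    return softmax(K @ q / np.sqrt(d)) @ V


def selective_fetch_attention(q, K, V, r, k):
    """Illustrative sketch (assumed names and parameters, not the paper's code).

    q: (d,) query, K: (n, d) cached keys, V: (n, d) cached values.
    Returns the output plus the number of cache elements read by this
    sketch and by dense attention.
    """
    n, d = K.shape
    # Step 1: keep the r largest-magnitude components of the query.
    comp = np.argsort(-np.abs(q))[:r]
    # Step 2: approximate the scores by reading only r columns of K.
    approx_scores = K[:, comp] @ q[comp] / np.sqrt(d)
    # Step 3: choose the k positions with the highest approximate score.
    top = np.argsort(-approx_scores)[:k]
    # Step 4: read full key and value rows for those positions only.
    weights = softmax(K[top] @ q / np.sqrt(d))
    out = weights @ V[top]
    elements_read = n * r + 2 * k * d   # partial K, then k rows of K and V
    dense_elements = 2 * n * d          # all of K and all of V
    return out, elements_read, dense_elements


rng = np.random.default_rng(0)
n, d = 1024, 64
q = rng.standard_normal(d)
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))
out, read, dense = selective_fetch_attention(q, K, V, r=16, k=64)
print(out.shape, read, dense)
```

In this sketch, setting `r = d` and `k = n` recovers dense attention exactly, so `r` and `k` act as knobs that trade data transfer against approximation error. The element counts describe only this toy model and are not the paper's measurements.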

Code & Models

Repositories

graphcore-research/llm-inference-research (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning and Algorithms

Methods: Pythia