Squeezed Attention: Accelerating Long Context Length LLM Inference

Coleman Hooper; Sehoon Kim; Hiva Mohammadzadeh; Monishwaran Maheswaran; Sebastian Zhao; June Paik; Michael W. Mahoney; Kurt Keutzer; Amir Gholami

arXiv:2411.09688 · cs.CL · October 17, 2025

TL;DR

Squeezed Attention accelerates long-context LLM inference by clustering the keys of the fixed context offline and then restricting attention to the subset of keys relevant to each query, achieving significant speedups with minimal accuracy loss.

Contribution

The paper combines offline clustering of the fixed-context keys with a hierarchical attention scheme to reduce inference cost for long-context LLMs, and implements optimized kernels that deliver the speedups in practice.

Findings

01

3.1× reduction in KV budget with no noticeable accuracy loss

02

Up to 8× reduction in KV budget with a small accuracy gap

03

More than 4× speedup in both the prefill and generation phases using optimized kernels

Abstract

Emerging Large Language Model (LLM) applications require long input context in order to perform complex tasks like document analysis and code generation. For these long context length applications, the length of the input prompt poses a significant challenge in terms of inference efficiency since the inference costs increase linearly with sequence length. However, for many of these applications, much of the context in the prompt is fixed across different user inputs, thereby providing the opportunity to perform offline optimizations in order to process user inputs quickly, as they are received. We propose Squeezed Attention to accelerate LLM applications where a large portion of the input context is fixed. We first leverage K-means clustering offline to group the keys for the fixed context based on semantic similarity and represent each cluster with a single centroid value. During…
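The offline clustering step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`kmeans_cosine`, `select_keys`, `sparse_attention`), the `top_clusters` parameter, and the use of cosine similarity are assumptions made for illustration. Because the abstract is truncated here, the online lookup step is inferred from the TL;DR (attend only to keys in the clusters most relevant to the query) and may differ from the paper's actual procedure.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    n = math.sqrt(dot(v, v)) or 1.0
    return [x / n for x in v]

def assign(unit_keys, centroids):
    # Each key goes to the centroid with the highest cosine similarity.
    return [max(range(len(centroids)), key=lambda c: dot(v, centroids[c]))
            for v in unit_keys]

def kmeans_cosine(keys, k, iters=20, seed=0):
    """Offline step (assumed form): cluster key vectors, return (centroids, labels)."""
    rng = random.Random(seed)
    unit = [normalize(v) for v in keys]
    centroids = [list(v) for v in rng.sample(unit, k)]
    for _ in range(iters):
        labels = assign(unit, centroids)
        for c in range(k):
            members = [v for v, l in zip(unit, labels) if l == c]
            if members:  # an empty cluster keeps its previous centroid
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[c] = normalize(mean)
    # Final assignment so labels match the returned centroids.
    return centroids, assign(unit, centroids)

def select_keys(query, centroids, labels, top_clusters=1):
    """Online step (assumed form): keep keys whose centroid best matches the query."""
    ranked = sorted(range(len(centroids)),
                    key=lambda c: dot(query, centroids[c]), reverse=True)
    chosen = set(ranked[:top_clusters])
    return [i for i, l in enumerate(labels) if l in chosen]

def sparse_attention(query, keys, values, idx):
    """Softmax attention computed only over the selected key indices."""
    scale = math.sqrt(len(query))
    logits = [dot(query, keys[i]) / scale for i in idx]
    m = max(logits)  # subtract the max for numerical stability
    w = [math.exp(x - m) for x in logits]
    z = sum(w)
    dim = len(values[0])
    return [sum(wi * values[i][j] for wi, i in zip(w, idx)) / z
            for j in range(dim)]
```

In this sketch, `kmeans_cosine` would run once over the fixed context, and each incoming query would then be compared against the `k` centroids rather than against every key before `sparse_attention` is evaluated on the selected indices.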

Code & Models

Repositories

SqueezeAILab/SqueezedAttention
jax · Official

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling

Methods: Softmax · Attention Is All You Need · k-Means Clustering