Paged Attention Meets FlexAttention: Unlocking Long-Context Efficiency in Deployed Inference

Thomas Joshi; Herman Saini; Neil Dhillon; Antoni Viros i Martin; Kaoutar El Maghraoui

arXiv:2506.07311·cs.LG·June 10, 2025

Open Access

TL;DR

This paper integrates PagedAttention with PyTorch's FlexAttention to improve long-context inference efficiency in large language models, reducing inference latency and internal memory fragmentation.

Contribution

The paper introduces a fused attention kernel that addresses memory inefficiency and internal fragmentation in long-context inference. The kernel is implemented within IBM's Foundation Model Stack (FMS) and open-sourced.
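To make the fragmentation point concrete, here is a minimal arithmetic sketch, not taken from the paper. It compares the unused token slots left by a monolithic KV cache preallocated to a maximum length with those left by a paged cache, where only the last page can be partially empty. The maximum length of 2048 and the page size of 16 are illustrative assumptions, and both function names are hypothetical.

```python
import math


def monolithic_waste(seq_len: int, max_len: int) -> int:
    """Slots reserved but unused when the cache is preallocated to max_len."""
    return max_len - seq_len


def paged_waste(seq_len: int, page_size: int) -> int:
    """Unused slots in the final, partially filled page of a paged cache."""
    pages = math.ceil(seq_len / page_size)
    return pages * page_size - seq_len


if __name__ == "__main__":
    MAX_LEN = 2048   # assumed preallocation size (illustrative)
    PAGE_SIZE = 16   # assumed page size in tokens (illustrative)
    for n in (128, 500, 2048):
        print(n, monolithic_waste(n, MAX_LEN), paged_waste(n, PAGE_SIZE))
```

Under these assumptions, paged waste is always smaller than one page, whereas monolithic waste grows with the gap between the actual and the reserved sequence length.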

Findings

01. Significantly reduced inference latency on an NVIDIA L4 GPU.

02. Linear growth in latency with sequence length when using a global KV cache.

03. Minimal increase in peak memory usage for sequence lengths up to 2048 tokens.

Abstract

Large Language Models (LLMs) encounter severe memory inefficiencies during long-context inference due to conventional handling of key-value (KV) caches. In this work, we introduce a novel integration of PagedAttention with PyTorch's FlexAttention, addressing internal fragmentation and inefficiencies associated with monolithic KV cache allocations. Implemented within IBM's Foundation Model Stack (FMS), our fused attention kernel efficiently gathers scattered KV data. Our benchmarks on an NVIDIA L4 GPU (24GB) demonstrate significantly reduced inference latency, growing only linearly (~2x) with sequence length from 128 to 2048 tokens when utilizing a global KV cache, compared to exponential latency increases without caching. While peak memory usage remains largely unchanged for single-step evaluations (dominated by model weights and activations), paged attention causes minimal incremental…
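The idea of gathering scattered KV data from a global cache can be sketched in plain Python. This is an illustration under stated assumptions, not the FMS kernel or the FlexAttention API: the class PagedKVCache, its methods, and the page sizes are all hypothetical, and a real implementation would perform the gather on the GPU inside the attention kernel.

```python
class PagedKVCache:
    """Toy global KV cache: fixed-size pages shared by all sequences."""

    def __init__(self, num_pages: int, page_size: int):
        self.page_size = page_size
        # Global physical pool of pages; each slot holds one token's KV entry.
        self.pool = [[None] * page_size for _ in range(num_pages)]
        self.free = list(range(num_pages))   # ids of unallocated pages
        self.page_table = {}                 # seq_id -> list of physical page ids
        self.lengths = {}                    # seq_id -> number of cached tokens

    def append(self, seq_id, kv):
        """Store one token's KV entry, allocating a new page only when needed."""
        n = self.lengths.get(seq_id, 0)
        table = self.page_table.setdefault(seq_id, [])
        if n % self.page_size == 0:          # current pages are full (or none yet)
            if not self.free:
                raise MemoryError("no free pages left in the pool")
            table.append(self.free.pop())
        page = table[n // self.page_size]
        self.pool[page][n % self.page_size] = kv
        self.lengths[seq_id] = n + 1

    def gather(self, seq_id):
        """Return the sequence's KV entries in logical order via the page table."""
        n = self.lengths.get(seq_id, 0)
        table = self.page_table.get(seq_id, [])
        return [
            self.pool[table[i // self.page_size]][i % self.page_size]
            for i in range(n)
        ]


if __name__ == "__main__":
    cache = PagedKVCache(num_pages=4, page_size=2)
    for i in range(3):                       # interleave two sequences
        cache.append("a", ("a", i))
        cache.append("b", ("b", i))
    print(cache.page_table)                  # physical pages are non-contiguous
    print(cache.gather("a"))
```

Because the two sequences are appended in interleaved order, each one ends up on non-adjacent physical pages, yet gather still returns its entries in logical token order by translating each logical index through the page table.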

Taxonomy

Topics: Advanced Neural Network Applications · Big Data and Digital Economy · Multimodal Machine Learning Applications

Methods: Fragmentation