Ragged Paged Attention: A High-Performance and Flexible LLM Inference Kernel for TPU

Jevin Jiang; Ying Chen; Blake A. Hechtman; Fenghui Zhang; Yarong Mu

arXiv:2604.15464 · cs.PF · April 20, 2026

TL;DR

Ragged Paged Attention (RPA) is a TPU attention kernel for LLM inference, implemented with Pallas and Mosaic, that targets the dynamic, ragged execution patterns common in modern serving.

Contribution

The paper introduces RPA, a high-performance TPU attention kernel built on three techniques: fine-grained tiling for efficient dynamic slicing over ragged memory, a custom software pipeline that fuses KV cache updates with attention computation, and distribution-aware compilation.

Findings

1. Achieves up to 86% memory bandwidth utilization in decode

2. Reaches 73% model FLOPs utilization in prefill

3. Provides a production-grade TPU inference foundation
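
The two utilization figures above are ratios of achieved throughput to hardware peak. The sketch below shows that arithmetic using the conventional definitions; the function names and all numbers are hypothetical, chosen only for illustration, and the paper's exact accounting of bytes and FLOPs may differ.

```python
def memory_bandwidth_utilization(bytes_moved: float, seconds: float,
                                 peak_bytes_per_s: float) -> float:
    """Achieved memory traffic rate divided by the hardware peak bandwidth."""
    return (bytes_moved / seconds) / peak_bytes_per_s


def model_flops_utilization(model_flops: float, seconds: float,
                            peak_flops_per_s: float) -> float:
    """Achieved model FLOP rate divided by the hardware peak FLOP rate."""
    return (model_flops / seconds) / peak_flops_per_s


# Hypothetical inputs (not measurements from the paper).
mbu = memory_bandwidth_utilization(
    bytes_moved=8.0e9, seconds=0.01, peak_bytes_per_s=1.0e12)
mfu = model_flops_utilization(
    model_flops=5.0e12, seconds=0.1, peak_flops_per_s=1.0e14)

print(f"MBU = {mbu:.2f}, MFU = {mfu:.2f}")
```

Decode is typically limited by how fast the KV cache can be read, so bandwidth utilization is the natural metric there, while prefill is typically compute-bound, which is why it is reported as FLOPs utilization.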

Abstract

Large Language Model (LLM) deployment is increasingly shifting to cost-efficient accelerators like Google's Tensor Processing Units (TPUs), prioritizing both performance and total cost of ownership (TCO). However, existing LLM inference kernels and serving systems remain largely GPU-centric, and there is no well-established approach for efficiently mapping LLM workloads onto TPU architectures, particularly under the dynamic and ragged execution patterns common in modern serving. In this paper, we present Ragged Paged Attention (RPA), a high-performance and flexible attention kernel for TPUs, implemented using Pallas and Mosaic. RPA addresses these challenges through three key techniques: (1) fine-grained tiling to enable efficient dynamic slicing over ragged memory, (2) a custom software pipeline that fuses KV cache updates with attention computation, and (3) a distribution-aware…
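
To make "ragged" and "paged" concrete, here is a minimal pure-Python reference of the semantics such a kernel computes. It is a sketch under stated assumptions, not the paper's Pallas implementation: the names (`gather_kv`, `attend`, `ragged_paged_attention_ref`, `cu_q_lens`, `page_table`), the single-head layout, and the causal alignment of new queries to the end of the KV sequence are all illustrative choices. Sequences contribute different numbers of query tokens (one for decode, several for prefill), and each sequence's keys and values live in non-contiguous fixed-size pages found through a page table.

```python
import math


def gather_kv(pages, page_indices, kv_len, page_size):
    """Collect the first kv_len (key, value) pairs of one sequence from its pages."""
    out = []
    for pos in range(kv_len):
        page = pages[page_indices[pos // page_size]]
        out.append(page[pos % page_size])
    return out


def attend(q, kv):
    """Softmax attention of one query vector over a list of (key, value) pairs."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k, _ in kv]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(kv[0][1])
    return [sum(w * v[d] for w, (_, v) in zip(weights, kv)) / total
            for d in range(dim)]


def ragged_paged_attention_ref(queries, cu_q_lens, kv_lens, page_table,
                               pages, page_size):
    """Reference semantics: queries are flattened across sequences (ragged)."""
    outputs = []
    for seq in range(len(kv_lens)):
        start, end = cu_q_lens[seq], cu_q_lens[seq + 1]
        q_len, kv_len = end - start, kv_lens[seq]
        kv = gather_kv(pages, page_table[seq], kv_len, page_size)
        for i in range(q_len):
            # Assumed causal rule: the new queries are the last q_len tokens.
            visible = kv_len - q_len + i + 1
            outputs.append(attend(queries[start + i], kv[:visible]))
    return outputs


# Hypothetical toy cache: 3 pages of 2 slots, each slot is (key, value).
pages = [
    [((1.0, 1.0), (3.0, 3.0)), ((0.0, 0.0), (0.0, 0.0))],  # page 0 (2nd slot unused)
    [((1.0, 0.0), (1.0, 2.0)), ((0.0, 1.0), (5.0, 6.0))],  # page 1
    [((1.0, 0.0), (3.0, 0.0)), ((0.0, 1.0), (0.0, 3.0))],  # page 2
]
page_table = [[2, 0], [1]]   # sequence 0 uses pages 2 then 0; sequence 1 uses page 1
kv_lens = [3, 2]             # valid KV tokens per sequence
cu_q_lens = [0, 1, 3]        # sequence 0: 1 query (decode); sequence 1: 2 queries (prefill)
queries = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

outputs = ragged_paged_attention_ref(
    queries, cu_q_lens, kv_lens, page_table, pages, page_size=2)
print(outputs)
```

A production kernel would avoid the explicit gather and per-token loops by tiling over pages and overlapping memory transfers with compute; this reference is only useful for checking outputs against.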
