Ragged Paged Attention: A High-Performance and Flexible LLM Inference Kernel for TPU
Jevin Jiang, Ying Chen, Blake A. Hechtman, Fenghui Zhang, Yarong Mu

TL;DR
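Ragged Paged Attention (RPA) is a TPU attention kernel for LLM inference, implemented with Pallas and Mosaic. It targets the dynamic, ragged execution patterns of modern serving with fine-grained tiling, a software pipeline that fuses KV cache updates with attention computation, and distribution-aware compilation.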
Contribution
The paper introduces RPA, a high-performance TPU attention kernel built on three techniques: fine-grained tiling for dynamic slicing over ragged memory, a custom software pipeline that fuses KV cache updates with attention computation, and distribution-aware compilation.
Findings
Achieves up to 86% memory bandwidth utilization in decode
Reaches 73% model FLOPs utilization in prefill
Provides a production-grade TPU inference foundation
Abstract
Large Language Model (LLM) deployment is increasingly shifting to cost-efficient accelerators like Google's Tensor Processing Units (TPUs), prioritizing both performance and total cost of ownership (TCO). However, existing LLM inference kernels and serving systems remain largely GPU-centric, and there is no well-established approach for efficiently mapping LLM workloads onto TPU architectures, particularly under the dynamic and ragged execution patterns common in modern serving. In this paper, we present Ragged Paged Attention (RPA), a high-performance and flexible attention kernel for TPUs, implemented using Pallas and Mosaic. RPA addresses these challenges through three key techniques: (1) fine-grained tiling to enable efficient dynamic slicing over ragged memory, (2) a custom software pipeline that fuses KV cache updates with attention computation, and (3) a distribution-aware…
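To make "ragged" and "paged" concrete, here is a minimal pure-Python reference sketch of what such a kernel computes. It is not the paper's Pallas implementation and does no tiling or pipelining. It assumes a single attention head, a packed (flat) query layout described by cumulative lengths, and a per-sequence page table. All names (`PAGE_SIZE`, `write_kv`, `gather_kv`, `attend`, `ragged_paged_attention_ref`, `cu_q_lens`, `kv_lens`, `page_tables`) are hypothetical and chosen for illustration only.

```python
import math

PAGE_SIZE = 4  # tokens per KV page (illustrative value, not from the paper)

def write_kv(k_pages, v_pages, page_table, pos, k_vec, v_vec):
    """Store one token's K/V at logical position `pos` of a sequence."""
    page = page_table[pos // PAGE_SIZE]  # physical page holding this token
    slot = pos % PAGE_SIZE               # offset inside that page
    k_pages[page][slot] = k_vec
    v_pages[page][slot] = v_vec

def gather_kv(pages, page_table, length):
    """Read the first `length` tokens of a sequence through its page table."""
    return [pages[page_table[p // PAGE_SIZE]][p % PAGE_SIZE] for p in range(length)]

def attend(q_vec, keys, values):
    """Single-head softmax attention for one query vector."""
    scale = 1.0 / math.sqrt(len(q_vec))
    scores = [scale * sum(a * b for a, b in zip(q_vec, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]

def ragged_paged_attention_ref(q, new_k, new_v, cu_q_lens, kv_lens,
                               page_tables, k_pages, v_pages):
    """q, new_k, new_v are packed token lists; cu_q_lens has num_seqs + 1 entries.
    kv_lens[i] is the KV length of sequence i after its new tokens are appended."""
    out = []
    for i, table in enumerate(page_tables):
        start, end = cu_q_lens[i], cu_q_lens[i + 1]
        q_len = end - start
        ctx = kv_lens[i] - q_len  # tokens already in the cache
        # Step 1: append this step's K/V to the paged cache.
        for j in range(q_len):
            write_kv(k_pages, v_pages, table, ctx + j, new_k[start + j], new_v[start + j])
        # Step 2: causal attention over the cached context plus the new tokens.
        keys = gather_kv(k_pages, table, kv_lens[i])
        values = gather_kv(v_pages, table, kv_lens[i])
        for j in range(q_len):
            visible = ctx + j + 1  # query j sees the context and new tokens 0..j
            out.append(attend(q[start + j], keys[:visible], values[:visible]))
    return out

# Toy batch: sequence 0 decodes 1 token on top of 5 cached tokens;
# sequence 1 prefills 3 tokens from an empty cache.
num_pages, dim = 3, 2
k_pages = [[[0.0] * dim for _ in range(PAGE_SIZE)] for _ in range(num_pages)]
v_pages = [[[0.0] * dim for _ in range(PAGE_SIZE)] for _ in range(num_pages)]
page_tables = [[2, 0], [1]]  # physical pages need not be contiguous or ordered
for pos in range(5):  # populate sequence 0's existing context
    write_kv(k_pages, v_pages, page_tables[0], pos, [float(pos), 1.0], [float(pos), -1.0])
q = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0]]
new_k = [[5.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_v = [[5.0, -1.0], [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = ragged_paged_attention_ref(q, new_k, new_v, [0, 1, 4], [6, 3],
                                 page_tables, k_pages, v_pages)
```

The batch mixes a decode request (one query token) with a prefill request (three query tokens) in a single packed call, which is the ragged case the abstract describes. In this sketch the cache write and the attention read are simply sequential steps inside one function. How the actual kernel overlaps them on TPU is not covered here.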
