PAT: Accelerating LLM Decoding via Prefix-Aware Attention with Resource Efficient Multi-Tile Kernel

Jinjun Yi; Zhixin Zhao; Yitao Hu; Ke Yan; Weiwei Sun; Hao Wang; Laiping Zhao; Yuhao Zhang; Wenxin Li; Keqiu Li

arXiv:2511.22333·cs.DC·March 17, 2026

TL;DR

This paper presents PAT, a prefix-aware attention kernel that accelerates large language model decoding by cutting repeated loads of shared-prefix KV cache and idle on-chip resources through shared-prefix packing and multi-tile execution.

Contribution

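PAT introduces a prefix-aware attention kernel for LLM decoding built on a pack-forward-merge paradigm: queries are packed by shared prefix to reduce repeated memory accesses, and a customized multi-tile kernel replaces one-size-fits-all tiling.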

Findings

1. Reduces attention latency by 53.5% on average

2. Achieves 17.0-93.1% speedup over state-of-the-art kernels

3. Effectively exploits shared prefixes to improve decoding efficiency

Abstract

LLM serving is increasingly dominated by decode attention, which is a memory-bound operation due to massive KV cache loading from global memory. Meanwhile, real-world workloads exhibit substantial, hierarchical shared prefixes across requests (e.g., system prompts, tools/templates, RAG). Existing attention implementations fail to fully exploit prefix sharing: one-query-per-CTA execution repeatedly loads shared prefix KV cache, while one-size-fits-all tiling leaves on-chip resources idle and exacerbates bubbles for uneven KV lengths. These choices amplify memory bandwidth pressure and stall memory-bound decode attention. This paper introduces PAT, a prefix-aware attention kernel implementation for LLM decoding that organizes execution with a pack-forward-merge paradigm. PAT packs queries by shared prefix to reduce repeated memory accesses, runs a customized multi-tile kernel to achieve…
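The pack-forward-merge idea described above can be illustrated with a small sketch. This is not PAT's kernel: it is a pure-Python toy that assumes each request consists of one shared prefix plus a private, non-empty suffix, and that partial results are combined with the standard log-sum-exp merge used in split-KV attention. All names (`partial_attention`, `merge`, `decode_with_shared_prefix`, `prefix_id`) are hypothetical, and the `prefix_loads` counter only stands in for global-memory traffic.

```python
import math
from collections import defaultdict


def partial_attention(q, keys, values):
    """Softmax attention of one query over a non-empty KV segment.

    Returns (output, lse), where lse is the log-sum-exp of the scaled scores.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / total
           for d in range(dim)]
    return out, m + math.log(total)


def merge(out_a, lse_a, out_b, lse_b):
    """Combine two partial attention results into the exact full result."""
    m = max(lse_a, lse_b)
    wa, wb = math.exp(lse_a - m), math.exp(lse_b - m)
    return [(wa * a + wb * b) / (wa + wb) for a, b in zip(out_a, out_b)]


def decode_with_shared_prefix(requests, prefix_kv):
    """Toy pack-forward-merge over requests that share prefixes.

    requests:  list of dicts with keys q, prefix_id, suffix_k, suffix_v
    prefix_kv: dict mapping prefix_id -> (keys, values)
    Returns (outputs, prefix_loads).
    """
    # Pack: group request indices by the prefix they share.
    groups = defaultdict(list)
    for idx, req in enumerate(requests):
        groups[req["prefix_id"]].append(idx)

    outputs = [None] * len(requests)
    prefix_loads = 0
    for pid, members in groups.items():
        # Forward: fetch the shared prefix KV once for the whole group.
        keys, values = prefix_kv[pid]
        prefix_loads += 1
        for idx in members:
            req = requests[idx]
            o_p, l_p = partial_attention(req["q"], keys, values)
            o_s, l_s = partial_attention(req["q"], req["suffix_k"],
                                         req["suffix_v"])
            # Merge: recombine prefix and suffix partial results.
            outputs[idx] = merge(o_p, l_p, o_s, l_s)
    return outputs, prefix_loads
```

With two requests that share one system prompt, the sketch fetches the prefix KV once rather than once per query, while the merged outputs match attention computed over the concatenated prefix and suffix.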
