CoDec: Prefix-Shared Decoding Kernel for LLMs

Zhibin Wang; Rui Ning; Chao Fang; Zhonghui Zhang; Xi Lin; Shaobo Ma; Mo Zhou; Xue Li; Zhongfeng Wang; Chengying Huan; Rong Gu; Kun Yang; Guihai Chen; Sheng Zhong; Chen Tian

arXiv:2505.17694·cs.LG·March 31, 2026

TL;DR

CoDec introduces a specialized attention kernel that exploits prefix-sharing in LLM decoding, reporting a 1.9× speedup over FlashDecoding and a 120.9× reduction in memory access during attention computation.

Contribution

The paper presents CoDec, a novel shared-prefix attention kernel that optimizes memory hierarchy and workload balancing for efficient prefix-sharing in LLM decoding.

Findings

01. Achieves 1.9× speedup over FlashDecoding
02. Reduces memory access by 120.9×
03. Speeds up end-to-end token generation by 3.8×

Abstract

Prefix-sharing among multiple prompts makes it possible to combine operations on the shared prefix. Attention computation in the decode stage, meanwhile, becomes a critical bottleneck as context lengths grow: it is memory-intensive and requires heavy access to the key-value (KV) cache of the prefixes. In this paper, we therefore explore the potential of prefix-sharing in decode-stage attention computation. The tree structure of prefix-sharing, however, makes this challenging: the kernel must process shared KV cache access patterns efficiently while managing complex dependencies and balancing irregular workloads. To address these challenges, we propose CoDec, a dedicated attention kernel that combines the memory access of shared prefixes in the decoding stage. CoDec delivers two key innovations: a novel…
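The idea of splitting decode-stage attention into a shared-prefix part and a per-request part can be illustrated with a small sketch. This is not CoDec's kernel: it is a plain-Python illustration of the general log-sum-exp merge that lets attention over a shared prefix and attention over each request's own suffix be computed separately and then combined exactly. The function names (`partial_attention`, `merge`, `shared_prefix_decode`) and the two-level prefix/suffix layout are assumptions made for illustration; the paper addresses a full prefix tree on the GPU.

```python
import math

def partial_attention(q, keys, values):
    """Softmax attention of one query over one KV chunk.

    Returns the chunk-local output and the log-sum-exp of the scaled scores,
    which is all that is needed to merge this chunk with others later.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / total
           for d in range(dim)]
    return out, m + math.log(total)

def merge(out_a, lse_a, out_b, lse_b):
    """Combine two partial results into attention over the union of chunks."""
    m = max(lse_a, lse_b)
    wa, wb = math.exp(lse_a - m), math.exp(lse_b - m)
    return [(wa * a + wb * b) / (wa + wb) for a, b in zip(out_a, out_b)]

def shared_prefix_decode(queries, prefix_kv, suffix_kvs):
    """Decode-step attention for requests that share one prefix (hypothetical layout).

    prefix_kv is a single (keys, values) pair used by every request;
    suffix_kvs holds one (keys, values) pair per request. In a GPU kernel the
    prefix KV would be loaded once and reused across all queries; this loop
    only mirrors the arithmetic, not the memory behavior.
    """
    prefix_keys, prefix_values = prefix_kv
    results = []
    for q, (suffix_keys, suffix_values) in zip(queries, suffix_kvs):
        out_p, lse_p = partial_attention(q, prefix_keys, prefix_values)
        out_s, lse_s = partial_attention(q, suffix_keys, suffix_values)
        results.append(merge(out_p, lse_p, out_s, lse_s))
    return results
```

Because the merge is exact, the result for each request should match ordinary attention over the concatenation of the prefix and that request's suffix, up to floating-point rounding. The potential saving comes from reading the shared prefix KV cache once for the whole group rather than once per request.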
