LycheeDecode: Accelerating Long-Context LLM Inference via Hybrid-Head Sparse Decoding
Gang Lin, Dongfang Li, Zhuoen Chen, Yukun Shi, Xuhui Chen, Baotian Hu, Min Zhang

TL;DR
LycheeDecode introduces a hybrid-head sparse decoding method that significantly accelerates long-context LLM inference while maintaining high generative quality, by dynamically selecting crucial tokens with a fine-grained attention mechanism.
Contribution
It proposes a novel hybrid-head attention mechanism with a hardware-efficient top-k selection, improving long-context inference speed without sacrificing model performance.
Findings
Achieves up to 2.7x speedup at 128K context length.
Maintains generative quality comparable to or better than full-attention models.
Effective on models like Llama3 and Qwen3 across diverse benchmarks.
Abstract
The proliferation of long-context large language models (LLMs) exposes a key bottleneck: the rapidly expanding key-value cache during decoding, which imposes heavy memory and latency costs. While recent approaches attempt to alleviate this by sharing a single set of crucial tokens across layers, such coarse-grained sharing undermines model performance by neglecting the functional diversity of attention heads. To address this, we propose LycheeDecode, an efficient decoding method centered on a fine-grained hybrid-head attention mechanism that employs a hardware-efficient top-k selection strategy. Specifically, the novel HardKuma-based mechanism partitions attention heads into a small subset of retrieval heads that dynamically identify crucial tokens and a majority of sparse heads that reuse them for efficient computation. Through extensive experiments on leading models like Llama3 and…
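The split between retrieval heads and sparse heads described in the abstract can be illustrated with a minimal pure-Python sketch of one decoding step. This is an illustration under stated assumptions, not the authors' kernel: the function names (`attend`, `retrieval_head`, `sparse_head`), the scaled dot-product scoring, and the toy cache are all hypothetical, and the real method uses a hardware-efficient top-k selection rather than a full sort.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(q, keys, values, token_ids):
    """Single-query attention restricted to the cached tokens in token_ids."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, keys[t]) * scale for t in token_ids]
    weights = softmax(scores)
    dim = len(values[0])
    out = [sum(w * values[t][d] for w, t in zip(weights, token_ids))
           for d in range(dim)]
    return out, scores

def retrieval_head(q, keys, values, k):
    """Assumed behavior: full attention over the cache, plus the ids of the
    k highest-scoring tokens so that other heads can reuse them."""
    all_ids = list(range(len(keys)))
    out, scores = attend(q, keys, values, all_ids)
    top = sorted(all_ids, key=lambda t: scores[t], reverse=True)[:k]
    return out, sorted(top)

def sparse_head(q, keys, values, selected_ids):
    """Assumed behavior: attention over only the tokens a retrieval head chose."""
    out, _ = attend(q, keys, values, selected_ids)
    return out

# Toy KV cache with 6 tokens (hypothetical numbers).
keys = [[0.1, 0.0], [2.0, 0.0], [0.0, 1.0], [3.0, 0.0], [-1.0, 0.0], [0.5, 0.0]]
values = [[float(i)] for i in range(6)]
q = [1.0, 0.0]

full_out, selected = retrieval_head(q, keys, values, k=2)
sparse_out = sparse_head(q, keys, values, selected)
```

On this toy cache the retrieval head scores all 6 tokens and selects tokens 1 and 3; the sparse head then touches only those 2 cached entries, which is where the memory and latency savings would come from when most heads are sparse.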
Peer Reviews
Decision: ICLR 2026 Poster
The paper presents a well-motivated approach. The experimental results demonstrate that LycheeDecode achieves performance comparable to full attention baselines on complex reasoning tasks.
1. Recent work has demonstrated that trainable sparse attention can also achieve efficient decoding [1-3]. The paper lacks discussion and empirical comparison with these methods.
[1] Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
[2] MiniCPM4: Ultra-Efficient LLMs on End Devices
[3] SeerAttention-R: Sparse Attention Adaptation for Long Reasoning
2. While the paper shows kernel-level speedup for different sparse head ratios, there is no corresponding analysis
1. The paper provides compelling motivation that layer-level token sharing ignores significant functional diversity among heads, with empirical evidence showing highly variable top-k overlap across adjacent layers.
2. Trainable retrieval and sparse head assignment: HardKuma offers a principled, differentiable relaxation that tends toward binary outcomes without rounding.
3. The custom hybrid-head kernel yields substantial kernel-level improvements, which strengthens the practicality claim.
1. Lack of comparison with training-based long-context inference methods: DuoAttention is arguably the most directly comparable prior work, and its omission makes it difficult to isolate novelty beyond the use of top-k propagation.
2. Head role consistency and interpretability are not evaluated: how stable are head assignments across different settings (random setting), and does specialization generalize to unseen domains?
3. Training complexity is under-characterized. The method computes both full and
1. The paper's primary innovation is the hybrid-head decoding mechanism. This mechanism establishes a head-indexed pipeline, where a Retrieval Head in one layer selects tokens specifically for its corresponding Sparse Head in the next layer to reuse.
2. The use of the HardKuma distribution directly targets a known weakness (train-inference discrepancy) in prior training-based specialization methods, leading to a more stable and direct optimization of the discrete head roles.
1. Missing Key Baseline Comparisons: The paper's central claim is that its cooperative head-specialization architecture is superior. However, it fails to provide any end-to-end performance or speed comparisons against the most direct SOTA competitors in the head-specialization sub-field (e.g., DuoAttention, RazorAttention). It only compares against a layer-sharing method (TidalDecode). This makes the SOTA claim unsubstantiated, as we cannot see how it performs against other architectures with th
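The reviews above describe HardKuma as a differentiable relaxation that tends toward binary outcomes without rounding. A minimal sketch of how such a sample can be drawn is given here, assuming the usual stretched-and-rectified Kumaraswamy construction. The function name `sample_hardkuma`, the shape parameters, and the stretch bounds `low=-0.1` and `high=1.1` are illustrative assumptions; how LycheeDecode parameterizes the distribution per head is not stated in this summary.

```python
import random

def sample_hardkuma(a, b, low=-0.1, high=1.1, rng=random):
    """Draw one sample in [0, 1] with point masses at exactly 0 and 1.

    Assumed construction: sample Kumaraswamy(a, b) by inverse CDF,
    stretch the support to (low, high), then clip back to [0, 1].
    """
    u = rng.random()
    # Inverse of the Kumaraswamy CDF F(x) = 1 - (1 - x**a)**b.
    k = (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)
    stretched = low + (high - low) * k
    # Clipping moves the mass outside [0, 1] onto the endpoints.
    return min(1.0, max(0.0, stretched))

# With a = b = 0.5 the density is U-shaped, so many samples are exactly 0 or 1.
rng = random.Random(0)
samples = [sample_hardkuma(0.5, 0.5, rng=rng) for _ in range(1000)]
num_zero = sum(1 for s in samples if s == 0.0)
num_one = sum(1 for s in samples if s == 1.0)
```

Because the clipped endpoints carry non-zero probability, a gate sampled this way can be exactly 0 (sparse head) or exactly 1 (retrieval head) during training, which is the property the reviewers connect to a smaller train-inference discrepancy.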
Taxonomy
Topics: Topic Modeling · Machine Learning in Healthcare · Natural Language Processing Techniques
