HyLRA: Hybrid Layer Reuse Attention for Efficient Long-Context Inference

Xuan Ai; Qingqing Yang; Peng Wang; Lei Deng; Lin Zhang; Renhai Chen; Gong Zhang

arXiv:2602.00777 · cs.CL · February 3, 2026

TL;DR

HyLRA introduces a hybrid attention mechanism that selectively reuses token information across layers, raising long-context inference throughput in large language models by 6-46% while keeping accuracy degradation below 1%.

Contribution

The paper proposes HyLRA, a novel layer-wise sparse attention framework that balances full attention and token reuse, optimizing long-context inference in LLMs.

Findings

1. Achieves 6-46% higher inference throughput.
2. Keeps accuracy degradation below 1%.
3. Outperforms existing sparse attention methods.

Abstract

Long-context inference in Large Language Models (LLMs) is bottlenecked by the quadratic computational complexity of attention and the substantial memory footprint of Key-Value (KV) caches. While existing sparse attention mechanisms attempt to mitigate this by exploiting inherent sparsity, they often rely on rigid patterns or aggressive pruning, failing to achieve an optimal balance between efficiency and accuracy. In this paper, we introduce HyLRA (Hybrid Layer Reuse Attention), a novel framework driven by layer-wise sparsity profiling. Our empirical analysis uncovers a dual characteristic in attention mechanics: intra-layer sensitivity, where specific layers necessitate full attention to prevent feature distortion, and inter-layer similarity, where consecutive layers share substantial critical tokens. Based on these observations, HyLRA…
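
The inter-layer similarity observation in the abstract can be made concrete with a small profiling sketch. This is not the authors' implementation: it assumes that a token counts as "critical" when it is among the top-k keys by attention weight for a given query, and that similarity is measured as the overlap of those top-k sets between consecutive layers. The function names (`critical_tokens`, `overlap_ratio`, `profile_layers`) and the synthetic attention scores are hypothetical and used only for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def critical_tokens(attn_weights, k):
    """Indices of the k keys with the largest attention weight (assumed definition)."""
    ranked = sorted(range(len(attn_weights)),
                    key=lambda i: attn_weights[i], reverse=True)
    return set(ranked[:k])


def overlap_ratio(tokens_a, tokens_b):
    """Fraction of layer A's critical tokens that are also critical in layer B."""
    if not tokens_a:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a)


def profile_layers(per_layer_weights, k):
    """Top-k overlap for every pair of consecutive layers."""
    top_sets = [critical_tokens(w, k) for w in per_layer_weights]
    return [overlap_ratio(a, b) for a, b in zip(top_sets, top_sets[1:])]


if __name__ == "__main__":
    # Synthetic data: each layer's scores are the previous layer's plus noise,
    # which is an assumption made only to produce correlated layers.
    rng = random.Random(0)
    seq_len, num_layers, k = 64, 6, 8
    scores = [rng.gauss(0.0, 1.0) for _ in range(seq_len)]
    layers = []
    for _ in range(num_layers):
        scores = [s + rng.gauss(0.0, 0.3) for s in scores]
        layers.append(softmax(scores))
    for idx, ratio in enumerate(profile_layers(layers, k)):
        print(f"layers {idx}->{idx + 1}: top-{k} overlap = {ratio:.2f}")
```

Under these assumptions, a high overlap between two consecutive layers suggests that the second layer could plausibly reuse the first layer's selected tokens, while a low overlap would argue for recomputing the selection.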

Taxonomy

Topics: Machine Learning in Healthcare · Domain Adaptation and Few-Shot Learning · Topic Modeling