HyLRA: Hybrid Layer Reuse Attention for Efficient Long-Context Inference
Xuan Ai, Qingqing Yang, Peng Wang, Lei Deng, Lin Zhang, Renhai Chen, Gong Zhang

TL;DR
HyLRA introduces a hybrid attention mechanism that selectively reuses token information across layers, significantly improving long-context inference efficiency in large language models while maintaining accuracy.
Contribution
The paper proposes HyLRA, a novel layer-wise sparse attention framework that balances full attention and token reuse, optimizing long-context inference in LLMs.
Findings
Improves inference throughput by 6-46%.
Keeps accuracy degradation below 1%.
Outperforms existing sparse attention methods.
Abstract
Long-context inference in Large Language Models (LLMs) is bottlenecked by the quadratic computational complexity of attention and the substantial memory footprint of Key-Value (KV) caches. While existing sparse attention mechanisms attempt to mitigate this by exploiting inherent sparsity, they often rely on rigid patterns or aggressive pruning, failing to achieve an optimal balance between efficiency and accuracy. In this paper, we introduce HyLRA (Hybrid Layer Reuse Attention), a novel framework driven by layer-wise sparsity profiling. Our empirical analysis uncovers a dual characteristic in attention mechanics: intra-layer sensitivity, where specific layers necessitate full attention to prevent feature distortion, and inter-layer similarity, where consecutive layers share substantial critical tokens. Based on these observations, HyLRA…
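The two observations in the abstract (some layers need full attention, and consecutive layers tend to share their critical tokens) can be illustrated with a small sketch. This is not the authors' algorithm: the abstract is truncated before the method is described, so the top-k selection rule, the choice of which layers run full attention, and every name below (`hybrid_pass`, `full_layers`, `top_k_tokens`, `overlap`) are assumptions made purely for illustration. It uses a single query per layer and plain Python to stay self-contained.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys, token_ids=None):
    """Softmax attention of one query over keys, restricted to token_ids if given."""
    ids = list(range(len(keys))) if token_ids is None else sorted(token_ids)
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, keys[i])) / scale for i in ids]
    return dict(zip(ids, softmax(scores)))


def top_k_tokens(weights, k):
    """Indices of the k tokens with the largest attention weight (assumed rule)."""
    return set(sorted(weights, key=weights.get, reverse=True)[:k])


def overlap(a, b):
    """Fraction of the tokens in a that also appear in b."""
    return len(a & b) / max(len(a), 1)


def hybrid_pass(queries, keys_per_layer, full_layers, k):
    """Per layer: run full attention, or reuse the cached critical tokens."""
    cached, plan = None, []
    for layer, (q, keys) in enumerate(zip(queries, keys_per_layer)):
        if layer in full_layers or cached is None:
            w = attention_weights(q, keys)          # full attention over all tokens
            cached = top_k_tokens(w, k)             # refresh the critical-token set
            plan.append(("full", len(w)))
        else:
            w = attention_weights(q, keys, cached)  # sparse attention on reused tokens
            plan.append(("reuse", len(w)))
    return plan


# Toy input: 3 layers, 4 tokens, 2-dimensional keys (hypothetical values).
keys = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
plan = hybrid_pass([[1.0, 0.0]] * 3, [keys] * 3, full_layers={0}, k=2)
print(plan)  # [('full', 4), ('reuse', 2), ('reuse', 2)]
```

In this toy run only layer 0 scores all four tokens; layers 1 and 2 attend to the two cached tokens. A helper like `overlap` could be used offline to measure how much the critical-token sets of adjacent layers agree, which is the kind of profiling signal the abstract alludes to.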
Taxonomy
Topics: Machine Learning in Healthcare · Domain Adaptation and Few-Shot Learning · Topic Modeling
