A Training-free Sub-quadratic Cost Transformer Model Serving Framework With Hierarchically Pruned Attention
Heejun Lee, Geon Park, Youngwan Lee, Jaduk Suh, Jina Kim, Wonyoung Jeong, Bumsik Kim, Hyemin Lee, Myeongjae Jeon, Sung Ju Hwang

TL;DR
This paper introduces HiP (Hierarchically Pruned Attention), a training-free framework that reduces the time complexity and memory usage of attention in large language models, enabling efficient long-context processing with already pretrained models.
Contribution
The paper presents a novel, training-free attention mechanism with sub-quadratic complexity, leveraging attention locality and a tree-search algorithm for scalable long-context LLMs.
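As a rough illustration of how attention locality and a tree search can combine, the sketch below selects about k keys for a single query by coarse-to-fine search and then attends only to them. It is not the authors' implementation. The function names, the choice of the middle token as each chunk's representative, and the omission of causal masking and batching are all assumptions made for this example.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def hierarchical_topk_indices(q, keys, k):
    """Pick k key indices for one query by coarse-to-fine search (illustrative only)."""
    T = len(keys)
    if T <= k:
        return list(range(T))
    # Start from k contiguous chunks that together cover [0, T).
    bounds = [i * T // k for i in range(k + 1)]
    chunks = list(zip(bounds[:-1], bounds[1:]))
    while any(e - s > 1 for s, e in chunks):
        # Split every chunk that still holds more than one token into two halves.
        halves = []
        for s, e in chunks:
            if e - s > 1:
                m = (s + e) // 2
                halves += [(s, m), (m, e)]
            else:
                halves.append((s, e))
        # Assumption: one representative key (the middle token) stands in for
        # its whole chunk, which is only reasonable if nearby tokens score similarly.
        scored = [(dot(q, keys[(s + e) // 2]), s, e) for s, e in halves]
        scored.sort(reverse=True)
        # Keep the k best halves, restored to positional order.
        chunks = sorted((s, e) for _, s, e in scored[:k])
    return [s for s, _ in chunks]

def sparse_attention(q, keys, values, k):
    """Softmax attention restricted to the keys chosen by the search above."""
    idx = hierarchical_topk_indices(q, keys, k)
    scale = 1.0 / math.sqrt(len(q))
    logits = [dot(q, keys[i]) * scale for i in idx]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * values[i][j] for w, i in zip(weights, idx)) / total
           for j in range(dim)]
    return out, idx
```

Each round scores at most 2k representatives and the chunk size roughly halves, so one query needs on the order of k log(T/k) dot products instead of T. Whether the representative is a good proxy for its chunk depends on the locality pattern holding for the model at hand.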
Findings
Significantly reduces prefill and decoding latency.
Maintains high-quality generation with minimal degradation.
Enables scaling LLMs to millions of tokens on commodity GPUs.
Abstract
In modern large language models (LLMs), increasing the context length is crucial for improving comprehension and coherence in long-context, multi-modal, and retrieval-augmented language generation. While many recent transformer models attempt to extend their context length over a million tokens, they remain impractical due to the quadratic time and space complexities. Although recent works on linear and sparse attention mechanisms can achieve this goal, their real-world applicability is often limited by the need to re-train from scratch and significantly worse performance. In response, we propose a novel approach, Hierarchically Pruned Attention (HiP), which reduces the time complexity of the attention mechanism to $O(T \log T)$ and the space complexity to $O(T)$, where $T$ is the sequence length. We notice a pattern in the attention scores of pretrained LLMs where tokens close together…
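For a rough sense of scale, assume the sub-quadratic time cost takes the form $O(T \log T)$ with a base-2 logarithm, and ignore constant factors (both are simplifying assumptions for this illustration). At a context length of $T = 2^{20}$ tokens (about one million), the ratio of quadratic to sub-quadratic work is

$$
\frac{T^2}{T \log_2 T} = \frac{T}{\log_2 T} = \frac{2^{20}}{20} \approx 5.2 \times 10^{4}.
$$

Actual speedups depend on constant factors, such as how many keys each query retains, and on kernel efficiency, so this figure describes only the asymptotic gap, not measured latency.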
Taxonomy
Topics: Ferroelectric and Negative Capacitance Devices · Parallel Computing and Optimization Techniques · Advanced Neural Network Applications
