A Training-free Sub-quadratic Cost Transformer Model Serving Framework With Hierarchically Pruned Attention

Heejun Lee; Geon Park; Youngwan Lee; Jaduk Suh; Jina Kim; Wonyoung Jeong; Bumsik Kim; Hyemin Lee; Myeongjae Jeon; Sung Ju Hwang

arXiv:2406.09827·cs.CL·January 24, 2025

Open Access

TL;DR

This paper introduces HiP (Hierarchically Pruned Attention), a training-free attention framework that reduces the time complexity of attention to $O(T \log T)$ and the space complexity to $O(T)$ for sequence length $T$, enabling efficient long-context processing in large language models without retraining.

Contribution

The paper presents a novel, training-free attention mechanism with sub-quadratic complexity, leveraging attention locality and a tree-search algorithm for scalable long-context LLMs.
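To make the tree-search idea concrete, here is a minimal sketch of coarse-to-fine key selection for a single query. It is not the authors' implementation: the function name `hierarchical_topk_keys`, the choice of the middle key as each chunk's representative, and the halving schedule are all assumptions made for illustration. It relies on the locality observation that nearby keys tend to have similar attention scores, so one representative can stand in for a chunk.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product, used as the (unnormalized) attention score."""
    return sum(x * y for x, y in zip(a, b))


def hierarchical_topk_keys(
    q: Sequence[float], K: Sequence[Sequence[float]], k: int = 4
) -> List[int]:
    """Hypothetical sketch: pick up to k key indices by repeated pruning.

    Assumption: each chunk is scored by its middle key only.
    """
    T = len(K)
    n0 = min(k, T)
    # Start from n0 contiguous chunks [start, end) that cover all keys.
    chunks = [(i * T // n0, (i + 1) * T // n0) for i in range(n0)]
    while any(e - s > 1 for s, e in chunks):
        # Split every chunk that still holds more than one key in half.
        halves = []
        for s, e in chunks:
            if e - s > 1:
                m = (s + e) // 2
                halves += [(s, m), (m, e)]
            else:
                halves.append((s, e))
        # Score each half by its representative and keep the best k.
        ranked = sorted(
            halves, key=lambda c: dot(q, K[(c[0] + c[1]) // 2]), reverse=True
        )
        chunks = sorted(ranked[:k])
    return [s for s, _ in chunks]


# Toy usage: scores peak at key 37 and fall off smoothly around it.
K = [[-abs(i - 37)] for i in range(64)]
q = [1.0]
selected = hierarchical_topk_keys(q, K, k=4)
```

In this sketch each query scores at most 2k representatives per round and the chunk size halves every round, so the work per query is about $O(k \log T)$, which gives $O(T \log T)$ over all $T$ queries when $k$ is fixed.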

Findings

01. Significantly reduces prefill and decoding latency.

02. Maintains high-quality generation with minimal degradation.

03. Enables scaling LLMs to millions of tokens on commodity GPUs.

Abstract

In modern large language models (LLMs), increasing the context length is crucial for improving comprehension and coherence in long-context, multi-modal, and retrieval-augmented language generation. While many recent transformer models attempt to extend their context length over a million tokens, they remain impractical due to the quadratic time and space complexities. Although recent works on linear and sparse attention mechanisms can achieve this goal, their real-world applicability is often limited by the need to re-train from scratch and significantly worse performance. In response, we propose a novel approach, Hierarchically Pruned Attention (HiP), which reduces the time complexity of the attention mechanism to $O(T \log T)$ and the space complexity to $O(T)$, where $T$ is the sequence length. We notice a pattern in the attention scores of pretrained LLMs where tokens close together…
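As a rough illustration of what the gap between quadratic and $O(T \log T)$ attention means at long context lengths, the sketch below compares the two growth rates. The helper names are hypothetical, constant factors are ignored, and base-2 logarithms are an arbitrary choice, so the ratio only indicates scale, not measured speedup.

```python
import math


def dense_attention_cost(T: int) -> int:
    """Score count for dense attention: every query against every key."""
    return T * T


def log_linear_attention_cost(T: int) -> float:
    """Idealized T * log2(T) count, ignoring constant factors (assumption)."""
    return T * math.log2(T)


for T in (4_096, 131_072, 1_048_576):
    dense = dense_attention_cost(T)
    sparse = log_linear_attention_cost(T)
    print(f"T={T:>9,}  dense={dense:.3e}  T*log2(T)={sparse:.3e}  "
          f"ratio={dense / sparse:,.0f}x")
```

The ratio simplifies to $T / \log_2 T$, so it keeps growing as the context gets longer.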

Taxonomy

Topics: Ferroelectric and Negative Capacitance Devices · Parallel Computing and Optimization Techniques · Advanced Neural Network Applications