InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU

Heejun Lee; Geon Park; Jaduk Suh; Sung Ju Hwang

arXiv:2502.08910·cs.CL·February 14, 2025

TL;DR

InfiniteHiP is a practical inference framework that lets large language models process up to 3 million tokens on a single GPU by dynamically pruning irrelevant context tokens and offloading the key-value cache to host memory.

Contribution

The paper introduces InfiniteHiP, a novel inference framework that extends LLM context length to 3 million tokens on a single GPU through hierarchical token pruning and memory optimization techniques.
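To see why memory optimization matters at this scale, a rough key-value cache size estimate can help. The sketch below is only back-of-the-envelope arithmetic; the model shape (32 layers, 8 key-value heads, head dimension 128, 16-bit values) is a hypothetical configuration chosen for illustration and is not taken from the paper.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Estimate the KV cache size in bytes for a decoder-only transformer.

    The leading factor of 2 accounts for storing both keys and values.
    All shape parameters are assumptions supplied by the caller.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


# Hypothetical model shape (an assumption, not from the paper).
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

per_token = kv_cache_bytes(1, LAYERS, KV_HEADS, HEAD_DIM)
total = kv_cache_bytes(3_000_000, LAYERS, KV_HEADS, HEAD_DIM)

print(f"per token: {per_token} bytes")           # 131072 bytes = 128 KiB
print(f"3M tokens: {total / 2**30:.1f} GiB")     # about 366.2 GiB
```

Under these assumed dimensions the cache alone would be far larger than the memory of a single current GPU, which is consistent with the motivation for keeping most of it off-device.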

Findings

01. Enables processing of 3 million tokens on a single GPU

02. Achieves 18.95x speedup in attention decoding for 1 million tokens

03. No permanent loss of context information during long sequence processing

Abstract

In modern large language models (LLMs), handling very long context lengths presents significant challenges because it slows inference and increases memory costs. Additionally, most existing pre-trained LLMs fail to generalize beyond their original training sequence lengths. To enable efficient and practical long-context utilization, we introduce InfiniteHiP, a novel and practical LLM inference framework that accelerates processing by dynamically eliminating irrelevant context tokens through a modular hierarchical token pruning algorithm. Our method also allows generalization to longer sequences by selectively applying various RoPE adjustment methods according to the internal attention patterns within LLMs. Furthermore, we offload the key-value cache to host memory during inference, significantly reducing GPU memory pressure. As a result, InfiniteHiP enables the processing of…
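The general idea of coarse-to-fine token pruning mentioned in the abstract can be illustrated with a toy example. The sketch below is not the paper's algorithm: scoring each chunk by the dot product between the query and the chunk's mean key is an assumption made for simplicity, and the names `prune_stage`, `hierarchical_prune`, and the `stages` parameter are hypothetical.

```python
def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def prune_stage(query, keys, candidates, chunk_size, keep_chunks):
    """One pruning stage over the surviving token indices.

    Groups candidates into chunks, scores each chunk by
    query . mean(chunk keys) (an illustrative heuristic), and keeps
    the token indices of the top `keep_chunks` chunks.
    """
    chunks = [candidates[i:i + chunk_size]
              for i in range(0, len(candidates), chunk_size)]
    ranked = sorted(
        chunks,
        key=lambda ch: dot(query, mean_vector([keys[i] for i in ch])),
        reverse=True,
    )
    return sorted(i for ch in ranked[:keep_chunks] for i in ch)


def hierarchical_prune(query, keys, stages):
    """Apply stages of (chunk_size, keep_chunks), coarse to fine."""
    candidates = list(range(len(keys)))
    for chunk_size, keep_chunks in stages:
        candidates = prune_stage(query, keys, candidates, chunk_size, keep_chunks)
    return candidates


# Toy data: 16 keys, only token 5 aligns with the query direction.
keys = [[0.0, 1.0] for _ in range(16)]
keys[5] = [10.0, 0.0]
query = [1.0, 0.0]

selected = hierarchical_prune(query, keys, stages=[(4, 2), (2, 1)])
print(selected)  # [4, 5]
```

Each stage only scores one representative per chunk, so the work per stage scales with the number of chunks rather than the number of tokens. Attention would then be computed over the surviving indices only; the pruned keys are skipped for this query rather than deleted.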

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Softmax · Attention Is All You Need · Pruning