InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU
Heejun Lee, Geon Park, Jaduk Suh, Sung Ju Hwang

TL;DR
InfiniteHiP is a practical framework that enables large language models to process up to 3 million tokens on a single GPU by dynamically pruning irrelevant tokens and optimizing memory usage, significantly improving long-context handling.
Contribution
The paper introduces InfiniteHiP, a novel inference framework that extends LLM context length to 3 million tokens on a single GPU through hierarchical token pruning and memory optimization techniques.
Findings
Enables processing of 3 million tokens on a single GPU
Achieves 18.95x speedup in attention decoding for 1 million tokens
No permanent loss of context information during long sequence processing
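A rough back-of-the-envelope estimate helps show why a 3-million-token context strains a single GPU and why moving the key-value cache off the device matters. The sketch below assumes a hypothetical model shape (32 layers, 8 key-value heads, head dimension 128, 16-bit values); these numbers are illustrative assumptions, not a configuration taken from the paper, and `kv_cache_bytes` is a helper name invented for this example.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Size of a dense key-value cache in bytes.

    The leading factor 2 counts one key vector and one value vector
    per token, per layer, per key-value head.
    """
    return 2 * num_tokens * num_layers * num_kv_heads * head_dim * bytes_per_value


# Hypothetical model shape (assumption, not from the paper):
# 32 layers, 8 KV heads, head dimension 128, 2 bytes per value (fp16/bf16).
total = kv_cache_bytes(3_000_000, num_layers=32, num_kv_heads=8, head_dim=128)
print(f"{total / 2**30:.1f} GiB")
```

Under these assumptions the dense cache comes to roughly 366 GiB, far more than the memory of any single current GPU, so the cache has to live somewhere other than device memory.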
Abstract
In modern large language models (LLMs), handling very long context lengths presents significant challenges as it causes slower inference speeds and increased memory costs. Additionally, most existing pre-trained LLMs fail to generalize beyond their original training sequence lengths. To enable efficient and practical long-context utilization, we introduce InfiniteHiP, a novel and practical LLM inference framework that accelerates processing by dynamically eliminating irrelevant context tokens through a modular hierarchical token pruning algorithm. Our method also allows generalization to longer sequences by selectively applying various RoPE adjustment methods according to the internal attention patterns within LLMs. Furthermore, we offload the key-value cache to host memory during inference, significantly reducing GPU memory pressure. As a result, InfiniteHiP enables the processing of…
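The idea of hierarchical token pruning can be illustrated with a toy coarse-to-fine selection in NumPy. This is a minimal sketch under stated assumptions, not the paper's algorithm: the function names, chunk sizes, and the "best dot product in the chunk" scoring rule are all hypothetical choices made for illustration. It also scores every key in each surviving chunk, so it shows the selection structure only, not the speedup a real implementation would obtain.

```python
import numpy as np


def hierarchical_prune(query, keys, chunk_sizes=(64, 16, 4), keep_chunks=8):
    """Toy coarse-to-fine token selection (illustrative only).

    At each stage the surviving token indices are grouped into chunks,
    each chunk is scored by its best query-key dot product, and only the
    `keep_chunks` highest-scoring chunks survive to the next, finer stage.
    """
    candidates = np.arange(keys.shape[0])
    for size in chunk_sizes:
        # Split the surviving indices into consecutive groups of `size`.
        chunks = [candidates[i:i + size] for i in range(0, len(candidates), size)]
        scores = np.array([np.max(keys[c] @ query) for c in chunks])
        top = np.argsort(scores)[-keep_chunks:]
        candidates = np.sort(np.concatenate([chunks[i] for i in top]))
    return candidates


def sparse_attention(query, keys, values, kept):
    """Softmax attention restricted to the kept token indices."""
    logits = keys[kept] @ query / np.sqrt(query.shape[0])
    weights = np.exp(logits - logits.max())  # subtract max for stability
    weights /= weights.sum()
    return weights @ values[kept]


rng = np.random.default_rng(0)
dim = 16
keys = rng.normal(size=(4096, dim))
values = rng.normal(size=(4096, dim))
query = rng.normal(size=dim)
keys[1000] = 10.0 * query  # plant one strongly relevant token

kept = hierarchical_prune(query, keys)
output = sparse_attention(query, keys, values, kept)
print(len(kept), 1000 in kept)
```

With the default settings, 4096 keys are narrowed to 512, then 128, then 32 candidates, and the planted token survives every stage because its chunk always has the highest score. Nothing is deleted from `keys` or `values`; a later query can select a different subset, which matches the summary's point that pruning is dynamic rather than a permanent loss of context.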
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Softmax · Attention Is All You Need · Pruning
