Out of the Memory Barrier: A Highly Memory Efficient Training System for LLMs with Million-Token Contexts

Wenhao Li; Daohai Yu; Gen Luo; Yuxin Zhang; Fei Chao; Rongrong Ji; Yifan Wu; Jiaxin Liu; Ziyang Gong; Zimu Liao

arXiv:2602.02108 · cs.CL · March 3, 2026

TL;DR

OOMB is a memory-efficient training system for long-context large language models. It combines chunk-recurrent training with on-the-fly activation recomputation and KV cache management (paging, asynchronous CPU offloading, and page-level sparse attention) so that very long contexts can be trained on a single GPU.

Contribution

The paper presents a chunk-recurrent training framework with on-the-fly activation recomputation that keeps the activation memory footprint constant (O(1)), together with KV cache optimizations that reduce memory usage for long-context LLM training.

Findings

01

Memory overhead increases by only 10 MB per 10K tokens of context.

02

Enables training Qwen2.5-7B with 4M-token context on a single GPU.

03

Achieves near-linear scaling of memory efficiency with context length.
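
To put the first two findings side by side, here is a back-of-the-envelope sketch. It assumes that the reported rate of 10 MB per 10K tokens holds uniformly across the whole context range and that "4M" means 4,000,000 tokens; both are assumptions for illustration, not statements from the paper. The function name `extra_memory_mb` is hypothetical. The result covers only the context-dependent overhead, not weights, optimizer state, or other fixed costs.

```python
# Rate taken from finding 01; assumed here to be constant over the full range.
MB_PER_10K_TOKENS = 10


def extra_memory_mb(context_tokens: int) -> float:
    """Estimate the context-dependent memory overhead in MB (linear extrapolation)."""
    return context_tokens / 10_000 * MB_PER_10K_TOKENS


# Assumption: a "4M-token" context is 4,000,000 tokens.
overhead = extra_memory_mb(4_000_000)
print(f"{overhead:.0f} MB")  # prints "4000 MB", i.e. roughly 4 GB
```

Under these assumptions, the context-dependent overhead at 4M tokens is on the order of 4 GB.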

Abstract

Training Large Language Models (LLMs) on long contexts is severely constrained by prohibitive GPU memory overhead, not training time. The primary culprits are the activations, whose memory footprints scale linearly with sequence length. We introduce OOMB, a highly memory-efficient training system that directly confronts this barrier. Our approach employs a chunk-recurrent training framework with on-the-fly activation recomputation, which maintains a constant activation memory footprint (O(1)) and shifts the primary bottleneck to the growing KV cache. To manage the KV cache, OOMB integrates a suite of synergistic optimizations: a paged memory manager for both the KV cache and its gradients to eliminate fragmentation, asynchronous CPU offloading to hide data transfer latency, and page-level sparse attention to reduce both computational complexity and communication overhead. The synergy of…
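
The paged memory manager mentioned in the abstract can be illustrated with a toy page table. The sketch below is not OOMB's implementation: it tracks only page bookkeeping in plain Python, whereas the real system manages GPU tensors for both the KV cache and its gradients. The class `PagedKVCache`, its methods, and the parameters `num_pages` and `page_size` are hypothetical names chosen for illustration. It shows how fixed-size pages drawn from a shared pool let the cache grow chunk by chunk without needing one contiguous buffer.

```python
from dataclasses import dataclass, field


@dataclass
class PagedKVCache:
    """Toy page table mapping token positions to fixed-size pages in a shared pool."""

    num_pages: int  # total pages in the preallocated pool (hypothetical parameter)
    page_size: int  # tokens stored per page (hypothetical parameter)
    free_pages: list = field(init=False)
    page_table: list = field(init=False)  # logical page index -> physical page id
    num_tokens: int = field(init=False, default=0)

    def __post_init__(self):
        # All physical pages start free; pop() hands out the lowest ids first.
        self.free_pages = list(range(self.num_pages - 1, -1, -1))
        self.page_table = []

    def append_chunk(self, chunk_len: int) -> None:
        """Reserve room for chunk_len new tokens, allocating pages only when needed."""
        total_pages = -(-(self.num_tokens + chunk_len) // self.page_size)  # ceil division
        needed = total_pages - len(self.page_table)
        if needed > len(self.free_pages):
            raise MemoryError("page pool exhausted")
        for _ in range(needed):
            self.page_table.append(self.free_pages.pop())
        self.num_tokens += chunk_len

    def locate(self, pos: int) -> tuple:
        """Return (physical page id, offset within the page) for a token position."""
        if not 0 <= pos < self.num_tokens:
            raise IndexError(pos)
        return self.page_table[pos // self.page_size], pos % self.page_size

    def release(self) -> None:
        """Return every page to the pool so another sequence can reuse it."""
        self.free_pages.extend(reversed(self.page_table))
        self.page_table.clear()
        self.num_tokens = 0


cache = PagedKVCache(num_pages=4, page_size=16)
cache.append_chunk(10)   # first chunk fits in one page
cache.append_chunk(10)   # second chunk spills into a second page
print(cache.page_table)  # [0, 1]
print(cache.locate(17))  # (1, 1)
```

Because every page has the same size, a freed page can always satisfy a later allocation, which is the general reason paging avoids fragmentation.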

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Big Data and Digital Economy