HyperOffload: Graph-Driven Hierarchical Memory Management for Large Language Models on SuperNode Architectures
Fangxin Liu, Qinghua Zhang, Hanjing Shen, Zhibo Liang, Li Jiang, Haibing Guan, Chong Bao, Xuefeng Jin

TL;DR
HyperOffload is a compiler-assisted, graph-driven memory management framework for large language models on SuperNode architectures. It reduces peak device memory and hides remote-memory latency through static scheduling.
Contribution
It presents a novel compiler-based approach that explicitly models remote memory operations, enabling global analysis and scheduling for hierarchical SuperNode systems.
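To make "explicitly models remote memory operations" concrete, here is a minimal sketch of the general idea, not the paper's actual compiler pass. It assumes a linear op schedule and a set of tensors that live in remote memory, and it inserts explicit prefetch nodes a fixed number of steps ahead of each tensor's first use so a static scheduler could overlap the transfer with earlier compute. The names `Op`, `insert_prefetches`, and `lookahead` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Op:
    name: str
    inputs: list = field(default_factory=list)
    kind: str = "compute"  # "compute" or "prefetch" (hypothetical labels)


def insert_prefetches(schedule, remote_tensors, lookahead=2):
    """Return a new schedule with an explicit prefetch op placed
    `lookahead` positions before the first use of each remote tensor."""
    # Record the index of the first op that consumes each remote tensor.
    first_use = {}
    for i, op in enumerate(schedule):
        for t in op.inputs:
            if t in remote_tensors and t not in first_use:
                first_use[t] = i

    # Group prefetch ops by the position they should be emitted before.
    inserts = {}
    for t, i in first_use.items():
        pos = max(0, i - lookahead)
        inserts.setdefault(pos, []).append(Op(f"prefetch({t})", [t], "prefetch"))

    # Rebuild the schedule with prefetches made visible as ordinary nodes.
    out = []
    for i, op in enumerate(schedule):
        out.extend(inserts.get(i, []))
        out.append(op)
    return out


# Toy schedule: two weight tensors are assumed to reside in remote memory.
schedule = [
    Op("embed", ["tokens"]),
    Op("attn0", ["w_attn0"]),
    Op("mlp0", ["w_mlp0"]),
    Op("attn1", ["w_attn1"]),
]
planned = insert_prefetches(schedule, {"w_mlp0", "w_attn1"}, lookahead=2)
names = [op.name for op in planned]
print(names)
```

Because the prefetches appear as nodes in the schedule rather than being triggered at run time, a compiler can reason about them globally, for example by choosing `lookahead` per tensor from estimated transfer and compute times. A fixed lookahead is used here purely to keep the sketch short.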
Findings
Reduces peak device memory by up to 26% during inference.
Maintains end-to-end performance despite memory optimizations.
Demonstrates the importance of compiler integration for hardware-aware memory management.
Abstract
The rapid evolution of Large Language Models (LLMs) towards long-context reasoning and sparse architectures has pushed memory requirements far beyond the capacity of individual device HBM. While emerging supernode architectures offer terabyte-scale shared memory pools via high-bandwidth interconnects, existing software stacks fail to exploit this hardware effectively. Current runtime-based offloading and swapping techniques operate with a local view, leading to reactive scheduling and exposed communication latency that stall the computation pipeline. In this paper, we propose the SuperNode Memory Management Framework (HyperOffload). It employs a compiler-assisted approach that leverages graph-driven memory management to treat remote memory access as explicit operations in the computation graph, specifically designed for hierarchical SuperNode architectures. Unlike reactive…
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Big Data and Digital Economy · Advanced Neural Network Applications
