TL;DR
Horizon-LM introduces a memory-centric training system that leverages host memory as the primary parameter store, enabling scalable large-model training on single GPUs with predictable memory usage and high throughput.
Contribution
Horizon-LM redefines the roles of CPU and GPU: host memory serves as the main parameter store and no model modules stay resident on the GPU, which allows training models of up to 120B parameters on a single GPU.
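The idea of keeping parameters in host memory and holding only a transient working set on the device can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not Horizon-LM's implementation: `HostStore`, `Device`, and `forward` are hypothetical names, plain Python lists stand in for weight tensors, and a list copy stands in for a host-to-device transfer.

```python
from dataclasses import dataclass


@dataclass
class HostStore:
    """Hypothetical host-memory parameter store: one weight list per layer."""
    layers: list


@dataclass
class Device:
    """Toy stand-in for a GPU that only tracks how many parameters are resident."""
    resident: int = 0
    peak: int = 0

    def load(self, weights):
        # Copying the list models a host-to-device transfer (assumption).
        self.resident += len(weights)
        self.peak = max(self.peak, self.resident)
        return list(weights)

    def free(self, weights):
        # Releasing the copy models dropping the layer from device memory.
        self.resident -= len(weights)


def forward(store, device, x):
    """Toy forward pass that keeps at most one layer on the device at a time."""
    for host_weights in store.layers:
        w = device.load(host_weights)
        x = sum(w) * x  # placeholder "layer": scale the input by the weight sum
        device.free(w)
    return x


store = HostStore(layers=[[0.5, 0.5], [1.0, 1.0, 0.0], [2.0]])
device = Device()
y = forward(store, device, 3.0)
total_params = sum(len(layer) for layer in store.layers)
```

In this toy setup the peak number of device-resident parameters equals the size of the largest layer rather than the size of the whole model, which is the property the summary describes as eliminating persistent GPU modules.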
Findings
Trains models up to 120B parameters on a single GPU.
Achieves 12.2× higher throughput than DeepSpeed ZeRO-3.
Maintains predictable memory growth and high device utilization.
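A back-of-the-envelope calculation shows why a 120B-parameter model cannot simply reside on one GPU. The sketch below counts weight storage only and rests on assumptions not taken from the paper: 2 bytes per parameter (a 16-bit format), a hypothetical 80 GB device, and a hypothetical 96 equally sized layers. Gradients, optimizer state, and activations are ignored.

```python
GB = 10**9  # decimal gigabytes


def weights_gb(num_params, bytes_per_param=2):
    """Weight storage in GB, assuming a fixed width per parameter."""
    return num_params * bytes_per_param / GB


def per_layer_gb(num_params, num_layers, bytes_per_param=2):
    """Weight storage per layer in GB, assuming equally sized layers."""
    return weights_gb(num_params, bytes_per_param) / num_layers


NUM_PARAMS = 120 * 10**9   # 120B parameters, the scale quoted in the findings
DEVICE_GB = 80             # hypothetical GPU memory capacity (assumption)
NUM_LAYERS = 96            # hypothetical layer count (assumption)

total = weights_gb(NUM_PARAMS)                 # all weights at 16-bit width
layer = per_layer_gb(NUM_PARAMS, NUM_LAYERS)   # one layer's share of the weights
fits_whole_model = total <= DEVICE_GB
fits_single_layer = layer <= DEVICE_GB
```

Under these assumptions the full set of weights needs 240 GB, three times the hypothetical device capacity, while a single layer needs 2.5 GB. Weight memory that scales with the model lives on the host, and the device only needs room for the layers currently in use.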
Abstract
The rapid growth of large language models (LLMs) has outpaced the evolution of single-GPU hardware, making model scale increasingly constrained by memory capacity rather than computation. While modern training systems extend GPU memory through distributed parallelism and offloading across CPU and storage tiers, they fundamentally retain a GPU-centric execution paradigm in which GPUs host persistent model replicas and full autograd graphs. As a result, scaling large models remains tightly coupled to multi-GPU clusters, complex distributed runtimes, and unpredictable host memory consumption, creating substantial barriers for node-scale post-training workloads such as instruction tuning, alignment, and domain adaptation. We present Horizon-LM, a memory-centric training system that redefines the roles of CPU and GPU for large-model optimization. Horizon-LM treats host memory as the…
