TL;DR
MegaTrain is a memory-centric system that enables full-precision training of models exceeding 100 billion parameters on a single GPU by leveraging host memory and optimized streaming techniques.
Contribution
It introduces a novel memory-centric training system that overcomes GPU memory limitations for extremely large models using host memory and streaming optimizations.
Findings
Trains 120B-parameter models on a single GPU.
Achieves 1.84x the throughput of DeepSpeed ZeRO-3 with CPU offloading.
Trains 7B models with a 512k-token context on a single GH200.
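A rough back-of-the-envelope calculation helps show why a 120B-parameter model cannot keep its training state on the device. The sketch below is illustrative only: the per-parameter byte counts (4-byte weights, 4-byte gradients, 8 bytes of optimizer state) and the 80 GB device capacity are assumptions, not figures taken from the paper.

```python
def training_state_bytes(num_params, weight_bytes=4, grad_bytes=4, optimizer_bytes=8):
    """Bytes of persistent training state under the assumed per-parameter sizes."""
    return num_params * (weight_bytes + grad_bytes + optimizer_bytes)


def fits_on_device(num_params, device_bytes, **sizes):
    """True if the whole training state fits in the given device memory."""
    return training_state_bytes(num_params, **sizes) <= device_bytes


NUM_PARAMS = 120 * 10**9           # 120B parameters, from the findings above
ASSUMED_DEVICE_BYTES = 80 * 10**9  # hypothetical 80 GB accelerator (assumption)

# Weights alone, then weights + gradients + optimizer state.
weights_only = training_state_bytes(NUM_PARAMS, grad_bytes=0, optimizer_bytes=0)
full_state = training_state_bytes(NUM_PARAMS)

print(f"weights only: {weights_only / 10**9:.0f} GB")
print(f"full state:   {full_state / 10**9:.0f} GB")
print("fits on device:", fits_on_device(NUM_PARAMS, ASSUMED_DEVICE_BYTES))
```

Under these assumptions even the weights alone exceed the device capacity several times over, which is the gap that keeping state in host memory is meant to close.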
Abstract
We present MegaTrain, a memory-centric system that efficiently trains 100B+ parameter large language models at full precision on a single GPU. Unlike traditional GPU-centric systems, MegaTrain stores parameters and optimizer states in host memory (CPU memory) and treats GPUs as transient compute engines. For each layer, we stream parameters in and compute gradients out, minimizing persistent device state. To battle the CPU-GPU bandwidth bottleneck, we adopt two key optimizations. 1) We introduce a pipelined double-buffered execution engine that overlaps parameter prefetching, computation, and gradient offloading across multiple CUDA streams, enabling continuous GPU execution. 2) We replace persistent autograd graphs with stateless layer templates, binding weights dynamically as they stream in, eliminating persistent graph metadata while providing flexibility in scheduling. On a single…
