LiveR: Fine-Grained Elasticity via Live Reconfiguration for Model Training
Haoyuan Liu, Kairui Zhou, Shuyao Qi, Qinwei Yang, Shengkai Lin, Shizhen Zhao, Wei Zhang

TL;DR
LiveR enables live reconfiguration of large language model training on volatile GPU resources, cutting reconfiguration downtime from minutes to seconds and sustaining high training throughput without checkpoint-based restarts.
Contribution
LiveR introduces a live, bounded-memory reconfiguration runtime that replaces checkpoint-based restarts with asynchronous state streaming and online reshaping for elastic LLM training.
Findings
Reduces reconfiguration downtime from minutes to seconds.
Reconfigures 14-23 times faster than checkpoint-based methods.
Maintains up to 99% training throughput under volatile resource conditions.
Abstract
To reduce user costs and maximize cluster utilization, large model training increasingly leverages volatile but inexpensive GPU capacity, such as spot instances and reclaimable resources in shared clusters. Yet, capitalizing on these economic benefits requires jobs to adapt within the short warning windows that many such environments provide. Existing elastic training systems still treat reconfiguration as stop-and-restart: they externalize distributed state through checkpoints, rebuild the distributed runtime on a new topology, and restart training, turning each resize event into a storage-heavy recovery procedure that incurs substantial downtime from checkpoint I/O, process restart, CUDA initialization, and communicator setup. We present LiveR, a live reconfiguration runtime for elastic LLM training that replaces storage-backed restart with a live, bounded-memory handoff between…
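To make the contrast with storage-backed restart concrete, here is a minimal Python sketch of a bounded-memory handoff: state is streamed in fixed-size chunks through a bounded buffer and re-partitioned for a new worker count as it arrives, with no intermediate checkpoint. This is an illustration under stated assumptions, not LiveR's implementation. The function name `stream_reshard`, its parameters, the use of Python lists in place of GPU tensors, and the in-process queue in place of a network transport are all hypothetical.

```python
import queue
import threading


def stream_reshard(old_shards, new_world_size, chunk_size=4, max_in_flight=2):
    """Re-partition a flat state from old_shards into new_world_size shards.

    Hypothetical sketch: chunks flow through a bounded queue, so at most
    max_in_flight chunks are staged at any time (the "bounded memory" part).
    """
    total = sum(len(shard) for shard in old_shards)

    # Contiguous, near-equal split of the flat state across the new ranks.
    base, extra = divmod(total, new_world_size)
    ends = []
    end = 0
    for rank in range(new_world_size):
        end += base + (1 if rank < extra else 0)
        ends.append(end)

    staging = queue.Queue(maxsize=max_in_flight)

    def producer():
        # Old workers emit (global_offset, chunk) pairs in order.
        offset = 0
        for shard in old_shards:
            for i in range(0, len(shard), chunk_size):
                chunk = shard[i:i + chunk_size]
                staging.put((offset, chunk))  # blocks while the buffer is full
                offset += len(chunk)
        staging.put(None)  # end-of-stream marker

    sender = threading.Thread(target=producer)
    sender.start()

    # New workers consume chunks and reshape them into the new layout online.
    new_shards = [[] for _ in range(new_world_size)]
    rank = 0
    while True:
        item = staging.get()
        if item is None:
            break
        offset, chunk = item
        for j, value in enumerate(chunk):
            while offset + j >= ends[rank]:
                rank += 1  # this element belongs to a later rank
            new_shards[rank].append(value)

    sender.join()
    return new_shards


if __name__ == "__main__":
    # Two old workers hold six values each; resize to three workers.
    old = [list(range(0, 6)), list(range(6, 12))]
    print(stream_reshard(old, new_world_size=3))
```

The bounded queue is the design point of the sketch: the producer blocks when the buffer is full, so staging memory stays fixed regardless of total state size, whereas a checkpoint-based resize would first write the full state to storage and read it back.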
