Runtime-Orchestrated Second-Order Optimization for Scalable LLM Training
Yishun Lu, Junhao Zhang, Zeyu Yang, and Wes Armour

TL;DR
Asteria is a runtime system that enables scalable second-order optimization for large language model training by distributing optimizer state across GPU memory, CPU memory, and optional NVMe storage, and by running expensive preconditioner computations asynchronously on the host.
Contribution
Asteria introduces a runtime approach that moves second-order optimization logic off the critical GPU training path, making large-scale second-order LLM training practical.
Findings
Supports second-order training on a 1B-parameter model with limited GPU memory.
Reduces optimizer overhead and latency spikes on multi-node systems.
Accelerates convergence and maintains optimization benefits in large models.
Abstract
Second-order methods offer an attractive path toward more sample-efficient LLM training, but their practical use is often blocked by the systems cost of maintaining and updating large matrix-based optimizer states. We introduce Asteria, a runtime system designed to remove this bottleneck by separating second-order optimization logic from the critical GPU training path. Rather than keeping all preconditioner state on the accelerator, Asteria dynamically distributes optimizer state across GPU memory, CPU memory, and optional NVMe storage according to architectural constraints and runtime pressure. It further uses training hooks to prepare shadow states in advance, allowing expensive inverse-root computations to proceed asynchronously on the host while GPU computation continues. For distributed training, Asteria employs a bounded-staleness protocol that limits synchronization…
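The asynchronous shadow-state idea described in the abstract can be illustrated with a minimal single-process sketch. This is not Asteria's implementation: the class name `ShadowPreconditioner`, the `max_staleness` parameter, and the diagonal preconditioner (an elementwise inverse square root standing in for a matrix inverse root) are all assumptions made for brevity. The staleness bound here only caps how old the applied preconditioner may be in a single process; it does not model the paper's distributed protocol.

```python
from concurrent.futures import ThreadPoolExecutor


def inverse_root(stats, eps):
    # Diagonal stand-in (assumption): elementwise (s + eps)^(-1/2)
    # instead of a true matrix inverse root.
    return [(s + eps) ** -0.5 for s in stats]


class ShadowPreconditioner:
    """Hypothetical helper: refreshes the inverse root on a host thread."""

    def __init__(self, dim, max_staleness=2, eps=1e-8):
        self.stats = [0.0] * dim       # accumulated squared gradients
        self.inv_root = [1.0] * dim    # preconditioner currently applied
        self.version = 0               # step of the snapshot behind inv_root
        self.max_staleness = max_staleness
        self.eps = eps
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._pending = None           # (future, snapshot_step) or None

    def accumulate(self, grad):
        self.stats = [s + g * g for s, g in zip(self.stats, grad)]

    def refresh_async(self, step):
        # Snapshot the statistics (the "shadow state") and compute the
        # inverse root in the background while training continues.
        if self._pending is None:
            snapshot = list(self.stats)
            future = self._pool.submit(inverse_root, snapshot, self.eps)
            self._pending = (future, step)

    def precondition(self, grad, step):
        too_stale = step - self.version > self.max_staleness
        if too_stale and self._pending is None:
            self.refresh_async(step)
        if self._pending is not None:
            future, snapshot_step = self._pending
            # Swap in a finished result; block only if the bound is exceeded.
            if future.done() or too_stale:
                self.inv_root = future.result()
                self.version = snapshot_step
                self._pending = None
        return [g * r for g, r in zip(grad, self.inv_root)]

    def close(self):
        self._pool.shutdown(wait=True)
```

In this sketch the training loop would call `accumulate` and `refresh_async` from a hook and `precondition` at update time, so the update path waits on the host only when the applied preconditioner is older than `max_staleness` steps.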
