CurvaDion: Curvature-Adaptive Distributed Orthonormalization
Bhavesh Kumar, Roger Jin, Jeffrey Quesnelle

TL;DR
CurvaDion adaptively reduces communication in distributed training of large language models by using Relative Maximum Momentum Change (RMMC) to detect high-curvature regions, maintaining convergence while significantly lowering synchronization overhead.
Contribution
Introduces RMMC, a novel curvature proxy that enables dynamic synchronization in distributed training and reduces communication by 99% without sacrificing convergence.
Findings
Achieves 99% communication reduction in large-scale training.
Maintains baseline convergence across models from 160M to 1.3B parameters.
Theoretically links RMMC to loss curvature.
Abstract
As language models scale to trillions of parameters, distributed training across many GPUs becomes essential, yet gradient synchronization over high-bandwidth, low-latency networks remains a critical bottleneck. While recent methods like Dion reduce per-step communication through low-rank updates, they synchronize at every step regardless of the optimization landscape. We observe that synchronization requirements vary dramatically throughout training: workers naturally compute similar gradients in flat regions, making frequent synchronization redundant, while high-curvature regions require coordination to prevent divergence. We introduce CurvaDion, which uses Relative Maximum Momentum Change (RMMC) to detect high-curvature regions requiring synchronization. RMMC leverages momentum dynamics, which are already computed during optimization, as a computationally tractable proxy for…
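The truncated abstract above does not give the exact RMMC formula, so the following is only a minimal sketch of the general idea: measure how much the momentum buffers changed relative to their previous size, take the maximum across parameter groups, and synchronize only when that value exceeds a threshold. The function names, the L2-norm ratio, the per-group maximum, and the `threshold` value are all assumptions made for illustration, not the authors' definition.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def relative_momentum_change(prev_momentum, new_momentum, eps=1e-12):
    """Assumed proxy: ||m_t - m_{t-1}|| / ||m_{t-1}|| for one parameter group."""
    diff = [n - p for p, n in zip(prev_momentum, new_momentum)]
    return l2_norm(diff) / (l2_norm(prev_momentum) + eps)


def should_synchronize(prev_momenta, new_momenta, threshold=0.1):
    """Return (sync_flag, score), where score is the largest relative change.

    prev_momenta / new_momenta: lists of flat momentum vectors, one per
    parameter group. The threshold is a hypothetical tuning knob.
    """
    score = max(
        relative_momentum_change(p, n)
        for p, n in zip(prev_momenta, new_momenta)
    )
    return score > threshold, score


# Small momentum change (flat region): skip synchronization.
flat_flag, flat_score = should_synchronize([[1.0, 0.0]], [[1.0, 0.01]])

# Large change in one group (high curvature): trigger synchronization.
sharp_flag, sharp_score = should_synchronize(
    [[1.0, 0.0], [2.0, 0.0]],
    [[1.0, 0.0], [0.0, 2.0]],
)
```

Because the momentum buffers are already maintained by the optimizer, a check of this shape adds only a few norm computations per step; how the real method aggregates the signal across workers is not stated in this summary.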
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Machine Learning in Materials Science · Generative Adversarial Networks and Image Synthesis
