TL;DR
Polar Express is a GPU-efficient method for computing the polar decomposition of a matrix. It targets deep learning uses such as the Muon optimizer, where it converges quickly and improves neural network training results.
Contribution
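We introduce Polar Express, a new method for the polar decomposition and the related matrix sign function that chooses its update rule at each iteration by solving a minimax optimization problem. It is designed for GPU efficiency in deep learning and outperforms existing algorithms in speed and accuracy.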
Findings
Polar Express converges quickly both in the first few iterations and asymptotically.
It improves validation loss when training GPT-2 on large datasets.
The method remains practical in low-precision arithmetic such as bfloat16.
Abstract
Computing the polar decomposition and the related matrix sign function has been a well-studied problem in numerical analysis for decades. Recently, it has emerged as an important subroutine within the Muon optimizer for training deep neural networks. However, the requirements of this application differ sharply from classical settings: deep learning demands GPU-friendly algorithms that prioritize high throughput over high precision. We introduce Polar Express, a new method for computing the polar decomposition. Like Newton-Schulz and other classical polynomial methods, our approach uses only matrix-matrix multiplications, making it very efficient on GPUs. Inspired by earlier work of Chen & Chow and Nakatsukasa & Freund, Polar Express adapts the update rule at each iteration by solving a minimax optimization problem. We prove that this strategy minimizes error in a worst-case sense,…
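
For readers unfamiliar with the polynomial methods the abstract mentions, the sketch below shows the classical cubic Newton-Schulz iteration, which approximates the orthogonal polar factor using only matrix-matrix multiplications. It is the baseline named in the abstract, not Polar Express itself: the fixed coefficients 1.5 and -0.5 are the textbook ones, whereas Polar Express chooses its coefficients per iteration by solving a minimax problem. The function name `newton_schulz_polar`, the Frobenius-norm scaling, and the default iteration count are illustrative assumptions, and the example runs in float64 on CPU with NumPy rather than in bfloat16 on a GPU.

```python
import numpy as np


def newton_schulz_polar(A, num_iters=30):
    """Approximate the orthogonal polar factor of a full-rank matrix A.

    Classical cubic Newton-Schulz iteration (illustrative baseline only;
    not the Polar Express coefficients).
    """
    A = np.asarray(A, dtype=np.float64)

    # Work with a tall matrix so the Gram matrix X.T @ X is the smaller one.
    transposed = A.shape[0] < A.shape[1]
    X = A.T if transposed else A

    # The Frobenius norm upper-bounds the spectral norm, so after scaling
    # every singular value lies in (0, 1], inside the convergence region.
    X = X / (np.linalg.norm(X) + 1e-12)

    identity = np.eye(X.shape[1])
    for _ in range(num_iters):
        # Each singular value s is mapped to 1.5*s - 0.5*s**3,
        # which pushes it toward 1 using only matrix products.
        X = 0.5 * X @ (3.0 * identity - X.T @ X)

    return X.T if transposed else X
```

Calling `newton_schulz_polar(A)` on a full-rank matrix should return a matrix with approximately orthonormal columns (or rows, if `A` is wide). The number of iterations needed grows as the smallest singular value of the scaled input shrinks, and that slow initial phase is what adaptive per-iteration coefficients aim to shorten.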