DASH: Faster Shampoo via Batched Block Preconditioning and Efficient Inverse-Root Solvers

Ionut-Vlad Modoranu; Philip Zmushko; Erik Schultheis; Mher Safaryan; Dan Alistarh

arXiv:2602.02016 · cs.LG · February 3, 2026

Open Access

TL;DR

DASH introduces a faster distributed Shampoo optimizer using batched block preconditioning and novel inverse-root solvers, significantly reducing computation time while maintaining or improving model performance.

Contribution

The paper presents a new implementation of Distributed Shampoo that uses techniques such as 3D block stacking and the Newton-DB iteration to improve efficiency, and it includes a convergence analysis.
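To illustrate what 3D block stacking could look like, here is a minimal NumPy sketch. It is not the paper's implementation: the helper names `stack_blocks` and `batched_gram`, the square tile size, and the assumption that the gradient tiles exactly are all hypothetical choices made for illustration. The idea shown is only that equally sized blocks can be stacked into one `(n, b, b)` tensor so that per-block products run as a single batched matmul instead of a Python loop.

```python
import numpy as np


def stack_blocks(grad, block):
    """Split a 2D gradient into (block x block) tiles, stacked as (n, block, block).

    Hypothetical helper: assumes both dimensions are exact multiples of `block`.
    """
    rows, cols = grad.shape
    if rows % block or cols % block:
        raise ValueError("this sketch assumes the matrix tiles exactly")
    # Axes after reshape: (tile row, row in tile, tile col, col in tile).
    tiles = grad.reshape(rows // block, block, cols // block, block)
    # Bring the two tile indices to the front, then flatten them into one batch axis.
    return tiles.transpose(0, 2, 1, 3).reshape(-1, block, block)


def batched_gram(blocks):
    """Per-block Gram statistics G_i G_i^T and G_i^T G_i, one batched matmul each."""
    blocks_t = blocks.transpose(0, 2, 1)
    left = blocks @ blocks_t    # shape (n, b, b)
    right = blocks_t @ blocks   # shape (n, b, b)
    return left, right


if __name__ == "__main__":
    grad = np.arange(24.0).reshape(4, 6)
    blocks = stack_blocks(grad, block=2)
    left, right = batched_gram(blocks)
    print(blocks.shape, left.shape, right.shape)
```

A single batched call over the stacked tensor gives the same numbers as looping over the blocks one at a time; the potential benefit on a GPU comes from launching one large kernel rather than many small ones.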

Findings

01. Achieves up to 4.83x faster optimizer steps.

02. Newton-DB yields lowest validation perplexity per iteration.

03. Provides in-depth analysis of matrix scaling effects.

Abstract

Shampoo is one of the leading approximate second-order optimizers: a variant of it has won the MLCommons AlgoPerf competition, and it has been shown to produce models with lower activation outliers that are easier to compress. Yet, applying Shampoo currently comes at the cost of significant computational slowdown, due to its expensive internal operations. In this paper, we take a significant step to address this shortcoming by proposing DASH (for Distributed Accelerated SHampoo), a faster implementation of Distributed Shampoo based on two main new techniques: First, we show that preconditioner blocks can be stacked into 3D tensors to significantly improve GPU utilization; second, we introduce the Newton-DB iteration and the Chebyshev polynomial approximations as novel and faster approaches for computing the inverse matrix roots required by Shampoo. Along…
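The abstract does not spell out the Newton-DB iteration, so the sketch below should be read as an assumption rather than the paper's method. It uses the classical Denman-Beavers coupled iteration, which computes a matrix square root and inverse square root together, and applies it to a whole stack of symmetric positive definite blocks at once. The function name `denman_beavers_inv_sqrt`, the fixed iteration count, and the choice of the exponent -1/2 are illustrative; the exponent and the exact update used in DASH may differ.

```python
import numpy as np


def denman_beavers_inv_sqrt(blocks, iters=20):
    """Classical Denman-Beavers iteration on a stack of SPD matrices.

    Assumption: `blocks` has shape (n, b, b) and every block is symmetric
    positive definite. Returns (Y, Z) with Y close to A^{1/2} and Z close
    to A^{-1/2} for each block A.
    """
    n, b, _ = blocks.shape
    y = blocks.astype(np.float64)                    # Y_0 = A (astype makes a copy)
    z = np.broadcast_to(np.eye(b), (n, b, b)).copy()  # Z_0 = I
    for _ in range(iters):
        # np.linalg.inv operates on stacked matrices, so all blocks are
        # inverted in one batched call.
        y_inv = np.linalg.inv(y)
        z_inv = np.linalg.inv(z)
        # Coupled update: both new iterates use the previous Y and Z.
        y, z = 0.5 * (y + z_inv), 0.5 * (z + y_inv)
    return y, z


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    m = rng.standard_normal((3, 4, 4))
    a = m @ m.transpose(0, 2, 1) + 4.0 * np.eye(4)   # well-conditioned SPD blocks
    sqrt_a, inv_sqrt_a = denman_beavers_inv_sqrt(a)
    residual = np.abs(inv_sqrt_a @ a @ inv_sqrt_a - np.eye(4)).max()
    print("max residual:", residual)
```

This version inverts two matrices per step, which is simple to read but not necessarily fast; it is meant only to show how an iterative inverse-root solver can run over stacked blocks.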

Taxonomy

Topics: Tensor decomposition and applications · Stochastic Gradient Optimization Techniques · Model Reduction and Neural Networks