Error whitening: Why Gauss-Newton outperforms Newton
Maricela Best McKay, Nathan P. Lawrence, Brian Wetton, R. Bhushan Gopaluni

TL;DR
This paper explains why Gauss-Newton methods outperform Newton's method by analyzing their function space dynamics and introducing the concept of error whitening, supported by empirical evidence.
Contribution
The paper provides a function space perspective revealing how Gauss-Newton's error whitening property distinguishes it from Newton's method, with empirical validation across various tasks.
Findings
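The generalized Gauss-Newton (GGN) matrix projects the function-space Newton direction onto the model's tangent space, while a Jacobian-only variant projects the function-space loss gradient onto the same space; both remove distortions from the model's parameterization.
Error whitening replaces the $JJ^\top$ matrix, through which the parameterization enters the error dynamics, with the identity. Here $J$ is the Jacobian of the model with respect to its parameters.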
Gauss-Newton optimizers outperform Newton, Adam, and Muon in multiple case studies.
Abstract
The Gauss-Newton matrix is widely viewed as a positive semidefinite approximation of the Hessian, yet mounting empirical evidence shows that Gauss-Newton descent outperforms Newton's method. We adopt a function space perspective to analyze this phenomenon. We show that the generalized Gauss-Newton (GGN) matrix projects the Newton direction in function space onto the model's tangent space, while a Jacobian-only variant obtained by applying the least squares Gauss-Newton matrix to non-least squares losses projects the function space loss gradient onto this same tangent space. Both projections eliminate distortions from the model's parameterization. Specifically, the evolution of the prediction-target mismatch depends on the model's parameterization through the matrix $JJ^\top$, where $J$ is the Jacobian of the model with respect to its parameters. The projections effectively replace…
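The projection claim can be checked numerically in the simplest setting. The sketch below is not the authors' code: it assumes a least squares loss $\tfrac{1}{2}\|e\|^2$ with mismatch $e$, a fixed Jacobian $J$ (a linearized model), and unit step size. The function names `function_space_updates` and `tangent_projector` and the matrix sizes are hypothetical, chosen for illustration. Under these assumptions, a gradient step moves the predictions by $-JJ^\top e$, whereas a Gauss-Newton step moves them by $-Pe$, where $P = JJ^{+}$ is the orthogonal projector onto the column space of $J$ (the tangent space). When $J$ has full row rank, $P$ is the identity.

```python
import numpy as np


def tangent_projector(J):
    """Orthogonal projector onto the column space of J (the tangent space)."""
    # J @ pinv(J) equals J (J^T J)^+ J^T.
    return J @ np.linalg.pinv(J)


def function_space_updates(J, e):
    """First-order change in predictions for one unit step of each method.

    Assumes the least squares loss 0.5 * ||e||^2, whose parameter-space
    gradient is J^T e, and a model linearized around the current parameters.
    """
    # Gradient descent: dtheta = -J^T e, so the predictions move by -J J^T e.
    df_gd = -J @ (J.T @ e)
    # Gauss-Newton: dtheta = -(J^T J)^+ J^T e = -pinv(J) e.
    dtheta_gn = -np.linalg.pinv(J) @ e
    df_gn = J @ dtheta_gn
    return df_gd, df_gn


rng = np.random.default_rng(0)

# Fewer parameters than outputs: the tangent space is a proper subspace.
J_tall = rng.standard_normal((5, 3))
e = rng.standard_normal(5)
df_gd, df_gn = function_space_updates(J_tall, e)
P = tangent_projector(J_tall)

# The Gauss-Newton update is the projected mismatch; J J^T no longer appears.
assert np.allclose(df_gn, -P @ e)
# P is an orthogonal projector: idempotent and symmetric.
assert np.allclose(P @ P, P) and np.allclose(P, P.T)

# More parameters than outputs (full row rank): the projector is the identity.
J_wide = rng.standard_normal((3, 8))
assert np.allclose(tangent_projector(J_wide), np.eye(3))
```

The pseudoinverse is used instead of an explicit inverse of $J^\top J$ so that the same code covers both the tall and the wide Jacobian. The non-least-squares and GGN cases discussed in the abstract are not covered by this sketch.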
