Matrix-Free Least Squares Solvers: Values, Gradients, and What to Do With Them
Hrittik Roy, S{\o}ren Hauberg, Nicholas Kr\"amer

TL;DR
This paper develops differentiable least squares solvers with custom gradients, enabling scalable, constrained, and hyperparameter-tuned applications in machine learning, thus expanding the method's utility beyond traditional linear fitting.
Contribution
It introduces custom gradients for least squares solvers, transforming them into differentiable operators suitable for diverse machine learning tasks.
Findings
Scalability demonstrated on a 50 million parameter model with weight sparsity.
Imposing conservativeness constraints in score-based generative models.
Hyperparameter tuning of Gaussian processes improves predictive performance.
Abstract
This paper argues that the method of least squares has significant unfulfilled potential in modern machine learning, far beyond merely being a tool for fitting linear models. To release its potential, we derive custom gradients that transform the solver into a differentiable operator, like a neural network layer, enabling many diverse applications. Empirically, we demonstrate: (i) scalability by enforcing weight sparsity on a 50 million parameter model; (ii) imposing conservativeness constraints in score-based generative models; and (iii) hyperparameter tuning of Gaussian processes based on predictive performance. By doing this, our work represents the next iteration in developing differentiable linear-algebra tools and making them widely accessible to machine learning practitioners.
Peer Reviews
Decision·Submitted to ICLR 2026
The topic of differentiating through optimization problem is interesting, especially since the resurgence of implicit differentiation-based approach like "Tiny Recursive Models".
Weaknesses: - Clarity: - I did not understand the proposed approach, do authors use a different solver than conjugate gradient? and then implicit differentiation? - What does the costs in Table 1 represent? To my knowledge, direct methods are $O(n^3)$. Is this the memory cost? Is this the computer cost of "one iteration" of each method? - What exactly is the proposed approach? Is it equation 10b? - Novelty: - I do not understand the novelty, especially since authors use an itera
The paper has the following strengths: - **Originality:** Theorem 1 extends beyond prior work by Golub & Pereyra (1973) and Krämer et al. (2024) by providing reverse-mode derivatives for adaptive least-squares solvers with regularization terms, handling both tall and wide cases in a unified framework. The reformulation of Yamashita (1980)'s null-space algorithm as a least-squares projection is particularly creative, enabling practical large-scale constrained optimization on models with up to 50
- **Full Rank Assumption**: The most critical limitation is the full-rank assumption that pervades the entire paper. The paper does not guide diagnosing rank deficiency, choosing regularization λ to stabilize near-singular cases, or understanding what happens numerically when the assumption is violated. Even when matrices are technically full rank, severe ill-conditioning can produce similar numerical failures. Moreover, no experiments report the numerical rank or condition number of $J_c$, leav
- The paper is well-written with concise explanations and helpful supplementary material in the appendix. - The proposed VJP procedure demonstrates strong computational efficiency, as supported by timing comparisons. - The introduced framework and accompanying library provide a practical and flexible means to incorporate constraints optimization setups.
**Nuancing statements** In Section 2.1, matrix-free methods are presented as ideal, yet it must be must highlighted that they also require access to a procedure to evaluate $A^\top u$. When such access is unavailable, matrix-free approaches lose their advantage and can become as expensive as direct linear-system methods. Similarly, Equation (4), which deduce $A^\top u$ via a vjp computation, should be tempered: vjp can introduce non-negligible computational overhead; see [2, Table 7.1]. **Ambig
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsGaussian Processes and Bayesian Inference · Stochastic Gradient Optimization Techniques · Model Reduction and Neural Networks
