Matrix-Free Least Squares Solvers: Values, Gradients, and What to Do With Them

Hrittik Roy; S{\o}ren Hauberg; Nicholas Kr\"amer

arXiv:2510.19634·cs.LG·October 23, 2025

Matrix-Free Least Squares Solvers: Values, Gradients, and What to Do With Them

Hrittik Roy, S{\o}ren Hauberg, Nicholas Kr\"amer

PDF

Open Access 3 Reviews

TL;DR

This paper develops differentiable least squares solvers with custom gradients, enabling scalable, constrained, and hyperparameter-tuned applications in machine learning, thus expanding the method's utility beyond traditional linear fitting.

Contribution

It introduces custom gradients for least squares solvers, transforming them into differentiable operators suitable for diverse machine learning tasks.

Findings

01

Scalability demonstrated on a 50 million parameter model with weight sparsity.

02

Imposing conservativeness constraints in score-based generative models.

03

Hyperparameter tuning of Gaussian processes improves predictive performance.

Abstract

This paper argues that the method of least squares has significant unfulfilled potential in modern machine learning, far beyond merely being a tool for fitting linear models. To release its potential, we derive custom gradients that transform the solver into a differentiable operator, like a neural network layer, enabling many diverse applications. Empirically, we demonstrate: (i) scalability by enforcing weight sparsity on a 50 million parameter model; (ii) imposing conservativeness constraints in score-based generative models; and (iii) hyperparameter tuning of Gaussian processes based on predictive performance. By doing this, our work represents the next iteration in developing differentiable linear-algebra tools and making them widely accessible to machine learning practitioners.

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01Rating 2Confidence 5

Strengths

The topic of differentiating through optimization problem is interesting, especially since the resurgence of implicit differentiation-based approach like "Tiny Recursive Models".

Weaknesses

Weaknesses: - Clarity: - I did not understand the proposed approach, do authors use a different solver than conjugate gradient? and then implicit differentiation? - What does the costs in Table 1 represent? To my knowledge, direct methods are $O(n^3)$. Is this the memory cost? Is this the computer cost of "one iteration" of each method? - What exactly is the proposed approach? Is it equation 10b? - Novelty: - I do not understand the novelty, especially since authors use an itera

Reviewer 02Rating 6Confidence 4

Strengths

The paper has the following strengths: - **Originality:** Theorem 1 extends beyond prior work by Golub & Pereyra (1973) and Krämer et al. (2024) by providing reverse-mode derivatives for adaptive least-squares solvers with regularization terms, handling both tall and wide cases in a unified framework. The reformulation of Yamashita (1980)'s null-space algorithm as a least-squares projection is particularly creative, enabling practical large-scale constrained optimization on models with up to 50

Weaknesses

- **Full Rank Assumption**: The most critical limitation is the full-rank assumption that pervades the entire paper. The paper does not guide diagnosing rank deficiency, choosing regularization λ to stabilize near-singular cases, or understanding what happens numerically when the assumption is violated. Even when matrices are technically full rank, severe ill-conditioning can produce similar numerical failures. Moreover, no experiments report the numerical rank or condition number of $J_c$, leav

Reviewer 03Rating 6Confidence 4

Strengths

- The paper is well-written with concise explanations and helpful supplementary material in the appendix. - The proposed VJP procedure demonstrates strong computational efficiency, as supported by timing comparisons. - The introduced framework and accompanying library provide a practical and flexible means to incorporate constraints optimization setups.

Weaknesses

**Nuancing statements** In Section 2.1, matrix-free methods are presented as ideal, yet it must be must highlighted that they also require access to a procedure to evaluate $A^\top u$. When such access is unavailable, matrix-free approaches lose their advantage and can become as expensive as direct linear-system methods. Similarly, Equation (4), which deduce $A^\top u$ via a vjp computation, should be tempered: vjp can introduce non-negligible computational overhead; see [2, Table 7.1]. **Ambig

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsGaussian Processes and Bayesian Inference · Stochastic Gradient Optimization Techniques · Model Reduction and Neural Networks