Back to Square Roots: An Optimal Bound on the Matrix Factorization Error for Multi-Epoch Differentially Private SGD

Nikita P. Kalinin; Ryan McKenna; Jalaj Upadhyay; Christoph H. Lampert

arXiv:2505.12128·cs.CR·March 3, 2026

TL;DR

This paper introduces BISR (Banded Inverse Square Root), a matrix factorization method for multi-epoch differentially private SGD. It admits an explicit, tight error characterization, is asymptotically optimal, and is simple and efficient in practice.

Contribution

The paper presents BISR, a banded inverse square root factorization that achieves asymptotically optimal error for multi-epoch differentially private training, closing the gap between existing upper and lower bounds on the factorization error.

Findings

1. BISR matches the theoretically optimal error bound asymptotically.
2. Empirically, BISR performs on par with state-of-the-art factorization methods.
3. BISR is simpler to implement, more efficient, and easier to analyze than those methods.

Abstract

Matrix factorization mechanisms for differentially private training have emerged as a promising approach to improve model utility under privacy constraints. In practical settings, models are typically trained over multiple epochs, requiring matrix factorizations that account for repeated participation. Existing theoretical upper and lower bounds on multi-epoch factorization error leave a significant gap. In this work, we introduce a new explicit factorization method, Banded Inverse Square Root (BISR), which imposes a banded structure on the inverse correlation matrix. This factorization enables us to derive an explicit and tight characterization of the multi-epoch error. We further prove that BISR achieves asymptotically optimal error by matching the upper and lower bounds. Empirically, BISR performs on par with state-of-the-art factorization methods, while being simpler to implement,…
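The idea of banding the inverse correlation matrix can be illustrated with a small NumPy sketch. It assumes the workload matrix $A$ is the lower-triangular all-ones (prefix-sum) matrix, whose inverse square root is a lower-triangular Toeplitz matrix built from the Taylor coefficients of $(1 - x)^{1/2}$. The function names (`inv_sqrt_coeffs`, `lower_toeplitz`, `bisr_sketch`) and the truncation rule (keep the first `p` diagonals) are illustrative assumptions, not the authors' code, and the sketch does not compute the multi-epoch error itself.

```python
import numpy as np

def inv_sqrt_coeffs(n):
    """First n Taylor coefficients of (1 - x)^(1/2): 1, -1/2, -1/8, ..."""
    c = np.empty(n)
    c[0] = 1.0
    for k in range(1, n):
        c[k] = c[k - 1] * (2 * k - 3) / (2 * k)
    return c

def lower_toeplitz(coeffs):
    """Lower-triangular Toeplitz matrix with coeffs[k] on the k-th sub-diagonal."""
    n = len(coeffs)
    T = np.zeros((n, n))
    for k, c in enumerate(coeffs):
        T += c * np.eye(n, k=-k)
    return T

def bisr_sketch(n, p):
    """Hypothetical sketch: band A^(-1/2) to p diagonals and factor A = B C.

    Assumes A is the n x n prefix-sum workload (lower-triangular ones).
    """
    coeffs = inv_sqrt_coeffs(n)
    coeffs[p:] = 0.0                  # keep the main diagonal and p - 1 sub-diagonals
    C_inv = lower_toeplitz(coeffs)    # banded inverse correlation matrix
    A = np.tril(np.ones((n, n)))
    C = np.linalg.inv(C_inv)          # correlation matrix (dense in general)
    B = A @ C_inv                     # chosen so that B @ C == A
    return A, B, C, C_inv

A, B, C, C_inv = bisr_sketch(n=16, p=4)
print(np.allclose(B @ C, A))
```

With `p = n` no truncation happens and `C` reduces to the plain square root of `A`; smaller `p` keeps `C_inv` banded, which is the structural property the abstract attributes to BISR.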

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 10 · Confidence 4

Strengths

1. The proposed method, which imposes structure on the inverse of the square root of SGD's workload matrix, is original and very different from existing ideas.
2. A refined theoretical lower bound on the factorization of SGD's workload matrix is provided.
3. The BISR method is shown to match this lower bound.
4. BISR is shown numerically to provide better factorization in many settings than existing methods, and this improved factorization precision results in improved accuracy over existing meth

Weaknesses

The paper is very interesting, and I mostly note strengths in the contributions. Nonetheless, there are some minor weaknesses:
1. No theoretical guarantees for DP-SGD under the BISR matrix factorization are provided.
2. While the theoretical claims hint at a large improvement over the BSR method, this does not always show in practice; studying the respective behaviour of the two methods more precisely (i.e., non-asymptotically) may reveal more subtle compromises.
3. The experiments on CIF

Reviewer 02 · Rating 6 · Confidence 3

Strengths

* **Theoretical contribution.** The paper introduces and discusses a matrix factorization technique with provable optimality, and refines the existing bound (Kalinin and Lampert, 2024). Unlike related work, you provide an explicit dependence on the bandwidth $p$ and on the participation $b$, which leads to more useful guarantees. The idea to consider the inverse correlation matrix instead of the matrix itself is, as far as I can tell, novel and elegant. Moreover, the discussion on an efficient

Weaknesses

* **Clarity and accessibility.** The paper has dense notation and long proofs; intuition could be introduced earlier. For instance, the benefits of inverse banding are not intuitively clear, and visualizations could help here.
* **Low privacy regime.** In your empirical evaluation, you only present results in an arguably low privacy regime, $\epsilon=9$. While this specific value for the privacy budget seems to be common in related literature, it is generally understood to be at the edge of what

Reviewer 03 · Rating 6 · Confidence 3

Strengths

I find the idea of bounding the bandwidth of the matrix $C$ to reduce computational complexity interesting, and the optimal error bound is a solid contribution. I also appreciate the authors' efforts to provide a comprehensive discussion of, and comparison with, prior work.

Weaknesses

1. The algorithmic modifications are relatively minimal. The general structure of the algorithm directly follows that of Kalinin & Lampert (2024), which slightly weakens the contribution of this work.
2. More discussion is needed on the approximation error defined in equation (1). In particular, how is this error related to the convergence error? Would achieving an optimal bound on this error also imply an optimal convergence rate? It would be great if the authors can provide explicit convergence r

Taxonomy

Topics: Privacy-Preserving Technologies in Data · Cryptography and Data Security · Stochastic Gradient Optimization Techniques