Back to Square Roots: An Optimal Bound on the Matrix Factorization Error for Multi-Epoch Differentially Private SGD
Nikita P. Kalinin, Ryan McKenna, Jalaj Upadhyay, Christoph H. Lampert

TL;DR
This paper introduces BISR (Banded Inverse Square Root), a matrix factorization method for multi-epoch differentially private SGD. It comes with an explicit, tight error characterization, is asymptotically optimal, and is simple and efficient in practice.
Contribution
The paper presents BISR, a banded inverse square root factorization that achieves asymptotically optimal error for multi-epoch differentially private training, closing the gap between the existing upper and lower bounds.
Findings
BISR's error matches the theoretical lower bound asymptotically.
BISR performs comparably to state-of-the-art methods empirically.
BISR is simpler, more efficient, and easier to analyze.
Abstract
Matrix factorization mechanisms for differentially private training have emerged as a promising approach to improve model utility under privacy constraints. In practical settings, models are typically trained over multiple epochs, requiring matrix factorizations that account for repeated participation. Existing theoretical upper and lower bounds on multi-epoch factorization error leave a significant gap. In this work, we introduce a new explicit factorization method, Banded Inverse Square Root (BISR), which imposes a banded structure on the inverse correlation matrix. This factorization enables us to derive an explicit and tight characterization of the multi-epoch error. We further prove that BISR achieves asymptotically optimal error by matching the upper and lower bounds. Empirically, BISR performs on par with state-of-the-art factorization methods, while being simpler to implement,…
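The banded-inverse idea in the abstract can be illustrated with a minimal sketch. It assumes, beyond what the text above states, that the workload matrix A is the lower-triangular all-ones (prefix-sum) matrix of plain SGD and that the banded inverse correlation matrix is obtained by keeping the first p Toeplitz coefficients of A^(-1/2), i.e. the Taylor coefficients of (1 - x)^(1/2). The function names and the parameters `n` and `p` are hypothetical, and this is not the authors' implementation.

```python
import numpy as np

def inv_sqrt_coeffs(p):
    """First p Taylor coefficients of (1 - x)^(1/2): 1, -1/2, -1/8, -1/16, ..."""
    c = np.empty(p)
    c[0] = 1.0
    for k in range(1, p):
        c[k] = c[k - 1] * (2 * k - 3) / (2 * k)
    return c

def lower_toeplitz(coeffs, n):
    """n x n lower-triangular Toeplitz matrix with coeffs[k] on the k-th subdiagonal."""
    M = np.zeros((n, n))
    for k, v in enumerate(coeffs[:n]):
        M += v * np.eye(n, k=-k)
    return M

def bisr_sketch(n, p):
    """Sketch of a factorization A = B @ C whose inverse factor C^{-1} is p-banded.

    Assumption: A is the prefix-sum workload of plain SGD (no momentum or
    weight decay), and C^{-1} is A^{-1/2} truncated to bandwidth p.
    """
    A = np.tril(np.ones((n, n)))                       # assumed workload matrix
    C_inv = lower_toeplitz(inv_sqrt_coeffs(p), n)      # banded inverse correlation matrix
    C = np.linalg.inv(C_inv)                           # correlation matrix (dense in general)
    B = A @ C_inv                                      # so that B @ C == A
    return A, B, C, C_inv

A, B, C, C_inv = bisr_sketch(n=8, p=3)
print(np.allclose(B @ C, A))
```

Because C^{-1} is a banded Toeplitz matrix in this sketch, multiplying a vector by it only needs the p most recent entries per step, which is one way to read the efficiency claim. Setting p = n recovers the unbanded inverse square root.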
Peer Reviews
Decision·ICLR 2026 Poster
1. The proposed method, which imposes structure on the inverse of the square root of SGD's workload matrix, is original and very different from existing ideas.
2. A refined theoretical lower bound on the factorization of SGD's workload matrix is provided.
3. The BISR method is shown to match this lower bound.
4. BISR is shown numerically to provide a better factorization than existing methods in many settings, and this improved factorization precision results in improved accuracy over existing meth
The paper is very interesting, and I mostly see strengths in its contributions. Nonetheless, there are some minor weaknesses:
1. No theoretical guarantees for DP-SGD under the BISR matrix factorization are provided.
2. While the theoretical claims hint at a large improvement over the BSR method, this does not always show in practice; studying the respective behaviour of the two methods more precisely (i.e., non-asymptotically) may reveal more subtle trade-offs.
3. The experiments on CIF
* **Theoretical contribution.** The paper introduces and discusses a matrix factorization technique with provable optimality, and refines the existing bound of Kalinin and Lampert (2024). Unlike related work, you provide an explicit dependence on the bandwidth $p$ and on the participation $b$, which leads to more useful guarantees. The idea of considering the inverse correlation matrix instead of the matrix itself is, as far as I can tell, novel and elegant. Moreover, the discussion on an efficient
* **Clarity and accessibility.** The paper has dense notation and long proofs; intuition could be introduced earlier. For instance, the benefits of inverse banding are not intuitively clear, and visualizations could help here.
* **Low privacy regime.** In your empirical evaluation, you only present results in an arguably low privacy regime, $\epsilon=9$. While this specific value for the privacy budget seems to be common in related literature, it is generally understood to be at the edge of what
I find the idea of bounding the bandwidth of the matrix $C$ to reduce computational complexity interesting, and the optimal error bound is a solid contribution. I also appreciate the authors' effort to provide a comprehensive discussion and comparison with prior work.
1. The algorithmic modifications are relatively minimal. The general structure of the algorithm directly follows that of Kalinin & Lampert (2024), which slightly weakens the contribution of this work.
2. More discussion is needed on the approximation error defined in equation (1). In particular, how is this error related to the convergence error? Would achieving an optimal bound on this error also imply an optimal convergence rate? It would be great if the authors can provide explicit convergence r
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Cryptography and Data Security · Stochastic Gradient Optimization Techniques
