Memory-Efficient Differentially Private Training with Gradient Random Projection

Alex Mulrooney; Devansh Gupta; James Flemings; Huanyu Zhang; Murali Annavaram; Meisam Razaviyayn; Xinwei Zhang

arXiv:2506.15588·cs.LG·May 19, 2026


TL;DR

DP-GRAPE is a memory-efficient differentially private training method that replaces costly SVD computations with random projections, maintaining utility while significantly reducing memory usage.

Contribution

Introduces DP-GRAPE, a novel DP training approach using random Gaussian projections to eliminate SVD, enabling scalable, memory-efficient privacy-preserving neural network training.

Findings

01

Reduces memory usage by over 63% for Vision Transformers.

02

Achieves over 70% memory reduction when fine-tuning RoBERTa-Large.

03

Scales to large models like OPT with 6.7 billion parameters.

Abstract

Differential privacy (DP) protects sensitive data during neural network training, but standard methods like DP-Adam suffer from high memory overhead due to per-sample gradient clipping, limiting scalability. We introduce DP-GRAPE (Gradient RAndom ProjEction), a DP training method that significantly reduces memory usage while maintaining utility on par with first-order DP approaches. DP-GRAPE is motivated by our finding that privatization flattens the gradient singular value spectrum, making SVD-based projections (as in GaLore (Zhao et al., 2024)) unnecessary. Consequently, DP-GRAPE employs three key components: (1) random Gaussian matrices replace SVD-based subspaces, (2) gradients are privatized after projection, and (3) projection is applied during backpropagation. These contributions eliminate the need for costly SVD computations, enable substantial memory savings, and lead to…
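The first two components named in the abstract (random Gaussian projection, then privatization in the projected space) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function name `project_then_privatize`, the N(0, 1/r) scaling of the projection matrix, and the clip-sum-noise-average recipe borrowed from standard DP-SGD are assumptions made for illustration. The third component (projecting during backpropagation) and the optimizer update are omitted.

```python
import numpy as np


def project_then_privatize(per_sample_grads, r, clip_norm, noise_multiplier, rng):
    """Hypothetical sketch: project per-sample gradients, then clip and add noise.

    per_sample_grads: array of shape (B, m, n), one (m, n) gradient per sample.
    Returns the noisy mean gradient in the projected space, shape (r, n),
    together with the projection matrix P of shape (m, r).
    """
    B, m, n = per_sample_grads.shape

    # Random Gaussian projection instead of an SVD-based subspace.
    # Assumed scaling: entries ~ N(0, 1/r), so that E[P @ P.T] is the identity.
    P = rng.normal(0.0, 1.0 / np.sqrt(r), size=(m, r))

    # Project every per-sample gradient: P.T @ G_i has shape (r, n).
    projected = np.einsum("mr,bmn->brn", P, per_sample_grads)

    # Clip each projected gradient to Frobenius norm at most clip_norm.
    norms = np.linalg.norm(projected.reshape(B, -1), axis=1)
    scale = np.minimum(1.0, clip_norm / (norms + 1e-12))
    clipped = projected * scale[:, None, None]

    # Gaussian mechanism applied in the low-dimensional space.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=(r, n))
    noisy_mean = (clipped.sum(axis=0) + noise) / B
    return noisy_mean, P


rng = np.random.default_rng(0)
grads = rng.normal(size=(8, 64, 32))  # toy per-sample gradients
noisy, P = project_then_privatize(grads, r=4, clip_norm=1.0,
                                  noise_multiplier=1.0, rng=rng)
update = P @ noisy  # map back to the original (m, n) parameter shape
```

In this sketch the clipping and the noise act on r x n arrays rather than m x n arrays, so the per-sample buffers shrink by roughly a factor of m / r. That is the intuition behind the memory savings. The exact figures reported above depend on details this sketch does not model.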

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- Uses random projections (DP-GRAPE) instead of SVD-based projections, which is memory efficient.
- DP-GRAPE (Gradient RAndom ProjEction) achieves a privacy-utility trade-off comparable to DP-SGD.
- The experimental margins are significant in terms of memory reduction, while accuracy is preserved.

Weaknesses

- Comparisons with SOTA methods and other subspace methods are not sufficient.
- The robustness analysis for failure cases is missing.

Reviewer 02 · Rating 6 · Confidence 3

Strengths

The observation about spectral flattening is novel and provides a principled reason to abandon SVD-based projections. The authors provide a theoretical privacy and convergence analysis for DP-GRAPE, which is non-trivial due to the introduction of random projections. Evaluations cover both CV (ViT pre-training) and NLP (RoBERTa, OPT). Achieves large-scale DP training (OPT, 6.7B). Memory savings in training are considerable: it cuts memory by over 63% in Vision Transformer training and 70% in R

Weaknesses

The privacy guarantee under random projections with unbounded entries is described informally. A more rigorous sensitivity or RDP proof sketch is needed. DP-GRAPE’s algorithm is more complex to implement than vanilla DP-SGD/DP-Adam. I'm not sure how practical it would be to implement. No code is mentioned.

Reviewer 03 · Rating 4 · Confidence 4

Strengths

Originality. The paper advances DP training by coupling project-then-privatize gradient handling with random low-rank projections, motivated by the observation that privatization flattens the gradient spectrum. Quality. The paper provides rigorous theoretical guarantees and offers reproducible implementation details and hyperparameter guidance. Clarity. Figures, tables, and the presentation of the algorithm are clear with consistent notation that makes the method easy to follow. Significance. D

Weaknesses

Limited novelty (main concern). Algorithmically, the core move—projecting gradients into a low-dimensional subspace and then privatizing—is a direct transplant of low-rank / random-projection ideas into the DP setting; the paper does not introduce a fundamentally new optimization principle. On the theory side, the guarantees largely read as an incremental generalization of standard DP-SGD analyses to the projected case. Missing head-to-head experiments with the methods surveyed in Table 1. Tabl

Code & Models

Repositories

alexmul1114/DP_GRAPE
github

Taxonomy

Topics: Privacy-Preserving Technologies in Data · Stochastic Gradient Optimization Techniques · Machine Learning and Algorithms