Adaptive Methods Are Preferable in High Privacy Settings: An SDE Perspective
Enea Monzio Compagnoni, Alessandro Stanghellini, Rustem Islamov, Aurelien Lucchi, Anastasiia Koloskova

TL;DR
This paper uses stochastic differential equations to analyze how adaptive optimization methods outperform non-adaptive ones like DP-SGD in high-privacy regimes, showing adaptive methods' hyperparameters transfer better across privacy levels.
Contribution
It provides the first SDE-based analysis of private optimizers, revealing the advantages of adaptive methods in high privacy settings and their practical hyperparameter transferability.
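For orientation, SDE analyses of optimizers usually start from the standard continuous-time model of plain (non-private) SGD with learning rate $\eta$. The form below is that textbook approximation, shown only as background. It is not the paper's DP-specific equation, and the symbols $f$ (loss), $\Sigma$ (gradient-noise covariance), and $W_t$ (Brownian motion) are notation assumed here.

```latex
dX_t = -\nabla f(X_t)\,dt + \sqrt{\eta}\,\Sigma(X_t)^{1/2}\,dW_t
```

In a private setting, one would expect the clipping and the injected Gaussian noise to alter the drift and diffusion terms. How exactly they do so is what the paper derives, and it is not reproduced here.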
Findings
DP-SGD converges with a privacy-utility trade-off of O(1/ε^2)
DP-SignSGD converges with a trade-off of O(1/ε)
Adaptive methods like DP-Adam are more practical across privacy levels
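The two update rules being compared can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes DP-SignSGD takes the sign of the same clipped, noised average gradient that DP-SGD uses. The function names and the `noise_multiplier` parameter are hypothetical, and a smaller ε generally calls for a larger `noise_multiplier`.

```python
import numpy as np


def clip_per_example(grads, clip_norm):
    """Rescale each row of grads (shape [B, d]) so its L2 norm is at most clip_norm."""
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    return grads * scale


def private_gradient(grads, clip_norm, noise_multiplier, rng):
    """Sum clipped per-example gradients, add Gaussian noise, then average over the batch."""
    batch_size, dim = grads.shape
    clipped_sum = clip_per_example(grads, clip_norm).sum(axis=0)
    # Noise standard deviation is noise_multiplier * clip_norm (assumed calibration).
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=dim)
    return (clipped_sum + noise) / batch_size


def dp_sgd_step(params, grads, lr, clip_norm, noise_multiplier, rng):
    """DP-SGD: step along the noisy clipped average gradient."""
    return params - lr * private_gradient(grads, clip_norm, noise_multiplier, rng)


def dp_signsgd_step(params, grads, lr, clip_norm, noise_multiplier, rng):
    """DP-SignSGD (assumed form): step along the coordinate-wise sign of the same quantity."""
    g = private_gradient(grads, clip_norm, noise_multiplier, rng)
    return params - lr * np.sign(g)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    per_example_grads = rng.normal(size=(8, 5))  # toy batch: 8 examples, 5 parameters
    theta = np.zeros(5)
    print(dp_sgd_step(theta, per_example_grads, 0.1, 1.0, 2.0, rng))
    print(dp_signsgd_step(theta, per_example_grads, 0.1, 1.0, 2.0, rng))
```

In this sketch, each coordinate of the sign-based step has magnitude exactly `lr` however large the injected noise is, whereas the DP-SGD step grows with the noise. That gives an informal intuition for why the two methods react differently as ε shrinks.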
Abstract
Differential Privacy (DP) is becoming central to large-scale training as privacy regulations tighten. We revisit how DP noise interacts with adaptivity in optimization through the lens of stochastic differential equations, providing the first SDE-based analysis of private optimizers. Focusing on DP-SGD and DP-SignSGD under per-example clipping, we show a sharp contrast under fixed hyperparameters: DP-SGD converges at a Privacy-Utility Trade-Off of $O(1/\varepsilon^2)$ with speed independent of $\varepsilon$, while DP-SignSGD converges at a speed linear in $\varepsilon$ with an $O(1/\varepsilon)$ trade-off, dominating in high-privacy or large batch noise regimes. By contrast, under optimal learning rates, both methods achieve comparable theoretical asymptotic performance; however, the optimal learning rate of DP-SGD scales linearly with $\varepsilon$, while that of…
Peer Reviews
Decision·ICLR 2026 Poster
- An SDE-based analysis of differentially private optimizers that exposes how DP noise interacts with adaptivity and batch noise.
- DP-SGD is shown to converge at a speed independent of ε.
- DP-SignSGD: its convergence speed scales linearly in ε, while its privacy-utility trade-off scales as O(1/ε).
- The assumptions on SNR (signal-to-noise ratio) are built on linear approximations that are only valid in a high-noise, low-signal regime.
- A general Student-t distribution for batch noise is used to capture heavy tails, but it is not used consistently in Assumption B.2.
- The experimental validation for Protocol B on the StackOverflow dataset is missing.
- The paper investigates the optimization process of differentially private learning in terms of SDEs, a perspective that has not been actively investigated.
- The authors provide a theoretical analysis of why DP-SGD and DP-SignSGD differ in training dynamics, especially with respect to hyperparameter setups.
- Based on their observations, the authors propose two protocols that cover both fixed and tuned hyperparameters.
Please refer to the Questions section.
- The paper is well written and seems to be of very high quality.
- The fact that the SDE view is able to capture the experimental behaviour that DP-SGD has $\varepsilon^{-2}$ behaviour for small $\varepsilon$-values whereas DP-Adam and DP-SignSGD have $\varepsilon^{-1}$ behaviour (see Figure 1) is very impressive.
- The SDE view is well motivated and also commonly considered in the literature (e.g., Blei et al. 2018).
- The paper focuses on only a few adaptive optimizers, and I am a bit surprised about their choices: DP-Adam (Adam with DP gradients) and DP-SignSGD (which is not that well known). The reason might be that the analysis is amenable to them (questions below), and I think the contribution is very valuable nevertheless.
- Due to the fact that very few adaptive optimizers seem to actually fit into this SDE framework (or can be seen as discretizations of SDEs, meaning that they weakly converge to the
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Stochastic Gradient Optimization Techniques · Advanced Bandit Algorithms Research
