Adaptive Methods Are Preferable in High Privacy Settings: An SDE Perspective

Enea Monzio Compagnoni; Alessandro Stanghellini; Rustem Islamov; Aurelien Lucchi; Anastasiia Koloskova

arXiv:2603.03226·cs.LG·March 4, 2026

Open Access · 3 Reviews

TL;DR

This paper uses stochastic differential equations to analyze how adaptive optimization methods outperform non-adaptive ones like DP-SGD in high-privacy regimes, showing adaptive methods' hyperparameters transfer better across privacy levels.

Contribution

It provides the first SDE-based analysis of private optimizers, revealing the advantages of adaptive methods in high privacy settings and their practical hyperparameter transferability.

Findings

01

DP-SGD converges with a privacy-utility trade-off of O(1/ε^2)

02

DP-SignSGD converges with a trade-off of O(1/ε)

03

Adaptive methods like DP-Adam are more practical across privacy levels

Abstract

Differential Privacy (DP) is becoming central to large-scale training as privacy regulations tighten. We revisit how DP noise interacts with adaptivity in optimization through the lens of stochastic differential equations, providing the first SDE-based analysis of private optimizers. Focusing on DP-SGD and DP-SignSGD under per-example clipping, we show a sharp contrast under fixed hyperparameters: DP-SGD converges at a Privacy-Utility Trade-Off of $O(1/\varepsilon^{2})$ with speed independent of $\varepsilon$, while DP-SignSGD converges at a speed linear in $\varepsilon$ with an $O(1/\varepsilon)$ trade-off, dominating in high-privacy or large batch noise regimes. By contrast, under optimal learning rates, both methods achieve comparable theoretical asymptotic performance; however, the optimal learning rate of DP-SGD scales linearly with $\varepsilon$, while that of…
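To make the two update rules in the abstract concrete, here is a minimal NumPy sketch of one DP-SGD step and one DP-SignSGD step under per-example clipping. It follows the standard textbook form of these methods (clip each example's gradient, sum, add Gaussian noise, average), not necessarily the exact formulation analyzed in the paper. The function names and the parameters `clip_norm` and `noise_multiplier` are assumptions introduced for illustration; the link between `noise_multiplier` and $\varepsilon$ is left to a privacy accountant and is not modeled here.

```python
import numpy as np


def privatize(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each example's gradient, sum, add Gaussian noise, then average.

    per_example_grads has shape (batch_size, dim). All names here are
    illustrative assumptions, not taken from the paper.
    """
    batch_size, dim = per_example_grads.shape
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    # Scale down only the gradients whose norm exceeds clip_norm.
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped_sum = (per_example_grads * scale).sum(axis=0)
    # Gaussian noise with standard deviation noise_multiplier * clip_norm.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=dim)
    return (clipped_sum + noise) / batch_size


def dp_sgd_step(params, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """Non-adaptive update: move along the noisy averaged gradient."""
    g = privatize(per_example_grads, clip_norm, noise_multiplier, rng)
    return params - lr * g


def dp_signsgd_step(params, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """Sign-based update: only the sign of each noisy coordinate is used."""
    g = privatize(per_example_grads, clip_norm, noise_multiplier, rng)
    return params - lr * np.sign(g)
```

In this sketch the DP-SGD step size grows with the magnitude of the injected noise, whereas the DP-SignSGD step moves every coordinate by exactly `lr` regardless of how large the noise is. That difference in how noise enters the update is the kind of interaction between DP noise and adaptivity that the abstract describes.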

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 3

Strengths

- SDE-based analysis of differentially private optimizers, using this framework to expose how DP noise interacts with adaptivity and batch noise.
- DP-SGD is shown to converge at a speed independent of ε.
- DP-SignSGD: its convergence speed scales linearly in ε, while its privacy-utility trade-off scales as O(1/ε).

Weaknesses

- The assumptions on SNR (signal-to-noise ratio) are built on linear approximations that are only valid in a high-noise, low-signal regime.
- A general Student-t distribution for batch noise is used to capture heavy tails, but it is not used consistently in Assumption B.2.
- The experimental validation for Protocol B on the StackOverflow dataset is missing.

Reviewer 02 · Rating 6 · Confidence 4

Strengths

- The paper investigates the optimization process of differentially private learning in terms of SDEs, which has not been actively investigated.
- The authors provide a theoretical analysis of why DP-SGD and DP-SignSGD differ in training dynamics, especially with respect to hyperparameter setups.
- Based on their observations, the authors propose two protocols that cover both fixed and tuned hyperparameters.

Weaknesses

Please refer to the Questions section.

Reviewer 03 · Rating 6 · Confidence 3

Strengths

- The paper is well written and seems to be of very high quality.
- The fact that the SDE view is able to capture the experimental behaviour that DP-SGD has $\varepsilon^{-2}$ behaviour for small $\varepsilon$-values whereas DP-Adam and DP-SignSGD have $\varepsilon^{-1}$ behaviour (see Figure 1) is very impressive.
- The SDE view is well motivated and also commonly considered in the literature (e.g., Blei et al. 2018).

Weaknesses

- The paper focuses on only a few adaptive optimizers, and I am a bit surprised by their choices: DP-Adam (Adam with DP gradients) and DP-SignSGD (which is not that well-known). The reason might be that the analysis is amenable to them (questions below), and I think the contribution is very valuable nevertheless.
- Due to the fact that very few adaptive optimizers seem to actually fit into this SDE framework (or can be seen as discretizations of SDEs, meaning that they weakly converge to the…

Taxonomy

Topics: Privacy-Preserving Technologies in Data · Stochastic Gradient Optimization Techniques · Advanced Bandit Algorithms Research