Differentially Private Algorithms for the Stochastic Saddle Point Problem with Optimal Rates for the Strong Gap

Raef Bassily; Cristóbal Guzmán; Michael Menart

arXiv:2302.12909·cs.LG·June 30, 2023

TL;DR

This paper develops differentially private algorithms for stochastic saddle point problems, achieving near-optimal convergence rates and analyzing the tradeoff between stability and accuracy in such settings.

Contribution

It gives a novel implementation and analysis of the recursive regularization technique, repurposed for saddle point problems under differential privacy constraints, achieving nearly optimal rates for the strong gap and providing a general algorithmic framework.

Findings

01

Achieves nearly optimal strong gap rates of $\tilde{O}\big(\frac{1}{\sqrt{n}}+\frac{\sqrt{d}}{n\epsilon}\big)$

02

Develops a general algorithm with $O\big(\min\big\{\frac{n^{2}\epsilon^{1.5}}{\sqrt{d}},n^{3/2}\big\}\big)$ gradient complexity

03

Establishes a fundamental tradeoff between stability and accuracy in differentially private algorithms.

Abstract

We show that convex-concave Lipschitz stochastic saddle point problems (also known as stochastic minimax optimization) can be solved under the constraint of $(\epsilon,\delta)$-differential privacy with strong (primal-dual) gap rate of $\tilde{O}\big(\frac{1}{\sqrt{n}}+\frac{\sqrt{d}}{n\epsilon}\big)$, where $n$ is the dataset size and $d$ is the dimension of the problem. This rate is nearly optimal, based on existing lower bounds in differentially private stochastic optimization. Specifically, we prove a tight upper bound on the strong gap via novel implementation and analysis of the recursive regularization technique repurposed for saddle point problems. We show that this rate can be attained with $O\big(\min\big\{\frac{n^{2}\epsilon^{1.5}}{\sqrt{d}},n^{3/2}\big\}\big)$ gradient complexity, and $\tilde{O}(n)$ gradient complexity if the loss function is smooth. As a byproduct of…

Equations

\min_{w\in\mathcal{W}}\max_{\theta\in\Theta}\Big\{F_{\mathcal{D}}(w,\theta):=\mathbb{E}_{x\sim\mathcal{D}}[f(w,\theta;x)]\Big\},

\operatorname{Gap}(\mathcal{A})

\operatorname{Gap}_{\mathsf{weak}}(\mathcal{A})

|f(w_{1},\theta_{1};x)-f(w_{2},\theta_{2};x)|\leq L\left\|[w_{1},\theta_{1}]-[w_{2},\theta_{2}]\right\|

\left\|\nabla_{[w,\theta]}f(w_{1},\theta_{1};x)-\nabla_{[w,\theta]}f(w_{2},\theta_{2};x)\right\|\leq\beta\left\|[w_{1},\theta_{1}]-[w_{2},\theta_{2}]\right\|.

(w,\theta)\mapsto\frac{1}{n}\sum_{z\in S}f(w,\theta;z)+\frac{\lambda}{2}\left\|w-\hat{w}\right\|^{2}-\frac{\lambda}{2}\|\theta-\hat{\theta}\|^{2}.

\operatorname{Gap}(\mathcal{R})=O\Big(\log(n)B\hat{\alpha}+\frac{\log^{3/2}(n)BL}{\sqrt{n}}\Big).

\widehat{\operatorname{Gap}}\Big(\underset{\mathcal{A},S}{\mathbb{E}}[\mathcal{A}_{w}(S)],\underset{\mathcal{A},S}{\mathbb{E}}[\mathcal{A}_{\theta}(S)]\Big)\leq\operatorname{Gap}_{\mathsf{weak}}(\mathcal{A})\leq\underset{S}{\mathbb{E}}[\operatorname{Gap}_{S}(\mathcal{A})]+\Delta L.

\widehat{\operatorname{Gap}}(w_{T}^{*},\theta_{T}^{*})

G_{\mathcal{D}}(w_{T}^{*},\theta_{T}^{*})

\mathbb{E}\big[\widehat{\operatorname{Gap}}(w_{T}^{*},\theta_{T}^{*})\big]\leq 4B\cdot\mathbb{E}\Big[\lambda\sum_{t=0}^{T-1}2^{t}\left\|[w_{T}^{*},\theta_{T}^{*}]-[\bar{w}_{t},\bar{\theta}_{t}]\right\|\Big]

\overset{(i)}{\leq}4B\cdot\mathbb{E}\Big[\lambda\sum_{t=0}^{T-1}2^{t}\Big(\left\|[w_{t+1}^{*},\theta_{t+1}^{*}]-[\bar{w}_{t},\bar{\theta}_{t}]\right\|+\sum_{r=t+1}^{T-1}\left\|[w_{r+1}^{*},\theta_{r+1}^{*}]-[w_{r}^{*},\theta_{r}^{*}]\right\|\Big)\Big]

\overset{(ii)}{=}O\Big(B\sum_{t=0}^{T-1}2^{t}\lambda\,\mathbb{E}\big[\left\|[w_{t+1}^{*},\theta_{t+1}^{*}]-[\bar{w}_{t},\bar{\theta}_{t}]\right\|\big]+B\sum_{t=1}^{T-1}2^{t}\lambda\,\mathbb{E}\big[\left\|[\bar{w}_{t},\bar{\theta}_{t}]-[w_{t}^{*},\theta_{t}^{*}]\right\|\big]\Big)

=O\Big(B\sum_{t=0}^{T-1}2^{t}\lambda\frac{B}{2^{t}}+B\sum_{t=1}^{T-1}2^{t}\lambda\frac{B}{2^{t}}\Big)=O(T\lambda B^{2})=O\Big(\frac{\log_{2}(n)BL}{\sqrt{n^{\prime}}}\Big),

\operatorname{Gap}(\mathcal{R})=O\Big(\frac{\log^{3/2}(n)BL}{\sqrt{n}}+\frac{\log^{2}(n)BL\sqrt{d\log(1/\delta)}}{n\epsilon}\Big).

\operatorname{Gap}(\mathcal{R})=O\big(\log(n)B^{2}\lambda\big)=O\Big(\frac{\log^{3/2}(n)BL}{\sqrt{n}}+\frac{\log^{2}(n)BL\sqrt{d\log(2/\delta)}}{n\epsilon}\Big).

\mathbb{E}\big[\left\|[\bar{w}_{t},\bar{\theta}_{t}]-[w_{S,t}^{*},\theta_{S,t}^{*}]\right\|^{2}\big]

\leq d\sigma_{t}^{2}+\Big(\frac{\delta}{5}\cdot\frac{L}{2^{t}\lambda n^{\prime}}\Big)^{2}

\leq\frac{64dL^{2}\log(2/\delta)}{2^{2t}\lambda^{2}(n^{\prime})^{2}\epsilon^{2}}+\frac{B^{2}}{25\cdot 2^{2t}}\leq\frac{B^{2}}{12\cdot 2^{2t}}.

\frac{\delta^{2}}{25}\cdot\frac{L^{2}}{2^{2t}\lambda(n^{\prime})^{2}}=\Omega\Big(\frac{\delta^{2}L^{2}}{2^{2T}\lambda(n^{\prime})^{2}}\Big)=\Omega\Big(\frac{B^{2}\lambda^{2}}{L^{2}}\cdot\frac{\delta^{2}L^{2}}{\lambda(n^{\prime})^{2}}\Big)=\Omega\Big(\frac{\delta^{2}BL}{n^{2.5}}\Big)

\widehat{\operatorname{Gap}}(\bar{w},\bar{\theta})-\widehat{\operatorname{Gap}}(\bar{w}^{\prime},\bar{\theta}^{\prime})

\nabla_{t}=g(w_{t-1},\theta_{t-1};x_{t})+\xi_{t}

\widehat{\operatorname{Gap}}\Big(\underset{\mathcal{A},S}{\mathbb{E}}[\mathcal{A}_{w}(S)],\underset{\mathcal{A},S}{\mathbb{E}}[\mathcal{A}_{\theta}(S)]\Big)

=\max_{\theta\in\Theta}\Big\{F_{\mathcal{D}}\Big(\underset{\hat{S}\sim\mathcal{D}^{n},\mathcal{A}_{w}}{\mathbb{E}}\big[\mathcal{A}_{w}(\hat{S})\big],\theta\Big)\Big\}-\min_{w\in\mathcal{W}}\Big\{F_{\mathcal{D}}\Big(w,\underset{\hat{S}\sim\mathcal{D}^{n},\mathcal{A}_{\theta}}{\mathbb{E}}\big[\mathcal{A}_{\theta}(\hat{S})\big]\Big)\Big\}

\leq\max_{\theta\in\Theta}\Big\{\underset{\hat{S}\sim\mathcal{D}^{n},\mathcal{A}_{w}}{\mathbb{E}}\big[F_{\mathcal{D}}(\mathcal{A}_{w}(\hat{S}),\theta)\big]\Big\}-\min_{w\in\mathcal{W}}\Big\{\underset{\hat{S}\sim\mathcal{D}^{n},\mathcal{A}_{\theta}}{\mathbb{E}}\big[F_{\mathcal{D}}(w,\mathcal{A}_{\theta}(\hat{S}))\big]\Big\}

=\operatorname{Gap}_{\mathsf{weak}}(\mathcal{A}).

\operatorname{Gap}(\mathcal{R})=O\big(\log(n)B^{2}\lambda\big)

E_{t}=\mathbb{E}\big[\left\|[\bar{w}_{t},\bar{\theta}_{t}]-[w_{S,t}^{*},\theta_{S,t}^{*}]\right\|^{2}\big],

\mathbb{E}\big[\left\|[\bar{w}_{t},\bar{\theta}_{t}]-[w_{t}^{*},\theta_{t}^{*}]\right\|^{2}\big]

\leq 3\Big(\mathbb{E}\big[\left\|[\bar{w}_{t},\bar{\theta}_{t}]-[w^{*}_{S,t},\theta^{*}_{S,t}]\right\|^{2}\big]+\mathbb{E}\big[\left\|[w^{*}_{S,t},\theta^{*}_{S,t}]-[\widetilde{w}_{t},\widetilde{\theta}_{t}]\right\|^{2}\big]+\mathbb{E}\big[\left\|[\widetilde{w}_{t},\widetilde{\theta}_{t}]-[w_{t}^{*},\theta_{t}^{*}]\right\|^{2}\big]\Big)

\leq 3\Big(\underbrace{\mathbb{E}\big[\left\|[\bar{w}_{t},\bar{\theta}_{t}]-[w^{*}_{S,t},\theta^{*}_{S,t}]\right\|^{2}\big]}_{E_{t}}+\underbrace{\mathbb{E}\big[\left\|[w^{*}_{S,t},\theta^{*}_{S,t}]-[\widetilde{w}_{t},\widetilde{\theta}_{t}]\right\|^{2}\big]}_{F_{t}}+\underbrace{\frac{1}{2^{t}\lambda}\mathbb{E}\big[\widehat{\operatorname{Gap}}^{(t)}(\widetilde{w}_{t},\widetilde{\theta}_{t})\big]}_{G_{t}}\Big).

\mathbb{E}\big[\left\|[w_{S,t}^{*},\theta_{S,t}^{*}]-[\widetilde{w}_{t},\widetilde{\theta}_{t}]\right\|^{2}\big]\leq\frac{L^{2}}{2^{2t}\lambda^{2}n^{\prime}}\leq\frac{B^{2}L^{2}}{2304\cdot 2^{2t}(L/\sqrt{n^{\prime}})^{2}n^{\prime}}=\frac{B^{2}}{2304\cdot 2^{2t}}.

\frac{1}{2^{t}\lambda}\mathbb{E}\big[\widehat{\operatorname{Gap}}^{(t)}(\widetilde{w}_{t},\widetilde{\theta}_{t})\big]

\leq\frac{1}{2^{t}\lambda}\Big(\mathbb{E}\Big[\mathbb{E}\Big[\widehat{\operatorname{Gap}}_{S}^{(t)}(w^{*}_{S,t},\theta^{*}_{S,t})\,\Big|\,\mathcal{F}_{t-1}\Big]\Big]+\frac{L^{2}}{2^{t}\lambda n^{\prime}}\Big)

=\frac{L^{2}}{2^{2t}\lambda^{2}n^{\prime}}\leq\frac{B^{2}}{2304\cdot 2^{2t}}.

\mathbb{E}\big[\left\|[w_{t}^{*},\theta_{t}^{*}]-[\bar{w}_{t-1},\bar{\theta}_{t-1}]\right\|^{2}\big]

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Privacy-Preserving Technologies in Data · Sparse and Compressive Sensing Techniques

Full text

Raef Bassily, Department of Computer Science & Engineering and the Translational Data Analytics Institute (TDAI), The Ohio State University

Cristóbal Guzmán, Institute for Mathematical and Computational Engineering, Faculty of Mathematics and School of Engineering, Pontificia Universidad Católica de Chile

Michael Menart, Department of Computer Science & Engineering, The Ohio State University

Abstract

We show that convex-concave Lipschitz stochastic saddle point problems (also known as stochastic minimax optimization) can be solved under the constraint of $(\epsilon,\delta)$ -differential privacy with strong (primal-dual) gap rate of $\tilde{O}\big{(}\frac{1}{\sqrt{n}}+\frac{\sqrt{d}}{n\epsilon}\big{)}$ , where $n$ is the dataset size and $d$ is the dimension of the problem. This rate is nearly optimal, based on existing lower bounds in differentially private stochastic convex optimization. Specifically, we prove a tight upper bound on the strong gap via novel implementation and analysis of the recursive regularization technique repurposed for saddle point problems. We show that this rate can be attained with $O\big{(}\min\big{\{}\frac{n^{2}\epsilon^{1.5}}{\sqrt{d}},n^{3/2}\big{\}}\big{)}$ gradient complexity, and $\tilde{O}(n)$ gradient complexity if the loss function is smooth. As a byproduct of our method, we develop a general algorithm that, given a black-box access to a subroutine satisfying a certain $\alpha$ primal-dual accuracy guarantee with respect to the empirical objective, gives a solution to the stochastic saddle point problem with a strong gap of $\tilde{O}(\alpha+\frac{1}{\sqrt{n}})$ . We show that this $\alpha$ -accuracy condition is satisfied by standard algorithms for the empirical saddle point problem such as the proximal point method and the stochastic gradient descent ascent algorithm. Finally, to emphasize the importance of the strong gap as a convergence criterion compared to the weaker notion of primal-dual gap, commonly known as the weak gap, we show that even for simple problems it is possible for an algorithm to have zero weak gap and suffer from $\Omega(1)$ strong gap. We also show that there exists a fundamental tradeoff between stability and accuracy. Specifically, we show that any $\Delta$ -stable algorithm has empirical gap $\Omega\big{(}\frac{1}{\Delta n}\big{)}$ , and that this bound is tight. 
This result also holds more specifically for empirical risk minimization problems and may be of independent interest.

1 Introduction

Stochastic (convex-concave) saddle point problems (SSP), also referred to in the literature as stochastic minimax optimization problems, are an increasingly important model for modern machine learning, arising in areas such as stochastic optimization [27, 19, 39], robust statistics [37], and algorithmic fairness [25, 35]. (Footnote 1: In this work, we focus exclusively on the case where the function of interest for the stochastic saddle-point problem is convex-concave, and therefore we omit this qualifier from the problem name.)

On the other hand, the reliance of modern machine learning on large datasets has led to concerns about user privacy. These concerns in turn have led to a variety of privacy standards, of which differential privacy (DP) has become the premier standard. However, for a variety of machine learning problems it is known that their differentially private counterparts have provably worse rates. As such, characterizing the fundamental cost of differential privacy has become an important problem.

Currently, the theory of solving SSPs under differential privacy has major limitations, compared to its non-private counterpart. To illustrate this point, we need to discuss the notions of accuracy used in the literature. In SSPs, the goal is to find an approximate solution of the problem

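$$\min_{w\in\mathcal{W}}\max_{\theta\in\Theta}\Big\{F_{\mathcal{D}}(w,\theta):=\mathbb{E}_{x\sim\mathcal{D}}[f(w,\theta;x)]\Big\},\tag{1}$$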

where ${\cal D}$ is an unknown distribution for which we have access to an i.i.d. sample $S$. Given a (randomized) algorithm ${\cal A}$ with output $[{\cal A}_{w}(S),{\cal A}_{\theta}(S)]\in\mathcal{W}\times\Theta$, two studied measures of performance are the strong and weak gap (Footnote 2: the weak gap is sometimes stated with $\mathbb{E}_{\mathcal{A}}[\cdot]$ taken inside the max; however, [7] showed this is not necessary to obtain the "stability implies generalization" result used in various works), defined respectively as

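$$\operatorname{Gap}(\mathcal{A})=\underset{\mathcal{A},S}{\mathbb{E}}\Big[\max_{\theta\in\Theta}\big\{F_{\mathcal{D}}(\mathcal{A}_{w}(S),\theta)\big\}-\min_{w\in\mathcal{W}}\big\{F_{\mathcal{D}}(w,\mathcal{A}_{\theta}(S))\big\}\Big],\tag{2}$$

$$\operatorname{Gap}_{\mathsf{weak}}(\mathcal{A})=\max_{\theta\in\Theta}\Big\{\underset{\mathcal{A},S}{\mathbb{E}}\big[F_{\mathcal{D}}(\mathcal{A}_{w}(S),\theta)\big]\Big\}-\min_{w\in\mathcal{W}}\Big\{\underset{\mathcal{A},S}{\mathbb{E}}\big[F_{\mathcal{D}}(w,\mathcal{A}_{\theta}(S))\big]\Big\}.\tag{3}$$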

It is easy to see that the strong gap upper bounds the weak gap, and thus it is a stronger accuracy measure. On the other hand, even for simple problems, the difference between these measures can be $\Omega(1)$ ; a fact we elaborate on in Section 5. We also note that the strong gap has a clear game-theoretic interpretation: if we consider ${\cal A}_{w}(S)$ and ${\cal A}_{\theta}(S)$ as the actions of two players in a (stochastic) zero-sum game, the strong gap upper bounds the most profitable unilateral deviation for either of the two players. In game theory this is known as an approximate Nash equilibrium. By contrast, there is no general guarantee associated with the weak gap.
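To make the distinction between the two gaps concrete, the following minimal sketch evaluates both on a toy problem. The objective $f(w,\theta)=w\theta$ on $[-1,1]^{2}$ and the data-independent "corner-picking" algorithm are assumptions chosen for illustration; they are not necessarily the construction used in Section 5.

```python
from itertools import product

# Toy objective (an illustrative assumption): f(w, theta) = w * theta on
# W = Theta = [-1, 1]. It does not depend on the data, so F_D = f.
def F(w, theta):
    return w * theta

# A bilinear function on an interval attains its max and min at the endpoints.
ENDPOINTS = (-1.0, 1.0)

def gap_function(w_bar, theta_bar):
    """Gap of a fixed point: max_theta F(w_bar, theta) - min_w F(w, theta_bar)."""
    return max(F(w_bar, t) for t in ENDPOINTS) - min(F(w, theta_bar) for w in ENDPOINTS)

# Hypothetical algorithm: output each corner of {-1, 1}^2 with probability 1/4.
outputs = list(product(ENDPOINTS, ENDPOINTS))
prob = 1.0 / len(outputs)

# Strong gap: expectation of the gap function over the algorithm's randomness.
strong_gap = sum(prob * gap_function(w, t) for w, t in outputs)

# Weak gap: the expectation is taken before the max and the min.
weak_gap = (
    max(sum(prob * F(w, t) for w, _ in outputs) for t in ENDPOINTS)
    - min(sum(prob * F(w, t) for _, t in outputs) for w in ENDPOINTS)
)

print(strong_gap, weak_gap)
```

Every output of this algorithm is far from the saddle point $(0,0)$, yet the averaging inside the weak gap cancels the errors, which is why the weak gap alone gives no guarantee about any single output.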

Non-privately, it is known how to achieve optimal rates w.r.t. the strong gap, and those rates are similar to those established for stochastic convex optimization (SCO) [27, 19]. However, for DP methods optimal rates are only known for the weak gap [7, 36, 40]. In a nutshell, the main limitation of these approaches is that, in order to amplify privacy, they make multiple passes over the data (e.g., by sampling stochastic gradients from the dataset with replacement), and the existing theory of generalization for SSPs is much more limited than it is for SCO [38, 23, 30]. Our approach largely circumvents the current limitations of generalization theory for SSPs, providing the first nearly optimal rates for the strong gap in DP-SSP.

1.1 Contributions

In this work, we establish the optimal rates on the strong gap for DP-SSP. In the following, we let $n$ be the number of samples, $d$ be the dimension, and $\epsilon,\delta$ be the privacy parameters. Our main result is an $(\epsilon,\delta)$-DP algorithm for SSP whose strong gap is $\tilde{O}\big{(}\frac{1}{\sqrt{n}}+\frac{\sqrt{d}}{n\epsilon}\big{)}$. This rate is nearly optimal, due to matching lower bounds for differentially private SCO [9, 6]. These minimization lower bounds hold for saddle point problems since minimization problems are a special case of saddle point problems when $\Theta$ is constrained to be a singleton. For non-smooth loss functions, we show this rate can be obtained with gradient complexity $O\big{(}\min\big{\{}\frac{n^{2}\epsilon^{1.5}}{\sqrt{d}},n^{3/2}\big{\}}\big{)}$. This improves even upon the previous best known running time for achieving analogous rates on the weak gap, which was $n^{5/2}$ [36]. Furthermore, we show that if the loss function is smooth, this rate can be achieved in nearly linear gradient complexity.
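As a rough numerical illustration of these expressions, the sketch below evaluates the stated rate and the non-smooth gradient complexity with all constants and logarithmic factors dropped. The function names are hypothetical and the outputs are orders of magnitude only, not guarantees.

```python
import math

def strong_gap_rate(n, d, eps):
    """1/sqrt(n) + sqrt(d)/(n*eps), ignoring constants and log factors."""
    return 1.0 / math.sqrt(n) + math.sqrt(d) / (n * eps)

def nonsmooth_gradient_complexity(n, d, eps):
    """min(n^2 * eps^1.5 / sqrt(d), n^1.5), ignoring constants."""
    return min(n**2 * eps**1.5 / math.sqrt(d), n**1.5)

# Example values (assumptions): n = 10,000 samples, d = 100, eps = 1.
print(strong_gap_rate(10_000, 100, 1.0))
print(nonsmooth_gradient_complexity(10_000, 100, 1.0))
```

The first term of the minimum is the smaller one exactly when $d>n\epsilon^{3}$, so the $n^{3/2}$ term governs the low-dimensional regime in this example.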

In order to obtain an upper bound for this problem, we present a novel analysis of the recursive regularization algorithm of [4]. Our work is the first to show how the sequential regularization approach can be repurposed to provide an algorithmic framework for attaining optimal strong gap guarantees for DP-SSP. As a byproduct of our analysis, we show that empirical saddle point solvers which satisfy a certain $\alpha$ accuracy guarantee can be used as a black box to obtain an $\tilde{O}\left({\alpha+1/\sqrt{n}}\right)$ guarantee on the strong (population) gap. This class of algorithms includes common techniques such as the proximal point method, the extragradient method, and stochastic gradient descent ascent (SGDA) [24, 26, 19]. This fact may be of interest independent of differential privacy, as to the best of our knowledge, existing algorithms which achieve the optimal $1/{\sqrt{n}}$ rate on the strong population gap rely crucially on a one-pass structure which optimizes the population gap directly [27].

Under the additional assumption that the loss function is smooth, we show that it is possible to use recursive regularization to obtain the optimal strong gap rate in nearly linear time. We here leverage accelerated algorithms for smooth and strongly convex/strongly concave loss functions [31, 20].

Our results stand in contrast to previous work on DP-SSPs, which has achieved optimal rates only for the weak gap and has crucially relied on “stability implies generalization” results for the weak gap. In this vein, we prove that even for simple problems, the strong and weak gap may differ by $\Theta(1)$. We also elucidate the challenges of extending existing techniques to strong gap guarantees by showing a fundamental tradeoff between stability and empirical accuracy. Specifically, we show that even for the more specific case of empirical risk minimization, any algorithm which is $\Delta$-uniform argument stable must have empirical risk $\Omega\left({\frac{1}{\Delta n}}\right)$. We also show this bound is tight, and note that it may be of independent interest. Such a tradeoff was also investigated by [10], but their result only implies such a tradeoff for the specific case of $\Delta=\frac{1}{\sqrt{n}}$ and their proof technique is unrelated to ours.

1.2 Related Work

Differentially private stochastic optimization has been extensively studied for over a decade [18, 9, 21, 34, 6, 13, 3, 8]. Among such problems, stochastic convex minimization (where problem parameters are measured in the $\ell_{2}$ -norm) is perhaps the most widely studied, where it is known the optimal rate is $\tilde{O}(\frac{1}{\sqrt{n}}+\frac{\sqrt{d}}{n\epsilon})$ [6, 9]. Further, under smoothness assumptions such rates can be obtained in linear (in the sample size) gradient complexity [14]. Without smoothness, no linear time algorithms which achieve the optimal rates are known [22].

The study of stochastic saddle point problems under differential privacy is comparatively newer. In the non-private setting, optimal $O(1/\sqrt{n})$ guarantees on the strong gap have been known as far back as [29]. Under privacy (without strong convexity/strong concavity), optimal rates are known only for the weak gap. These rates $\tilde{O}(\frac{1}{\sqrt{n}}+\frac{\sqrt{d}}{n\epsilon})$ have been obtained by several works [7, 36, 40]. The work of [40] additionally showed that under smoothness assumptions such a result could be obtained in near linear gradient complexity by leveraging accelerated methods [20, 31]. All of these results are for the weak gap and they rely crucially on the fact that, for the weak gap, $\Delta$ -stability implies $\Delta$ -generalization [38].

By contrast, for the strong gap (without strong convexity/strong concavity assumptions), the best stability implies generalization result is a $\sqrt{\Delta}$ bound obtained by [30] provided the loss is smooth. As a result of this discrepancy, known bounds on the strong gap under privacy are worse. The best known rates for the strong gap are $O\left({\min\left({\frac{d^{1/4}}{\sqrt{n\epsilon}},\frac{1}{n^{1/3}}+\frac{\sqrt{d}}{n^{2/3}\epsilon}}\right)}\right)$ [7]. This rate was obtained through a mixture of noisy stochastic extragradient and noisy inexact proximal point methods, avoiding stability arguments altogether and instead relying on one-pass algorithms which optimize the population loss directly. Without smoothness, we are not aware of any work which provides bounds on the strong gap under privacy, but one may note that a straightforward implementation of one-pass noisy SGDA leads to a rate of $O\big{(}\frac{\sqrt{d}}{\sqrt{n}\epsilon}\big{)}$ in this setting. We give these details in Appendix A.2 and note this same algorithm establishes the optimal rate for SSPs under local differential privacy.

Finally, under the stringent assumptions of $\mu$ -strong convexity/strong concavity ( $\mu$ -SC/SC) and smoothness with constant condition number, $\kappa$ , optimal rates on the strong gap have been obtained [40]. Under these assumptions, the optimal rate of $O\big{(}\frac{1}{\mu n}+\frac{d}{\mu n^{2}\epsilon^{2}}\big{)}$ was achieved by leveraging the fact that $\Delta$ stability implies $\kappa\Delta$ generalization [38]. The lower bound for this rate comes from lower bounds for the minimization setting [17, 6].

2 Preliminaries

Throughout, we consider the space $\mathbb{R}^{d}$ endowed with the standard $\ell_{2}$ norm $\|\cdot\|$. Let the primal parameter space $\mathcal{W}$ and the dual parameter space $\Theta$ be compact convex sets such that $\mathcal{W}\times\Theta\subset\mathbb{R}^{d}$ for some $d>0$. Let $\mathcal{D}$ be some distribution over data domain $\mathcal{X}$. Consider the stochastic saddle-point problem given in equation (1) for some loss function $f$ that is convex w.r.t. $w$ and concave w.r.t. $\theta$. We define the corresponding population loss and empirical loss functions as $F_{\mathcal{D}}(w,\theta)=\underset{x\sim\mathcal{D}}{\mathbb{E}}\left[f(w,\theta;x)\right]$ and $F_{S}(w,\theta)=\frac{1}{n}\sum_{x\in S}f(w,\theta;x)$ respectively. For some $B>0$ we assume that $\max_{u,u^{\prime}\in\mathcal{W}\times\Theta}\left\|u-u^{\prime}\right\|\leq B$. To simplify notation, for vectors $w\in\mathcal{W}$ and $\theta\in\Theta$, we will use $[w,\theta]$ to denote their concatenation, noting $[w,\theta]$ is a vector in $\mathbb{R}^{d}$. We primarily consider the case where $f$ is $L$-Lipschitz, but will also consider the additional assumption of $\beta$-smoothness for certain results. (Footnote 3: Throughout, any properties of $f$ are considered as a function of $[w,\theta]$; no assumptions about $f$ w.r.t. $x$ are made.) Specifically, these assumptions are that $\forall w_{1},w_{2}\in\mathcal{W}$ and $\forall\theta_{1},\theta_{2}\in\Theta$:

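$$|f(w_{1},\theta_{1};x)-f(w_{2},\theta_{2};x)|\leq L\left\|[w_{1},\theta_{1}]-[w_{2},\theta_{2}]\right\|,$$

$$\left\|\nabla_{[w,\theta]}f(w_{1},\theta_{1};x)-\nabla_{[w,\theta]}f(w_{2},\theta_{2};x)\right\|\leq\beta\left\|[w_{1},\theta_{1}]-[w_{2},\theta_{2}]\right\|.$$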

Under such assumptions (in fact, smoothness is not necessary), a solution for problem (1) always exists [33]; we call such a solution a saddle point from here on. Further, given an SSP (1), we will denote a saddle point as $[w^{\ast},\theta^{\ast}]$.

Gap functions

In addition to the strong and weak gap functions defined in equations (2) and (3), it will be useful to define the following gap function expressed as a function of the parameter vector instead of the algorithm, $\widehat{\operatorname{Gap}}(\bar{w},\bar{\theta})=\max_{\theta\in\Theta}\left\{{F_{\mathcal{D}}(\bar{w},\theta)}\right\}-\min_{w\in\mathcal{W}}\left\{{{F_{\mathcal{D}}(w,\bar{\theta})}}\right\}.$

We have the following useful fact regarding $\widehat{\operatorname{Gap}}$ (see Appendix A for a proof).

Fact 1.

If $f$ is $L$ -Lipschitz then $\widehat{\operatorname{Gap}}$ is $\sqrt{2}L$ -Lipschitz.

Note the strong gap can be written as an expectation of the gap function. Further, since the gap function is zero if and only if $(\bar{w},\bar{\theta})$ is a solution for problem (1), the strong gap is considered the most suitable measure of accuracy for SSPs [28, 19]. We also define the empirical gap as $\operatorname{Gap}_{S}(\mathcal{A})=\underset{\mathcal{A}}{\mathbb{E}}\left[\max_{\theta\in\Theta}\left\{{F_{S}(\mathcal{A}_{w}(S),\theta)}\right\}-\min_{w\in\mathcal{W}}\left\{{{F_{S}(w,\mathcal{A}_{\theta}(S))}}\right\}\right].$ We will consider at various points the notion of generalization error with respect to the strong/weak gap, which refers to the difference between the strong/weak gap and the empirical gap. Note that because the empirical gap treats the dataset as a fixed quantity, there are no separate strong and weak versions of the empirical gap.

Saddle Operator

Define the saddle operator as $g(w,\theta;x)=[\nabla_{w}f(w,\theta;x),-\nabla_{\theta}f(w,\theta;x)].$ Similarly define $G_{\mathcal{D}}(w,\theta)=\mathbb{E}_{x\sim\mathcal{D}}[g(w,\theta;x)]$ and $G_{S}(w,\theta)=\frac{1}{n}\sum_{x\in S}g(w,\theta;x)$. Note that the assumption on the smoothness of $f$ implies the Lipschitzness of $g$. Since the saddle operator can be computed using one computation of the gradient, we use the terms saddle operator complexity and gradient complexity interchangeably when discussing the running time of our algorithms.
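The saddle operator can be illustrated with a short sketch. The toy loss $f(w,\theta;x)=\theta(w-x)$ on $[-1,1]^{2}$, the helper names, and the Gaussian noise scale `sigma` are assumptions for illustration; in particular, the sketch does not calibrate `sigma` to any privacy level.

```python
import random

def saddle_operator(w, theta, x):
    """g(w, theta; x) = [grad_w f, -grad_theta f] for the toy loss
    f(w, theta; x) = theta * (w - x)  (an illustrative assumption)."""
    grad_w = theta          # d/dw     of theta * (w - x)
    grad_theta = w - x      # d/dtheta of theta * (w - x)
    return (grad_w, -grad_theta)

def project(v, lo=-1.0, hi=1.0):
    """Euclidean projection onto the interval [lo, hi]."""
    return max(lo, min(hi, v))

def noisy_sgda_step(w, theta, x, step, sigma, rng):
    """One projected step along -(g + noise): descent in w, ascent in theta."""
    g_w, g_theta = saddle_operator(w, theta, x)
    w_new = project(w - step * (g_w + rng.gauss(0.0, sigma)))
    theta_new = project(theta - step * (g_theta + rng.gauss(0.0, sigma)))
    return w_new, theta_new

rng = random.Random(0)
print(noisy_sgda_step(0.5, 0.2, x=0.1, step=0.1, sigma=0.0, rng=rng))
```

Because the dual block of $g$ carries a minus sign, a single update of the form "subtract the operator" performs descent on $w$ and ascent on $\theta$ at once, at the cost of one gradient evaluation.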

Stability

We will also use the notion of uniform argument stability frequently in our analysis [5].

Definition 1.

A randomized algorithm $\mathcal{A}:\mathcal{X}^{n}\mapsto\mathcal{W}\times\Theta$ satisfies $\Delta$ -uniform argument stability if for any pair of adjacent datasets $S,S^{\prime}\in\mathcal{X}^{n}$ it holds that $\underset{\mathcal{A}}{\mathbb{E}}\left[\left\|\mathcal{A}(S)-\mathcal{A}(S^{\prime})\right\|\right]\leq\Delta$ .

A fact we will use is that the (constrained) regularized saddle-point is stable. Specifically, for some $\hat{w}\in\mathcal{W}$ , $\hat{\theta}\in\Theta$ , and $\lambda\geq 0$ consider the regularized objective function

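$$(w,\theta)\mapsto\frac{1}{n}\sum_{z\in S}f(w,\theta;z)+\frac{\lambda}{2}\left\|w-\hat{w}\right\|^{2}-\frac{\lambda}{2}\|\theta-\hat{\theta}\|^{2}.\tag{4}$$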

It is easy to see that this problem has a unique saddle point. The mapping which selects its output according to the unique solution of (4) has the following stability property.

Lemma 1.

[38, Lemma 1] The algorithm which outputs the regularized saddle point with parameters $\lambda>0$, $\hat{w}\in\mathcal{W}$ and $\hat{\theta}\in\Theta$, is $\big{(}\frac{2L}{\lambda n}\big{)}$-uniform argument stable w.r.t. $S$.
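The stability claim can be checked numerically on a toy instance where the regularized saddle point has a closed form. The loss $f(w,\theta;x)=\theta(w-x)$ with $x,w,\theta\in[-1,1]$ and $\hat{w}=\hat{\theta}=0$ is an assumption for illustration; for it, $L=\sqrt{5}$ and the unconstrained saddle point already lies in the feasible set.

```python
import math

def regularized_saddle(data, lam):
    """Saddle point of theta*(w - mean) + lam/2 * w^2 - lam/2 * theta^2.
    Setting both partial derivatives to zero gives
    w = mean / (1 + lam^2) and theta = -lam * mean / (1 + lam^2)."""
    x_bar = sum(data) / len(data)
    w = x_bar / (1.0 + lam**2)
    theta = -lam * x_bar / (1.0 + lam**2)
    return w, theta

n, lam = 50, 0.5
S = [0.2] * n
S_adj = [0.2] * (n - 1) + [-1.0]   # adjacent dataset: one point replaced

w1, t1 = regularized_saddle(S, lam)
w2, t2 = regularized_saddle(S_adj, lam)
moved = math.hypot(w1 - w2, t1 - t2)

# |grad_w f| <= 1 and |grad_theta f| <= 2 on this domain, so L = sqrt(5).
L = math.sqrt(5.0)
bound = 2.0 * L / (lam * n)

print(moved, bound)
```

In this instance the observed movement is well inside the $\frac{2L}{\lambda n}$ bound, which is a worst-case guarantee over all adjacent datasets and all losses in the class.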

In addition to the stability of the regularized saddle point, we will also frequently use the following fact.

Lemma 2.

[38, Theorem 1]

Let $h:\mathcal{W}\times\Theta\mapsto\mathbb{R}$ be $\lambda$-SC/SC with saddle point $[w^{*},\theta^{*}]$ and gap function $\widehat{\operatorname{Gap}}^{h}$. For any $[w,\theta]\in\mathcal{W}\times\Theta$ it holds that $\left\|[w,\theta]-[w^{*},\theta^{*}]\right\|^{2}\leq\frac{2(h(w,\theta^{*})-h(w^{*},\theta))}{\lambda}\leq\frac{2}{\lambda}\widehat{\operatorname{Gap}}^{h}(w,\theta)$.

Differential Privacy (DP) [12]:

An algorithm $\mathcal{A}$ is $(\epsilon,\delta)$ -differentially private if for all datasets $S$ and $S^{\prime}$ differing in one data point and all events $\mathcal{E}$ in the range of the $\mathcal{A}$ , we have, $\mathbb{P}\left({\mathcal{A}(S)\in\mathcal{E}}\right)\leq e^{\epsilon}\mathbb{P}\left({\mathcal{A}(S^{\prime})\in\mathcal{E}}\right)+\delta$ .
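The definition can be checked exhaustively for a mechanism with a finite output range. The sketch below uses randomized response on a single bit, which is a standard textbook mechanism chosen here as an assumption for illustration and is not one of the algorithms in this paper.

```python
import math

def randomized_response_dist(bit, mech_eps):
    """Output distribution: report the true bit w.p. e^eps / (1 + e^eps)."""
    p_truth = math.exp(mech_eps) / (1.0 + math.exp(mech_eps))
    return {bit: p_truth, 1 - bit: 1.0 - p_truth}

def satisfies_dp(mech_eps, eps, delta=0.0, tol=1e-12):
    """Check P(A(S) in E) <= e^eps * P(A(S') in E) + delta for every event E
    and both orderings of the two adjacent one-bit datasets."""
    events = [set(), {0}, {1}, {0, 1}]
    for s, s_adj in [(0, 1), (1, 0)]:
        p = randomized_response_dist(s, mech_eps)
        q = randomized_response_dist(s_adj, mech_eps)
        for event in events:
            lhs = sum(p[o] for o in event)
            rhs = math.exp(eps) * sum(q[o] for o in event) + delta
            if lhs > rhs + tol:
                return False
    return True

print(satisfies_dp(mech_eps=1.0, eps=1.0))
print(satisfies_dp(mech_eps=2.0, eps=1.0))
```

The small tolerance `tol` only absorbs floating-point rounding in the case where the inequality holds with equality.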

3 From Empirical Saddle Point to Strong Gap Guarantee via Recursive Regularization

Our approach for obtaining near optimal rates on the strong gap leverages the recursive regularization technique of [4]. In addition to adapting this algorithm to fit SSP problems, we also provide a novel analysis which differs substantially from the analysis presented in previous work [16, 1].

Our recursive regularization algorithm works by solving a series of regularized objectives, $f^{(1)},...,f^{(T)}$ , with increasingly large regularization parameters. Specifically, after solving the $t$ ’th objective to obtain $[\bar{w}_{t},\bar{\theta}_{t}]$ , the algorithm creates a new objective which is $f^{(t+1)}(w,\theta;x)=f^{(t)}(w,\theta;x)+2^{t+1}\lambda\left\|w-\bar{w}_{t}\right\|^{2}-2^{t+1}\lambda\left\|\theta-\bar{\theta}_{t}\right\|^{2}$ for the subsequent round. Notice that each subsequent objective is easier in the sense that the strong convexity parameter is larger.
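The loop described above can be sketched on a toy problem. Everything specific here is an assumption for illustration: the loss $f(w,\theta;x)=\theta(w-x)$ on $[-1,1]^{2}$, the form of $f^{(1)}$ (regularized around the starting point with coefficient $2\lambda$), the use of disjoint equal-sized batches, and the plain projected gradient descent ascent routine standing in for the empirical solver. No noise is added, so the sketch is not private.

```python
import random

def project(v, lo=-1.0, hi=1.0):
    return max(lo, min(hi, v))

def gda_solver(batch, regs, start, steps=2000, lr=0.05):
    """Stand-in empirical solver: projected gradient descent ascent on
    th * (w - mean(batch)) + sum_k c_k * ((w - a_k)^2 - (th - b_k)^2),
    where regs is a list of (c_k, a_k, b_k) regularization terms."""
    x_bar = sum(batch) / len(batch)
    w, th = start
    for _ in range(steps):
        grad_w = th + sum(2.0 * c * (w - a) for c, a, _ in regs)
        grad_th = (w - x_bar) - sum(2.0 * c * (th - b) for c, _, b in regs)
        w, th = project(w - lr * grad_w), project(th + lr * grad_th)
    return w, th

def recursive_regularization(data, lam, T, start=(0.0, 0.0)):
    n_prime = len(data) // T                      # one disjoint batch per round
    point = start
    regs = [(2.0 * lam, start[0], start[1])]      # assumed form of f^(1)
    for t in range(1, T + 1):
        batch = data[(t - 1) * n_prime : t * n_prime]
        point = gda_solver(batch, regs, point)    # approximately solve f^(t)
        # f^(t+1) = f^(t) + 2^(t+1) * lam * (|w - w_t|^2 - |th - th_t|^2)
        regs.append((2.0 ** (t + 1) * lam, point[0], point[1]))
    return point, regs

rng = random.Random(0)
data = [0.3 + rng.uniform(-0.2, 0.2) for _ in range(400)]
(w_out, th_out), regs = recursive_regularization(data, lam=0.05, T=4)
print(w_out, th_out, [c for c, _, _ in regs])
```

Each round doubles the new regularization coefficient and re-centers it at the latest iterate, so later subproblems are more strongly convex-concave and their solutions stay close to the previous output.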

Our analysis will leverage the fact that approximate solutions to intermediate objectives do not need to obtain good bounds on the strong gap for the regularization parameter to be increased. This is in contrast to, for example, the iterative regularization technique of [40], which finds $[w,\theta]$ that satisfies a near optimal (weak) gap bound before adding noise.

Empirical Subroutine

Recursive regularization utilizes a subroutine, $\mathcal{A}_{\mathsf{emp}}$, which is roughly an approximate empirical saddle point solver. In addition to a dataset and Lipschitz loss function, $\mathcal{A}_{\mathsf{emp}}$ takes as input an initial point and a bound, $\hat{D}$, on the expected distance between the initial point and the saddle point of the empirical loss defined over the input dataset. At round $t\in[T]$ this distance is bounded by $\frac{B}{2^{t}}$, allowing the algorithm to obtain increasingly strong accuracy guarantees for each subproblem. Note also it can be verified that for all $t\in[T]$, $f^{(t)}$ is $O(L)$-Lipschitz due to the scaling of the regularization. Specifically, the accuracy guarantee of interest is the following.

Definition 2 ( $\hat{\alpha}$ -relative accuracy).

Given a dataset $S^{\prime}\in\mathcal{X}^{n^{\prime}}$, loss function $f^{\prime}$, and an initial point $[w^{\prime},\theta^{\prime}],$ we say that $\mathcal{A}_{\mathsf{emp}}$ satisfies $\hat{\alpha}$-relative accuracy w.r.t. the empirical saddle point $[w_{S^{\prime}}^{*},\theta_{S^{\prime}}^{*}]$ of $F^{\prime}_{S^{\prime}}(w,\theta)=\frac{1}{n^{\prime}}\sum_{x\in{S^{\prime}}}f^{\prime}(w,\theta;x)$ if, $\forall\hat{D}>0$, whenever $\underset{}{\mathbb{E}}\left[\left\|[w^{\prime},\theta^{\prime}]-[w^{*}_{S^{\prime}},\theta^{*}_{S^{\prime}}]\right\|\right]\leq\hat{D}$, the output $[\bar{w},\bar{\theta}]$ of $\mathcal{A}_{\mathsf{emp}}$ satisfies $\underset{}{\mathbb{E}}\left[F^{\prime}_{S^{\prime}}(\bar{w},\theta^{*}_{S^{\prime}})-F^{\prime}_{S^{\prime}}(w^{*}_{S^{\prime}},\bar{\theta})\right]\leq\hat{D}\hat{\alpha}$.

The relative accuracy guarantee for $\mathcal{A}_{\mathsf{emp}}$ differs from the more standard gap guarantee, and is not necessarily implied by a bound on the empirical gap. The motivation for this notion of accuracy is twofold. First, when the loss function is additionally SC/SC, this guarantee is sufficient to provide a bound on the distance between the output of $\mathcal{A}_{\mathsf{emp}}$ and the saddle point, which will play a crucial role in our convergence proof for Algorithm 1. Second, while it is certainly true that a bound on the empirical gap implies the same bound on $\underset{}{\mathbb{E}}\left[F_{S}(\bar{w},\theta)-F_{S}(w,\bar{\theta})\right]$, for any given $[w,\theta]$, it is not necessarily the case that the gap itself may enjoy a bound that is proportional to the initial distance to the saddle point. (Footnote 4: [15, Theorem 4] claims such a bound on the primal risk, but this is due to a misapplication of [24, Lemma 2].) The reason is that the gap function is defined by a supremum that is taken w.r.t. the whole feasible set $\mathcal{W}\times\Theta$, and thus the information of the evaluation of the objective w.r.t. particular points is lost. However, it is usually the case that saddle point solvers provide a bound of the form $F_{S}(\bar{w},\theta)-F_{S}(w,\bar{\theta})\leq\left\|[w,\theta]-[w^{\prime},\theta^{\prime}]\right\|\hat{\alpha}$, for all $[w,\theta]\in\mathcal{W}\times\Theta$, and some initial point $[w^{\prime},\theta^{\prime}]\in\mathcal{W}\times\Theta$. Algorithms such as the proximal point method, extragradient method, and SGDA (with appropriately tuned learning rate) satisfy this condition, and thus satisfy the condition for relative accuracy [24, 26, 19].
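The first motivation can be spelled out in one line. As a sketch, assume $F^{\prime}_{S^{\prime}}$ is $\mu$-SC/SC, where the symbol $\mu$ is introduced here only for illustration. Applying the first inequality of Lemma 2 with $h=F^{\prime}_{S^{\prime}}$ at the output $[\bar{w},\bar{\theta}]$, taking expectations, and then using Definition 2 gives

$$\mathbb{E}\Big[\left\|[\bar{w},\bar{\theta}]-[w^{*}_{S^{\prime}},\theta^{*}_{S^{\prime}}]\right\|^{2}\Big]\leq\frac{2}{\mu}\,\mathbb{E}\Big[F^{\prime}_{S^{\prime}}(\bar{w},\theta^{*}_{S^{\prime}})-F^{\prime}_{S^{\prime}}(w^{*}_{S^{\prime}},\bar{\theta})\Big]\leq\frac{2\hat{D}\hat{\alpha}}{\mu}.$$

So a smaller initial distance bound $\hat{D}$ and a larger strong convexity/concavity parameter both tighten the bound on the distance to the empirical saddle point.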

Guarantees of Recursive Regularization

Given such an algorithm, recursive regularization achieves the following guarantee.

Theorem 1.

Let $\mathcal{A}_{\mathsf{emp}}$ satisfy $\hat{\alpha}$ -relative accuracy for any $(5L)$ -Lipschitz loss function and dataset of size $n^{\prime}=\frac{n}{\log(n)}$ . Then Algorithm 1, run with $\mathcal{A}_{\mathsf{emp}}$ as a subroutine and $\lambda=\frac{48}{B}\left({\hat{\alpha}+\frac{L}{\sqrt{n^{\prime}}}}\right)$ , satisfies

$$\operatorname{Gap}(\mathcal{R})=O\left({\log(n)B\hat{\alpha}+\frac{\log^{3/2}(n)BL}{\sqrt{n}}}\right).$$

Recall that $B$ is a bound on the diameter of the constraint set. In the following, we will sketch the proof of this theorem and highlight key lemmas. We defer the full proof to Appendix B.2. For simplicity, let us here consider the case where $\hat{\alpha}=0$ . A crucial aspect of our proof is that we avoid the need to bound the strong gap of the actual iterates, $\left\{{\bar{w}_{t}}\right\}_{t=1}^{T-1}$ . Instead, we bound the strong gap of the expected iterates, where the expectation is taken with respect to $S_{t}$ . More concretely, consider some $t\in[T]$ and let $\mathcal{B}$ be the algorithm which on input $[\bar{w}_{t-1},\bar{\theta}_{t-1}]$ outputs $\underset{S_{t},\mathcal{A}_{\mathsf{emp}}}{\mathbb{E}}\left[\mathcal{A}_{\mathsf{emp}}(S_{t},f^{t},[\bar{w}_{t-1},\bar{\theta}_{t-1}],\frac{B}{2^{t}})\right]$ . Note $\mathcal{B}$ is deterministic and data independent. As a result, it is possible to prove bounds on the strong gap of $\mathcal{B}$ .

Lemma 3.

Let $S\sim\mathcal{D}^{n}$ . For any $\Delta$ -uniform argument stable algorithm $\mathcal{A}$ , it holds that

[TABLE]

The proof follows straightforwardly from an application of Jensen’s inequality and the “stability implies generalization” result for the weak gap [23, Theorem 1]. We give full details in Appendix B.1. Note that, for this discussion, the LHS of the above is equal to $\operatorname{Gap}(\mathcal{B})$ when we apply this lemma to the data batch $S_{t}$ and subroutine $\mathcal{A}_{\mathsf{emp}}$ .

In fact, running $\mathcal{B}$ is infeasible. Instead, we show that the output of $\mathcal{A}_{\mathsf{emp}}$ is close to the output of $\mathcal{B}$ . This in turn can be accomplished using the fact that bounded stability implies bounded variance. Concretely, we use the vector-valued version of McDiarmid’s inequality.

Lemma 4.

([32, Lemma 6]; although stated therein for the distance, the last step of their proof shows a squared distance bound can be obtained.) Let $\mathcal{A}$ be a deterministic $\Delta$ -uniform argument stable algorithm with respect to $S\sim\mathcal{D}^{n}$ . Then its output satisfies $\mathbb{E}\Big{[}\big{\|}\mathcal{A}(S)-\mathbb{E}_{\hat{S}\sim\mathcal{D}^{n}}\big{[}\mathcal{A}(\hat{S})\big{]}\big{\|}^{2}\Big{]}\leq n\Delta^{2}.$
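As a sanity check of Lemma 4, the following sketch estimates the variance of a simple stable algorithm by simulation. The sample mean over data in $[-1,1]$ is used purely as an illustrative assumption (it is not an algorithm from this paper): replacing one data point moves it by at most $\Delta=2/n$, so the lemma predicts a variance of at most $n\Delta^{2}=4/n$. The function names and the uniform data distribution are likewise hypothetical.

```python
import random


def sample_mean(dataset):
    """A deterministic algorithm that is (2/n)-uniform argument stable on [-1, 1]."""
    return sum(dataset) / len(dataset)


def estimate_variance(n, trials=2000, seed=0):
    """Monte Carlo estimate of E|A(S) - E[A(S')]|^2 for the sample mean."""
    rng = random.Random(seed)
    outputs = []
    for _ in range(trials):
        dataset = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        outputs.append(sample_mean(dataset))
    # The inner expectation is approximated by the average output over all trials.
    center = sum(outputs) / trials
    return sum((out - center) ** 2 for out in outputs) / trials


n = 50
delta_stability = 2.0 / n                 # stability parameter of the sample mean
variance_bound = n * delta_stability ** 2  # right-hand side of Lemma 4
empirical_variance = estimate_variance(n)
print(empirical_variance, "<=", variance_bound)
```

For this toy algorithm the bound is loose (the true variance is $\frac{1}{3n}$), which is expected: the lemma holds for every stable algorithm, not only for averages.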

Observe that the exact empirical saddle point is a deterministic quantity conditioned on the randomness of the $t$ ’th empirical objective. Using the fact that $(2^{t}\lambda)$ -regularization implies $\big{(}\frac{L}{2^{t}\lambda n^{\prime}}\big{)}$ -stability of the empirical saddle point in conjunction with the above lemma, we obtain a (conditional) variance bound of $\frac{L^{2}}{2^{2t}\lambda^{2}n^{\prime}}$ . Under the setting of $\lambda=\Omega(\frac{L}{B\sqrt{n^{\prime}}})$ , we can ultimately prove that the distance between the output of $\mathcal{A}_{\mathsf{emp}}$ and $\mathcal{B}$ (at round $t$ ) is $O(\frac{B}{2^{t}})$ . Since the strong gap of $\mathcal{B}$ with respect to $F_{\mathcal{D}}^{(t)}(w,\theta):=\mathbb{E}_{x\sim{\cal D}}[f^{(t)}(w,\theta;x)]$ is at most $\Delta L=\frac{L^{2}}{2^{t}\lambda n^{\prime}}$ by Lemma 3 (recall we here assume $\hat{\alpha}=0$ for simplicity) and $F_{\mathcal{D}}^{(t)}$ is $(2^{t+1}\lambda)$ -SC/SC, the output of $\mathcal{B}$ must in turn be close to the population saddle point. Specifically, this distance is also bounded as $\big{(}\frac{\Delta L}{2^{t}\lambda}\big{)}^{1/2}=\frac{L}{\sqrt{2^{t}\lambda n^{\prime}}}\frac{1}{\sqrt{2^{t}\lambda}}=O(\frac{B}{2^{t}})$ . Thus we ultimately have that the distance between $[\bar{w}_{t},\bar{\theta}_{t}]$ and the population saddle point of $F^{(t)}_{\mathcal{D}}$ , $[w^{*}_{t},\theta^{*}_{t}]$ , satisfies $\underset{}{\mathbb{E}}\left[\left\|[\bar{w}_{t},\bar{\theta}_{t}]-[w^{*}_{t},\theta^{*}_{t}]\right\|\right]=O(\frac{B}{2^{t}})$ . These ideas also lead to a bound $\underset{}{\mathbb{E}}\left[\left\|[w^{*}_{t+1},\theta^{*}_{t+1}]-[\bar{w}_{t},\bar{\theta}_{t}]\right\|\right]=O(\frac{B}{2^{t}})$ , although the argument in this case is more technical and thus deferred to the full proof.
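The arithmetic behind the $O(\frac{B}{2^{t}})$ distance can be checked exactly at the boundary value $\lambda=\frac{48L}{B\sqrt{n^{\prime}}}$ (the constant 48 is the one used in Theorem 1 with $\hat{\alpha}=0$). The sketch below is only an illustration: the helper name is hypothetical, and $n^{\prime}$ is taken to be a perfect square so that rational arithmetic is exact.

```python
from fractions import Fraction


def variance_bound(L, B, sqrt_n_prime, t):
    """Return L^2 / (2^(2t) * lam^2 * n') for lam = 48 L / (B sqrt(n'))."""
    n_prime = sqrt_n_prime ** 2
    lam = Fraction(48 * L, B * sqrt_n_prime)
    return Fraction(L ** 2) / (Fraction(4) ** t * lam ** 2 * n_prime)


# With this lam the bound collapses to B^2 / (2304 * 2^(2t)), independent of L and n'.
example = variance_bound(L=3, B=7, sqrt_n_prime=10, t=2)
print(example, Fraction(7 ** 2, 2304 * 4 ** 2))
```

Taking square roots gives a distance of $\frac{B}{48\cdot 2^{t}}$, which is the $O(\frac{B}{2^{t}})$ scaling used in the sketch of the proof; a larger $\lambda$ only makes the bound smaller.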

The upshot of this analysis is that as the level of regularization increases, the distance of the iterates to their respective population saddle points decreases in kind. One consequence of this fact is that $\left\|[\bar{w}_{T},\bar{\theta}_{T}]-[w_{T}^{*},\theta_{T}^{*}]\right\|=\tilde{O}\left({\frac{B}{\sqrt{n}}}\right)$ , and thus by the Lipschitzness of the gap function, the output of recursive regularization has a gap bound close to that of $[w^{*}_{T},\theta_{T}^{*}]$ . Turning now towards the utility of $[w^{*}_{T},\theta_{T}^{*}]$ , using the fact that $F_{\mathcal{D}}$ is convex-concave we have

[TABLE]

Further, an expression for $G_{\mathcal{D}}$ can be obtained using the definition of $F_{\mathcal{D}}^{(T)}$ :

[TABLE]

where $G_{\mathcal{D}}^{(T)}$ is the saddle operator of $F_{\mathcal{D}}^{(T)}$ . Plugging the latter into the former and using the Cauchy-Schwarz inequality, the triangle inequality, and the fact that $[w_{T}^{*},\theta_{T}^{*}]$ is the exact saddle point of $F_{\mathcal{D}}^{(T)}$ , one can obtain a bound on the gap in terms of the distances discussed previously.

[TABLE]

where step $(i)$ comes from a triangle inequality and step $(ii)$ is obtained from a series of algebraic manipulations which are expanded upon in the full proof. Finally, in the case where $\hat{\alpha}>0$ , extra steps are required to bound the distance of the output of $\mathcal{A}_{\mathsf{emp}}$ to the exact saddle point of $F_{S}^{(t)}(w,\theta):=\frac{1}{n^{\prime}}\sum_{x\in S_{t}}f^{(t)}(w,\theta;x)$ . This is accomplished using the SC/SC property of $F_{S}^{(t)}$ and the $\hat{\alpha}$ -relative accuracy guarantee of $\mathcal{A}_{\mathsf{emp}}$ .

4 Optimal Strong Gap Rate for DP-SSP

With the guarantees of recursive regularization established, what remains is to show there exist $(\epsilon,\delta)$ -DP algorithms which achieve a sufficient accuracy on the empirical objective. Note this suffices to make the entire recursive regularization algorithm private.

Theorem 2.

Let $\mathcal{A}_{\mathsf{emp}}$ used in Algorithm 1 be $(\epsilon,\delta)$ -DP. Then Algorithm 1 is $(\epsilon,\delta)$ -DP.

This follows from post-processing and the parallel composition theorem for differential privacy, since each call to $\mathcal{A}_{\mathsf{emp}}$ is run on a disjoint partition of the dataset.

4.1 Efficient algorithm for the non-smooth setting

In the non-smooth setting, one can obtain optimal rates on the empirical gap using noisy stochastic gradient descent ascent (noisy SGDA). We give this algorithm in detail in Appendix C.2. More briefly, noisy SGDA starts at $[w_{0},\theta_{0}]\in\mathcal{W}\times\Theta$ and takes parameters $T,\eta>0$ , where $T$ is the number of iterations and $\eta$ is the learning rate. New iterates are obtained via the update rule $[w_{t+1},\theta_{t+1}]=[w_{t},\theta_{t}]-\frac{\eta}{|M_{t}|}\sum_{x\in M_{t}}g(w_{t},\theta_{t};x)+\xi_{t}$ , where $\xi_{0},...,\xi_{T-1}$ are i.i.d. Gaussian noise vectors and $M_{t}$ is a minibatch sampled uniformly with replacement from $S$ . The algorithm then returns the average iterate, $\frac{1}{T}\sum_{t=0}^{T-1}[w_{t},\theta_{t}]$ . Noisy SGDA can be used to obtain the following result.
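The update rule above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the algorithm of Appendix C.2: the projection step, the toy loss $f(w,\theta;x)=\frac{1}{2}w^{2}-\frac{1}{2}\theta^{2}+w\theta-xw$ on $[-1,1]^{2}$, and all function and parameter names are hypothetical, and `noise_std` is left as a free parameter, so no privacy guarantee is claimed for the values used here.

```python
import random


def noisy_sgda(dataset, saddle_operator, start, num_iters, step_size,
               batch_size, noise_std, project, seed=0):
    """Average iterate of SGDA with Gaussian noise added to every update."""
    rng = random.Random(seed)
    point = list(start)
    running_sum = [0.0] * len(point)
    for _ in range(num_iters):
        # The returned average is over the iterates t = 0, ..., T-1.
        running_sum = [s + p for s, p in zip(running_sum, point)]
        # Minibatch sampled uniformly with replacement from the dataset.
        batch = [rng.choice(dataset) for _ in range(batch_size)]
        grad_sum = [0.0] * len(point)
        for x in batch:
            grad_sum = [a + b for a, b in zip(grad_sum, saddle_operator(point, x))]
        point = [p - step_size * g_sum / batch_size + rng.gauss(0.0, noise_std)
                 for p, g_sum in zip(point, grad_sum)]
        # Assumption: iterates are projected back onto the constraint set.
        point = project(point)
    return [s / num_iters for s in running_sum]


def toy_saddle_operator(point, x):
    """g = [df/dw, -df/dtheta] for f = w^2/2 - theta^2/2 + w*theta - x*w."""
    w, theta = point
    return [w + theta - x, theta - w]


def clip_to_box(point):
    """Euclidean projection onto the box [-1, 1]^2."""
    return [max(-1.0, min(1.0, c)) for c in point]


toy_data = [0.2, 0.4] * 10  # empirical mean 0.3, so the empirical saddle point is (0.15, 0.15)
average_iterate = noisy_sgda(toy_data, toy_saddle_operator, start=[0.0, 0.0],
                             num_iters=500, step_size=0.1, batch_size=4,
                             noise_std=0.01, project=clip_to_box)
print(average_iterate)
```

The toy loss is strongly convex-strongly concave, so the average iterate settles near the empirical saddle point; for a merely convex-concave loss only a gap guarantee should be expected.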

Lemma 5.

There exists an $(\epsilon,\delta)$ -DP algorithm which satisfies $\hat{\alpha}$ -relative accuracy with $\hat{\alpha}=O\left({\frac{\log(n)L\sqrt{d\log(1/\delta)}}{n\epsilon}}\right)$ and runs in $O\left({\min\left\{{\frac{n^{2}\epsilon^{1.5}}{\log^{2}(n)\sqrt{d\log(1/\delta)}},\frac{n^{3/2}}{\log^{3/2}(n)}}\right\}}\right)$ gradient evaluations.

Applying Theorem 1 then yields a near optimal rate on the strong gap.

Corollary 1.

There exists an algorithm, $\mathcal{R}$ , which is $(\epsilon,\delta)$ -DP, has gradient evaluations bounded by $O\big{(}\min\big{\{}\frac{n^{2}\epsilon^{1.5}}{\log(n)\sqrt{d\log(1/\delta)}},\frac{n^{3/2}}{\sqrt{\log(n)}}\big{\}}\big{)}$ , and satisfies

[TABLE]

4.2 Near linear time algorithm for the smooth setting

In the smooth setting, we can achieve the optimal rate in nearly linear time. Our result leverages accelerated algorithms for smooth and strongly convex-strongly concave saddle point problems [20, 31].

Lemma 6.

(JST [20, Theorem 3, Corollary 41]) Let $f:\mathcal{W}\times\Theta\times\mathcal{X}\mapsto\mathbb{R}$ be $\beta$ -smooth and $\alpha>0$ . Let both $h_{w}:\mathcal{W}\mapsto\mathbb{R}$ and $h_{\theta}:\Theta\mapsto\mathbb{R}$ be $c_{1}\mu$ -strongly convex and $c_{2}\mu$ -smooth functions for some $\mu>0$ and constants $c_{1},c_{2}$ . Consider the objective $F_{h}(w,\theta;S)=\sum_{t=1}^{T}f(w,\theta;S)+h_{w}(w)-h_{\theta}(\theta)$ . Then there exists an algorithm which finds an approximate saddle point of $F_{h}$ with empirical gap at most $\alpha$ in $O\left({\kappa\log(\kappa)\log(\frac{\kappa BL}{\alpha})}\right)$ gradient evaluations, where $\kappa=O(n+\sqrt{n}(1+\beta/\mu))$ .

Given this, we consider the following implementation of $\mathcal{A}_{\mathsf{emp}}$ . Define $[w_{S,t}^{*},\theta_{S,t}^{*}]$ to be the saddle point of $F^{(t)}(w,\theta)=\frac{1}{n}\sum_{x\in S_{t}}f^{(t)}(w,\theta;x)$ for all $t\in[T]$ . At round $t\in[T]$ , find a point $[\hat{w}_{t},\hat{\theta}_{t}]$ such that $\underset{}{\mathbb{E}}\left[\|[\hat{w}_{t},\hat{\theta}_{t}]-[w_{S,t}^{*},\theta_{S,t}^{*}]\|^{2}\right]\leq\big{(}\frac{\delta}{5}\cdot\frac{L}{2^{t}\lambda n^{\prime}}\big{)}^{2}$ . We can find this point efficiently using the algorithm from [20] referenced above. Then output $[\bar{w}_{t},\bar{\theta}_{t}]=[\hat{w}_{t},\hat{\theta}_{t}]+\xi_{t}$ where $\xi_{t}\sim\mathcal{N}(0,\mathbb{I}_{d}\sigma_{t}^{2})$ and $\sigma_{t}=\frac{8L\sqrt{\log(2/\delta)}}{2^{t}\lambda n^{\prime}\epsilon}$ . This implementation gives us the following result.
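The output perturbation step can be sketched as follows. The formula for $\sigma_{t}$ is the one stated above; the function names, the example parameter values, and the stand-in approximate saddle point are assumptions made only for illustration (in particular, the accelerated solver of [20] is not implemented here).

```python
import math
import random


def noise_scale(L, lam, n_prime, t, epsilon, delta):
    """sigma_t = 8 L sqrt(log(2/delta)) / (2^t * lam * n' * epsilon)."""
    return 8.0 * L * math.sqrt(math.log(2.0 / delta)) / (2 ** t * lam * n_prime * epsilon)


def perturb(approx_saddle_point, sigma, rng):
    """Add independent N(0, sigma^2) noise to every coordinate."""
    return [coord + rng.gauss(0.0, sigma) for coord in approx_saddle_point]


rng = random.Random(0)
# Hypothetical parameter values, chosen only to exercise the formula.
sigma_1 = noise_scale(L=1.0, lam=1.0, n_prime=100, t=1, epsilon=1.0, delta=0.01)
sigma_2 = noise_scale(L=1.0, lam=1.0, n_prime=100, t=2, epsilon=1.0, delta=0.01)
approx_point = [0.1, -0.2, 0.3]  # stand-in for the approximate saddle point
private_point = perturb(approx_point, sigma_1, rng)
print(sigma_1, sigma_2, private_point)
```

Because the regularization doubles from one round to the next, the noise scale halves, which is what lets the perturbed iterates satisfy a distance condition that shrinks like $\frac{B}{2^{t}}$.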

Theorem 3.

Let $\mathcal{A}_{\mathsf{emp}}$ be as described above. Then Algorithm 1 is $(\epsilon,\delta)$ -DP and when run with $\lambda=\frac{48}{B}\left({\frac{L}{\sqrt{n^{\prime}}}+\frac{L\sqrt{d\log(2/\delta)}}{n^{\prime}\epsilon}}\right)$ satisfies

[TABLE]

and runs in at most $O\left({\kappa\log(\kappa)\log(\kappa n/\delta)\log(n)}\right)$ gradient evaluations with $\kappa=O\left({n+n\beta B/L}\right)$ .

Proof of Theorem 3.

In the following, we start by proving the privacy guarantee. Then, we prove the utility guarantee, and finish by verifying the running time of the algorithm.

Privacy Guarantee: Consider any $t\in[T]$ and fix $[w_{1},\theta_{1}],...,[w_{t-1},\theta_{t-1}]$ . The stability of the regularized saddle point at round $t$ , $[w^{*}_{S,t},\theta^{*}_{S,t}]$ , is then $\frac{L}{2^{t}\lambda n^{\prime}}$ by Lemma 1. Since $\mathcal{A}_{\mathsf{emp}}$ guarantees that $\underset{}{\mathbb{E}}\left[\|[\hat{w}_{t},\hat{\theta}_{t}]-[w^{*}_{S,t},\theta^{*}_{S,t}]\|\right]\leq\frac{\delta}{5}\cdot\frac{L}{2^{t}\lambda n^{\prime}}$ , Markov’s inequality gives that, with probability at least $1-\frac{\delta}{2}$ , $\|[\hat{w}_{t},\hat{\theta}_{t}]-[w^{*}_{S,t},\theta^{*}_{S,t}]\|\leq\frac{L}{2^{t}\lambda n^{\prime}}$ . Thus with probability at least $1-\frac{\delta}{2}$ , generating $[\hat{w}_{t},\hat{\theta}_{t}]$ satisfies $\frac{2L}{2^{t}\lambda n^{\prime}}$ -uniform argument stability. Gaussian noise of scale $\sigma_{t}=\frac{8L\sqrt{\log(2/\delta)}}{2^{t}\lambda n^{\prime}\epsilon}$ therefore ensures the round is $(\epsilon,\delta)$ -DP. Parallel composition then ensures the entire algorithm is $(\epsilon,\delta)$ -DP since each phase acts on a disjoint partition of the dataset.

Utility Guarantee: We now turn to the accuracy guarantee. Specifically, we leverage the generalized convergence guarantee of Algorithm 1 given by Theorem 5 in Appendix B. This theorem guarantees that so long as the distance condition $\underset{}{\mathbb{E}}\left[\left\|[\bar{w}_{t},\bar{\theta}_{t}]-[w^{*}_{S,t},\theta^{*}_{S,t}]\right\|^{2}\right]\leq\frac{B^{2}}{12\cdot 2^{2t}}$ is satisfied for all $t\in[T]$ , one obtains convergence guarantee $\operatorname{Gap}(\mathcal{R})=O(\log(n)B^{2}\lambda)$ . That is, after the distance guarantee is established, the rest of the analysis (i.e. the proof of Theorem 5) follows the same lines as in the non-smooth case. Note under the setting of $\lambda$ in Theorem 3 we have

[TABLE]

Thus all that remains is to show that the distance condition, $\underset{}{\mathbb{E}}\left[\left\|[\bar{w}_{t},\bar{\theta}_{t}]-[w^{*}_{S,t},\theta^{*}_{S,t}]\right\|^{2}\right]\leq\frac{B^{2}}{12\cdot 2^{2t}}$ , is satisfied for all $t\in[T]$ . In this regard we have,

[TABLE]

For the first inequality, observe that the noise vector is uncorrelated with the vectors, $[\hat{w}_{t},\hat{\theta}_{t}]$ and $[w_{S,t}^{*},\theta_{S,t}^{*}]$ . For the second inequality note $\underset{}{\mathbb{E}}\left[\|[\bar{w}_{t},\bar{\theta}_{t}]-[\hat{w}_{t},\hat{\theta}_{t}]\|^{2}\right]=\underset{}{\mathbb{E}}\left[\|\xi_{t}\|^{2}\right]=d\sigma^{2}_{t}$ . Further, $\underset{}{\mathbb{E}}\left[\|[\hat{w}_{t},\hat{\theta}_{t}]-[w_{S,t}^{*},\theta_{S,t}^{*}]\|^{2}\right]$ is bounded due to the chosen implementation of $\mathcal{A}_{\mathsf{emp}}$ . The third inequality comes from the settings of $\sigma_{t}$ and the fact that $\lambda>\frac{48L}{B\sqrt{n^{\prime}}}$ . The last inequality uses the fact that $\lambda>\frac{48L\sqrt{d\log(2/\delta)}}{Bn^{\prime}\epsilon}$ .

Running Time: One can ensure that the overall algorithm runs in nearly linear time by leveraging accelerated methods to find the point $[\hat{w}_{t},\hat{\theta}_{t}]$ . The description of $\mathcal{A}_{\mathsf{emp}}$ requires that at each phase $t\in[T]$ , one has $\underset{}{\mathbb{E}}\left[\|[\hat{w}_{t},\hat{\theta}_{t}]-[w_{S,t}^{*},\theta_{S,t}^{*}]\|^{2}\right]\leq\big{(}\frac{\delta}{5}\cdot\frac{L}{2^{t}\lambda n^{\prime}}\big{)}^{2}$ , which by Lemma 2 is satisfied if the empirical gap is at most $\lambda\big{(}\frac{\delta}{5}\cdot\frac{L}{2^{t}\lambda n^{\prime}}\big{)}^{2}=\frac{\delta^{2}}{25}\cdot\frac{L^{2}}{2^{2t}\lambda(n^{\prime})^{2}}$ . For simplicity, we observe that

[TABLE]

We now apply Lemma 6 with $h_{w}(w)=\lambda\sum_{k=0}^{t-1}2^{k+1}\left\|w-\bar{w}_{k}\right\|^{2}$ , $h_{\theta}(\theta)=\lambda\sum_{k=0}^{t-1}2^{k+1}\left\|\theta-\bar{\theta}_{k}\right\|^{2}$ , $\mu=2^{t}\lambda$ and $\alpha=\frac{c_{3}\delta^{2}BL}{n^{2.5}}$ for some sufficiently small constant $c_{3}$ . This gives that the running time of phase $t$ is $O\left({\kappa_{t}\log(\kappa_{t})\log(\kappa_{t}n^{2.5}/\delta^{2})}\right)$ , where $\kappa_{t}=O\left({n+\sqrt{n}\beta/[2^{t}\lambda]}\right)=O\left({n+n\beta B/L}\right)$ . Running this implementation of $\mathcal{A}_{\mathsf{emp}}$ in each phase incurs an extra factor of $T=\log(\frac{L}{B\lambda})=O(\log(n))$ , giving the claimed running time bound of $O\left({\kappa\log(\kappa)\log(\kappa n/\delta)\log(n)}\right)$ , where $\kappa=O\left({n+n\beta B/L}\right)$ . ∎

5 On the Limitations of Previous Approaches

Prior work on DP SSPs has largely focused on the weak gap criterion. In this section, we further investigate both the importance and the challenges of bounding the strong gap rather than the weak gap. We start by considering a natural question: do there exist cases where the strong and weak gap differ substantially? We answer this question affirmatively in the following.

Proposition 1.

There exists a convex-concave function $f$ with range $[-1,+1]$ and algorithm $\mathcal{A}$ such that $\operatorname{Gap}(\mathcal{A})-\operatorname{Gap}_{\mathsf{weak}}(\mathcal{A})=2$ .

Our construction shows that this result holds even for a simple one dimensional bilinear problem.

Proof.

Consider the loss function $f(w,\theta;x)=w\theta$ , where $w,\theta,x\in[-1,1].$ Let $\mathcal{D}$ be the uniform distribution over $\left\{{\pm 1}\right\}$ . For $\left\{{x_{1},\dots,x_{n}}\right\}\sim\mathcal{D}^{n}$ consider the algorithm $\mathcal{A}$ which outputs $\bar{w}$ as the mode of the first half of the samples in $S$ and similarly $\bar{\theta}$ is set as the mode of the second half of the samples in $S$ . (Footnote 6: Without much loss of generality, we assume that $n$ is divisible by 2 but not by 4, so that the mode of each half of the data is well-defined and belongs to $\{-1,+1\}$ .) Note $\bar{w}$ and $\bar{\theta}$ are independent and distributed uniformly over $\left\{{\pm 1}\right\}$ (under the randomness from $\mathcal{D}$ ).

Now, since $\mathcal{A}$ is a deterministic function of the dataset, the randomness in $\bar{w},\bar{\theta}$ comes only from $S$ . Thus for the weak gap we have $\max\limits_{\theta\in[-1,1]}\{\underset{S}{\mathbb{E}}\left[\bar{w}\theta\right]\}-\min\limits_{w\in[-1,1]}\{\underset{S}{\mathbb{E}}\left[w\bar{\theta}\right]\}$ which evaluates to $\max_{\theta\in[-1,1]}\{\underset{S}{\mathbb{E}}\left[\bar{w}\right]\theta\}-\min_{w\in[-1,1]}\{w\underset{S}{\mathbb{E}}\left[\bar{\theta}\right]\}=0.$ However, one can see for the strong gap we have $\underset{S}{\mathbb{E}}\left[\max\limits_{\theta\in[-1,1]}\left\{{\bar{w}\theta}\right\}-\min\limits_{w\in[-1,1]}\left\{{w\bar{\theta}}\right\}\right]=\underset{S}{\mathbb{E}}\left[\left|{\bar{w}}\right|+\left|{\bar{\theta}}\right|\right]=2$ , where the first equality comes from evaluating $\theta=\mathsf{sgn}(\bar{w})$ and $w=-\mathsf{sgn}(\bar{\theta})$ in the maximization and minimization operators. ∎
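The separation in this construction is easy to reproduce numerically. The following sketch is an illustrative Monte Carlo estimate, with hypothetical function names and an arbitrary choice of $n$ and number of trials; it uses the closed forms from the proof, namely $\left|\mathbb{E}[\bar{w}]\right|+\left|\mathbb{E}[\bar{\theta}]\right|$ for the weak gap and $\mathbb{E}\left[|\bar{w}|+|\bar{\theta}|\right]$ for the strong gap.

```python
import random


def mode_of_signs(samples):
    """Mode of an odd-length list of +1/-1 values."""
    return 1 if sum(samples) > 0 else -1


def estimate_gaps(n, trials=4000, seed=0):
    """Estimate (strong gap, weak gap) of the mode-based algorithm."""
    if n % 2 != 0 or n % 4 == 0:
        raise ValueError("n must be divisible by 2 but not by 4")
    rng = random.Random(seed)
    half = n // 2
    sum_w, sum_theta, strong_total = 0, 0, 0.0
    for _ in range(trials):
        data = [rng.choice((-1, 1)) for _ in range(n)]
        w_bar = mode_of_signs(data[:half])
        theta_bar = mode_of_signs(data[half:])
        # Strong gap: the max/min are taken for each realization of the dataset.
        strong_total += abs(w_bar) + abs(theta_bar)
        sum_w += w_bar
        sum_theta += theta_bar
    # Weak gap: the max/min are taken after averaging over datasets.
    weak_gap = abs(sum_w / trials) + abs(sum_theta / trials)
    return strong_total / trials, weak_gap


strong_gap, weak_gap = estimate_gaps(n=10)
print(strong_gap, weak_gap)
```

The strong gap estimate equals 2 on every run because $|\bar{w}|=|\bar{\theta}|=1$ always, while the weak gap estimate is only sampling noise around 0.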

Observe that the generalization error w.r.t. the strong gap of this algorithm is always $0$ because the loss function does not depend on the random sample from $\mathcal{D}$ . The discrepancy between the gaps instead comes from the fact that having the expectation w.r.t. $S$ inside the max/min changes the function over which the dual/primal adversary is maximizing/minimizing. Specifically, the weak gap here measures the ability of $\theta$ to maximize the function $\theta\mapsto\bar{w}\theta$ for $\bar{w}=0$ , even though $\bar{w}=0$ does not occur for any realization of the dataset $S$ .

One might further observe that a key attribute of this construction is the high variance of the parameter vectors. One can show such behavior is in fact necessary to see such a separation; the full proof of the following statement is given in Appendix D.1.

Proposition 2.

Let $\mathcal{A}$ be an algorithm such that $\underset{\mathcal{A},S}{\mathbb{E}}\left[\left\|\mathcal{A}(S)-\mathbb{E}_{\hat{S}\sim\mathcal{D}^{n},\mathcal{A}}{\mathcal{A}(\hat{S})}\right\|^{2}\right]\leq\tau^{2},$ then if $f$ is $L$ -Lipschitz it holds that $\operatorname{Gap}(\mathcal{A})-\operatorname{Gap}_{\mathsf{weak}}(\mathcal{A})\leq L\tau.$

Tradeoff between Accuracy and Stability

An additional consequence of Proposition 2 (in conjunction with Lemma 4, which gives $\tau=\sqrt{n}\Delta$ ) is that $\Delta$ -uniform argument stability implies a $\sqrt{n}\Delta L$ generalization bound w.r.t. the strong gap that does not rely on smoothness (in contrast to the $\sqrt{L\beta\Delta}$ bound of [30] which does). We leave determining tight bounds for stability implies generalization on the strong gap as an interesting direction for future work. In this section however, we show that stronger upper bounds are likely necessary to obtain a more direct algorithm for DP-SSPs. In fact, our key result holds even for empirical risk minimization (ERM) problems. That is, for $f:\mathcal{W}\times{\cal X}\mapsto\mathbb{R}$ and $S\in\mathcal{X}^{n}$ , consider the problem of minimizing the excess empirical risk $F_{S}(w)-\min_{w\in\mathcal{W}}\left\{{F_{S}(w)}\right\}$ , where $F_{S}(w)=\frac{1}{n}\sum_{x\in S}f(w;x)$ . We have the following.

Theorem 4.

For any (possibly randomized) algorithm $\mathcal{A}:\mathcal{X}^{n}\mapsto\mathcal{W}$ which is $\Delta$ -uniform argument stable, there exists a [math]-smooth $L$ -Lipschitz loss function, $f:\mathcal{W}\times\mathcal{X}\mapsto\mathbb{R}$ , and dataset $S\in\mathcal{X}^{n}$ such that $\mathbb{E}[F_{S}(\mathcal{A}(S))-\min\limits_{w\in\mathcal{W}}\left\{{F_{S}(w)}\right\}]=\Omega\left({\frac{B^{2}L}{\Delta n}}\right)$ provided $\Delta\geq\frac{B}{\sqrt{\min\left\{{n,d}\right\}}}$ .

The proof can be found in Appendix D.2. Lemma 1 shows this bound is tight for both ERM and empirical saddle point problems. Generalization bounds are only useful when it is possible to obtain good empirical performance. Thus, the implication of this bound is that generalization error which is $O(\Delta)$ is necessary to obtain the optimal $O\left({1/\sqrt{n}}\right)$ statistical rate. To elaborate, let $H(\Delta)$ characterize some (potentially suboptimal) generalization bound for $\Delta$ stable algorithms and assume $H(\Delta)=\omega(\Delta)$ . To then bound the sum of empirical risk and generalization error, Theorem 4 implies $F_{S}(\mathcal{A}(S))-F_{S}(w^{*})+H(\Delta)=\Omega\left({\frac{1}{\Delta n}+H(\Delta)}\right)=\omega\left({\frac{1}{\Delta n}+\Delta}\right).$ Note the RHS is asymptotically larger than $\frac{1}{\sqrt{n}}$ (i.e. not optimal) for any $\Delta$ .
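The tradeoff can be made concrete with a small numerical sketch. It minimizes $\frac{1}{\Delta n}+H(\Delta)$ over a grid of stability levels for two choices of $H$: a linear bound $H(\Delta)=\Delta$ and, as one example of $H(\Delta)=\omega(\Delta)$, $H(\Delta)=\sqrt{\Delta}$. Setting $B=L=1$, dropping constants, ignoring the condition on $\Delta$ in Theorem 4, and the grid itself are all simplifying assumptions; the function name is hypothetical.

```python
import math


def best_total_error(n, generalization_bound, grid_points=6000):
    """Minimize 1/(delta * n) + H(delta) over a log-spaced grid of delta in [1e-6, 1]."""
    best_value, best_delta = float("inf"), None
    for i in range(grid_points + 1):
        delta = 10.0 ** (-6.0 + 6.0 * i / grid_points)
        total = 1.0 / (delta * n) + generalization_bound(delta)
        if total < best_value:
            best_value, best_delta = total, delta
    return best_value, best_delta


n = 10 ** 6
linear_total, linear_delta = best_total_error(n, lambda delta: delta)
sqrt_total, sqrt_delta = best_total_error(n, math.sqrt)
print(linear_total, linear_delta)  # close to 2 / sqrt(n), attained near delta = 1 / sqrt(n)
print(sqrt_total, sqrt_delta)      # noticeably larger, on the order of n ** (-1 / 3)
```

With the linear bound the best achievable total matches the $\frac{1}{\sqrt{n}}$ rate up to a constant, whereas with the weaker bound no choice of $\Delta$ recovers it.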

Acknowledgements

RB’s and MM’s research is supported by NSF CAREER Award 2144532 and NSF Award AF-1908281. CG’s research was partially supported by INRIA Associate Teams project, FONDECYT 1210362 grant, ANID Anillo ACT210005 grant, and National Center for Artificial Intelligence CENIA FB210017, Basal ANID.

Appendix A Supporting Proofs from Preliminaries

A.1 Lipschitzness of the Gap Function

Proof of Fact 1.

For any $[\bar{w},\bar{\theta}],[\bar{w}^{\prime},\bar{\theta}^{\prime}]\in\mathcal{W}\times\Theta$ we have

[TABLE]

where we used in the last inequality that $a+b\leq\sqrt{2}\sqrt{a^{2}+b^{2}}$ . ∎

A.2 Local Privacy

In the case of local differential privacy (LDP), a simple implementation of noisy SGDA (see Appendix C.1) suffices to obtain the optimal rate. We refer the reader to DJW [11] for a discussion of LDP and the matching lower bound. Consider the implementation of SGDA which defines the saddle estimator as

[TABLE]

where $\xi_{t}\sim\mathcal{N}(0,\mathbb{I}_{d}\sigma^{2})$ , $\sigma=\frac{L\sqrt{\log(1/\delta)}}{\epsilon}$ , and $x_{t}$ is sampled without replacement from $S$ . By Lemma 9 we have the following.

Corollary 2.

Let $T=n$ . Then the algorithm described above, denoted as $\mathcal{A}$ , is $(\epsilon,\delta)$ -LDP and if $\eta=\frac{B}{\sqrt{nd\log(1/\delta)}L\epsilon}$ the average iterate, $[\bar{w},\bar{\theta}]$ , satisfies $\operatorname{Gap}(\mathcal{A})=O\left({\frac{BL\sqrt{d\log(1/\delta)}}{\sqrt{n}\epsilon}}\right).$

Appendix B Missing Results from Section 3

B.1 Proof of Lemma 3

The first inequality follows from an application of Jensen’s inequality.

[TABLE]

The second inequality in the lemma statement then follows from the stability-implies-generalization result for the weak gap, which we restate below.

Lemma 7.

([23, Theorem 1], [7, Proposition 2.1]) Let the loss function $f$ be $L$ -Lipschitz and the algorithm $\mathcal{A}$ be $\Delta$ -uniform argument stable. Then $\operatorname{Gap}_{\mathsf{weak}}(\mathcal{A})\leq\underset{S}{\mathbb{E}}\left[\operatorname{Gap}_{S}(\mathcal{A})\right]+\Delta L.$

B.2 Convergence of Recursive Regularization

In this section we prove the following more general statement of Theorem 1, which will be useful later.

Theorem 5.

Let $\lambda\geq\frac{48L}{B\sqrt{n^{\prime}}}$ and $\mathcal{A}_{\mathsf{emp}}$ be such that for all $t\in[T]$ it holds that $\underset{}{\mathbb{E}}\left[\left\|[\bar{w}_{t},\bar{\theta}_{t}]-[w^{*}_{S,t},\theta^{*}_{S,t}]\right\|^{2}\right]\leq\frac{B^{2}}{12\cdot 2^{2t}}$ . Then Recursive Regularization satisfies

$$\operatorname{Gap}(\mathcal{R})=O\left({\log(n)B^{2}\lambda}\right).$$

To prove this result, it will be helpful to first show several intermediate results. We start by defining several useful quantities. Define $\left\{{\mathcal{F}_{t}}\right\}_{t=0}^{T}$ as the filtration where $\mathcal{F}_{t}$ is the sigma algebra induced by all randomness up to $[\bar{w}_{t},\bar{\theta}_{t}]$ . For every $t\in[T]$ we define

• $[w^{*}_{t},\theta^{*}_{t}]:$ saddle point of $F_{\mathcal{D}}^{(t)}(w,\theta):=\underset{x\sim\mathcal{D}}{\mathbb{E}}\left[f^{(t)}(w,\theta;x)\right]$ ;

• $[w^{*}_{S,t},\theta^{*}_{S,t}]:$ saddle point of $F_{S}^{(t)}(w,\theta):=\frac{1}{n}\sum_{x\in S}f^{(t)}(w,\theta;x)$ ;

• $[\widetilde{w}_{t},\widetilde{\theta}_{t}]:=\underset{}{\mathbb{E}}\left[[w^{*}_{S,t},\theta^{*}_{S,t}]\Big{|}\mathcal{F}_{t-1}\right]$ ;

• $\widehat{\operatorname{Gap}}^{(t)}(\bar{w},\bar{\theta}):=\max\limits_{\theta\in\Theta}\left\{{F^{(t)}_{\mathcal{D}}(\bar{w},\theta)}\right\}-\min\limits_{w\in\mathcal{W}}\left\{{{F^{(t)}_{\mathcal{D}}(w,\bar{\theta})}}\right\}:$ the gap function w.r.t. $F_{\mathcal{D}}^{(t)}$ ; and,

• $\widehat{\operatorname{Gap}}_{S}^{(t)}(\bar{w},\bar{\theta}):=\max\limits_{\theta\in\Theta}\left\{{F^{(t)}_{S_{t}}(\bar{w},\theta)}\right\}-\min\limits_{w\in\mathcal{W}}\left\{{{F^{(t)}_{S_{t}}(w,\bar{\theta})}}\right\}:$ the empirical gap function.

We now establish two distance inequalities which will be used when analyzing the final gap bound in Theorem 5. The first inequality bounds the distance of the output of the $t$ -th round to the saddle point of $F_{\mathcal{D}}^{(t)}$ . The second inequality bounds the distance of the saddle point of $F_{\mathcal{D}}^{(t)}$ to the most recent regularization point.

Lemma 8.

Assume the conditions of Theorem 5 hold. Then for every $t\in[T]$ , the following holds

P.1 $\underset{}{\mathbb{E}}\left[\left\|[\bar{w}_{t},\bar{\theta}_{t}]-[w_{t}^{*},\theta_{t}^{*}]\right\|\right]^{2}\leq\underset{}{\mathbb{E}}\left[\left\|[\bar{w}_{t},\bar{\theta}_{t}]-[w_{t}^{*},\theta_{t}^{*}]\right\|^{2}\right]\leq\frac{B^{2}}{2^{2t}}$ ; and,

P.2 $B_{t}^{2}:=\underset{}{\mathbb{E}}\left[\left\|[w_{t}^{*},\theta_{t}^{*}]-[\bar{w}_{t-1},\bar{\theta}_{t-1}]\right\|\right]^{2}\leq\underset{}{\mathbb{E}}\left[\left\|[w_{t}^{*},\theta_{t}^{*}]-[\bar{w}_{t-1},\bar{\theta}_{t-1}]\right\|^{2}\right]\leq\frac{B^{2}}{2^{2(t-1)}}$ .

Proof.

We will prove both properties via induction on $B_{1},...,B_{T}$ . Specifically, for each $t\in[T]$ we will introduce three terms $E_{t},F_{t},G_{t}$ , and show that these terms are bounded if the bound on $B_{t}$ holds and that $B_{t}$ holds if $E_{t-1},F_{t-1},G_{t-1}$ are bounded. Property P.1 is then established as a result of the fact that $\underset{}{\mathbb{E}}\left[\left\|[\bar{w}_{t},\bar{\theta}_{t}]-[w_{t}^{*},\theta_{t}^{*}]\right\|^{2}\right]\leq 3(E_{t}+F_{t}+G_{t})$ . Note that $B_{1}$ holds as the base case because $\underset{}{\mathbb{E}}\left[\left\|[w_{1}^{*},\theta_{1}^{*}]-[\bar{w}_{0},\bar{\theta}_{0}]\right\|^{2}\right]\leq B^{2}$ .

Property P.1:

We here prove that if $B_{t}$ is sufficiently bounded, then $E_{t},F_{t},G_{t}$ are bounded where for $t\in[T]$ we define

[TABLE]

Additionally, this will establish property P.1 because for any $t\in[T]$ it holds that,

[TABLE]

The second inequality comes from the strong convexity-strong concavity of the loss.

Bounding $E_{t}$ : We have that $E_{t}$ is bounded by the assumption made in the statement of Theorem 5.

Bounding $F_{t}$ :

[TABLE]

The first inequality comes from the stability of the regularized saddle point and Lemma 4. The second inequality comes from the setting of $\lambda\geq\frac{48L}{B\sqrt{n^{\prime}}}$ .

Bounding $G_{t}$ : We have

[TABLE]

The first equality comes from the definition of $[\widetilde{w}_{t},\widetilde{\theta}_{t}]$ . The first inequality comes from Lemma 3, where we consider the algorithm stated in the lemma to be the algorithm which outputs the exact regularized saddle point. Note this algorithm is $\frac{L}{2^{t}\lambda n^{\prime}}$ -uniform argument stable, so that $\Delta L=\frac{L^{2}}{2^{t}\lambda n^{\prime}}$ . The second equality comes from the fact that $[w^{*}_{S,t},\theta^{*}_{S,t}]$ is the exact empirical saddle point. The final inequality uses the same analysis as in Eqn. (7).

We thus have a final bound $3(E_{t}+F_{t}+G_{t})\leq\frac{B^{2}}{2^{2t}}$ .

Property P.2:

Now assume $B_{t-1}$ holds. We have

[TABLE]

Above $E_{t-1}$ and $F_{t-1}$ are as defined in (5). We bound the remaining squared distance term in the following. First, note that the primal function $F^{(t)}(\cdot,\theta_{t}^{*})$ is strongly convex and $\forall w\in\mathcal{W}$ it holds that $\left\langle\nabla_{w}F_{\cal D}^{(t)}(w_{t}^{*},\theta_{t}^{*}),w_{t}^{*}-w\right\rangle\leq 0$ . Similar facts hold for $-F^{(t)}(w_{t}^{*},\cdot)$ . Thus we have

[TABLE]

The second inequality comes from removing the negative norm terms. The third inequality comes from the definition of $E_{t-1}$ and $F_{t-1}$ . The second to last inequality comes from the definition of $G_{t-1}$ , as given in Eqn. (5). Plugging this result into (8) and using the previously established bounds on $E_{t-1},F_{t-1},G_{t-1}$ (which hold under the assumed bound on $B_{t-1}$ ) we have

[TABLE]

∎

We now turn to analyzing the utility of the algorithm to complete the proof.

Proof of Theorem 5.

Using the fact that $\widehat{\operatorname{Gap}}$ is $\sqrt{2}L$ -Lipschitz and property P.1, we have

[TABLE]

What remains is showing $\underset{}{\mathbb{E}}\left[\widehat{\operatorname{Gap}}(w_{T}^{*},\theta_{T}^{*})\right]$ is $\tilde{O}(B\hat{\alpha}+\frac{BL}{\sqrt{n^{\prime}}})$ . Let $w^{\prime}=\operatorname*{arg\,min}\limits_{w\in\mathcal{W}}{F_{\mathcal{D}}(w,\theta_{T}^{*})}$ and $\theta^{\prime}=\operatorname*{arg\,max}\limits_{\theta\in\Theta}{F_{\mathcal{D}}(w_{T}^{*},\theta)}$ . Using the fact that $F_{\mathcal{D}}$ is convex-concave we have

[TABLE]

where $G_{\mathcal{D}}$ is the population loss saddle operator. Further by the definition of $F^{(T)}$ and denoting $G_{\mathcal{D}}^{(T)}$ as the saddle operator for $F_{\mathcal{D}}^{(T)}$ we have

[TABLE]

Thus plugging the above into Eqn. (10) we have

[TABLE]

Above, the second inequality comes from the first-order optimality conditions for $[w_{T}^{*},\theta_{T}^{*}]$ , the third from Cauchy-Schwarz and a triangle inequality. The final equality uses the definition of the Euclidean norm and the fact that for any $a,b\in\mathbb{R}$ , $(-a-(-b))^{2}=(a-b)^{2}$ .

Taking the expectation on both sides of the above we have the following derivation,

[TABLE]

Above, $(i)$ and the following inequality both come from the triangle inequality. Equality $(ii)$ is obtained by rearranging the sums. Inequality $(iii)$ comes from applying properties P.1 and P.2 proved above. The last equality comes from the setting of $\lambda$ and $T$ .

Now using this result in conjunction with Eqn. (9) we have

[TABLE]

Above we use the fact that $T=\log(\frac{L}{B\lambda})$ and $\lambda\geq\frac{L}{B\sqrt{n^{\prime}}}$ , and thus $T=O(\log(n))$ . ∎

Finally, we prove Theorem 1 leveraging the relative accuracy assumption.

Proof of Theorem 1.

First, observe that under the setting of $\lambda=\frac{48}{B}\left({\hat{\alpha}+\frac{L}{\sqrt{n^{\prime}}}}\right)$ used in the theorem statement that $\log(n)B^{2}\lambda=O\left({\log(n)B\hat{\alpha}+\frac{\log^{3/2}(n)BL}{\sqrt{n}}}\right)$ . Thus what remains is to show that the distance condition required by Theorem 5 holds. That is, we now show that if $\mathcal{A}_{\mathsf{emp}}$ satisfies $\hat{\alpha}$ -relative accuracy, then for all $t\in[T]$ it holds that $\underset{}{\mathbb{E}}\left[\left\|[\bar{w}_{t},\bar{\theta}_{t}]-[w^{*}_{S,t},\theta^{*}_{S,t}]\right\|^{2}\right]\leq\frac{B^{2}}{12\cdot 2^{2t}}$ .

To prove this property, we must leverage the induction argument made by Lemma 8. Specifically, to prove the condition holds for some $t\in[T]$ , assume $B_{t}^{2}=\underset{}{\mathbb{E}}\left[\left\|[w_{t}^{*},\theta_{t}^{*}]-[\bar{w}_{t-1},\bar{\theta}_{t-1}]\right\|\right]^{2}\leq\frac{B^{2}}{2^{2(t-1)}}$ (recall the base case for $t=1$ trivially holds). As shown in the proof of Lemma 8, this implies that the quantities $F_{t},G_{t}$ (as defined in Eqn. (5)) are bounded by $\frac{B^{2}}{2304\cdot 2^{2t}}$ . We thus have

[TABLE]

where $B_{t}$ is as defined in property P.2. Inequality $(i)$ comes from Lemma 2. Inequality $(ii)$ comes from the $\hat{\alpha}$ -relative accuracy assumption on $\mathcal{A}_{\mathsf{emp}}$ , and the fact that each $f^{(t)}$ is $2L$ -Lipschitz. That is, observe

[TABLE]

Inequality $(iii)$ comes from a triangle inequality and the definition of $F_{t},G_{t}$ and $B_{t}$ . Inequality $(iv)$ comes from the induction hypothesis (specifically property P.2) and the bounds on $F_{t}$ and $G_{t}$ established above. The last inequality in Eqn. (B.2) comes from the setting $\lambda\geq 48\hat{\alpha}/B$ . ∎

Appendix C Missing Results from Section 4

C.1 Stochastic Gradient Descent Ascent (SGDA)

Let $F:\mathcal{W}\times\Theta\mapsto\mathbb{R}$ have saddle operator $G:\mathcal{W}\times\Theta\mapsto\mathbb{R}^{d}$ and associated strong gap $\operatorname{Gap}^{F}$. We define the SGDA algorithm in the following manner. Let $T,\eta\geq 0$. Let $[w_{0},\theta_{0}]$ be any vector in $\mathcal{W}\times\Theta$. SGDA uses the following update rule. For $t\in[T-1]$ let $\nabla_{t}$ be a random vector (which may depend on $\nabla_{1},\dots,\nabla_{t-1}$ and $[w_{0},\theta_{0}],\dots,[w_{t-1},\theta_{t-1}]$) that is an unbiased estimate of $G(w_{t-1},\theta_{t-1})$ conditional on $[w_{t-1},\theta_{t-1}]$ and has bounded variance. We define

[TABLE]

where $\Pi_{\mathcal{W}\times\Theta}$ is the orthogonal projection onto $\mathcal{W}\times\Theta$ . The output of SGDA is defined to be

[TABLE]
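The SGDA procedure described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: it assumes that $\mathcal{W}$ and $\Theta$ are Euclidean balls (so the projection onto $\mathcal{W}\times\Theta$ splits into two ball projections), that the output is the plain average of the iterates $[w_{0},\theta_{0}],\dots,[w_{T-1},\theta_{T-1}]$, and that the saddle operator is $G(w,\theta)=[\nabla_{w}F,-\nabla_{\theta}F]$. The names `sgda`, `project_ball`, and `oracle` are hypothetical.

```python
import math


def project_ball(z, radius):
    """Euclidean projection of the vector z onto the ball of the given radius."""
    norm = math.sqrt(sum(v * v for v in z))
    if norm <= radius:
        return list(z)
    return [v * radius / norm for v in z]


def sgda(oracle, w0, theta0, T, eta, radius_w, radius_theta):
    """Projected SGDA (sketch); returns the average of the T iterates.

    oracle(w, theta) returns a stochastic estimate (g_w, g_theta) of the
    saddle operator G(w, theta) = [grad_w F, -grad_theta F].
    Assumption: W and Theta are Euclidean balls centered at the origin.
    """
    w, theta = list(w0), list(theta0)
    sum_w, sum_theta = [0.0] * len(w), [0.0] * len(theta)
    for _ in range(T):
        # Accumulate the current iterate before it is updated.
        sum_w = [s + v for s, v in zip(sum_w, w)]
        sum_theta = [s + v for s, v in zip(sum_theta, theta)]
        g_w, g_theta = oracle(w, theta)
        # Step against the saddle operator, then project each block.
        w = project_ball([v - eta * g for v, g in zip(w, g_w)], radius_w)
        theta = project_ball(
            [v - eta * g for v, g in zip(theta, g_theta)], radius_theta
        )
    return [s / T for s in sum_w], [s / T for s in sum_theta]
```

For instance, with the bilinear objective $F(w,\theta)=w\theta$ on $[-1,1]^{2}$ and an exact oracle returning $(\theta,-w)$, the individual iterates rotate around the saddle point at the origin while the averaged output approaches it, which is why the guarantee is stated for the average iterate.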

We have the following result for the convergence of SGDA.

Lemma 9.

Assume $\forall t\in[T-1]$ that $\underset{}{\mathbb{E}}\left[\nabla_{t}\right]=G(w_{t},\theta_{t})$ and $\underset{}{\mathbb{E}}\left[\left\|\nabla_{t}-G(w_{t},\theta_{t})\right\|^{2}\right]\leq\tau^{2}$ , then the algorithm, $\mathcal{A}$ , that is SGDA run with parameters $T,\eta>0$ satisfies for any $w\in\mathcal{W}$ and $\theta\in\Theta$ ,

[TABLE]

This result is somewhat implicit in YHL+ [36, Lemma 3], but for completeness we provide a short proof here.

Proof.

By the convexity-concavity of $F$ we have for any $[w,\theta]\in\mathcal{W}\times\Theta$ that

[TABLE]

and thus taking the expectation (conditional on $[w_{t},\theta_{t}]$ ) and using the fact that each $\nabla_{t}$ is unbiased we have

[TABLE]

Using $2\left\langle a,b\right\rangle=\|a\|^{2}+\|b\|^{2}-\|a-b\|^{2}$ and the fact that the projection is nonexpansive, we have

[TABLE]

where in the first equality we use that $\mathbb{E}[\langle G(w_{t},\theta_{t}),G(w_{t},\theta_{t})-\nabla_{t}\rangle]=0$ , due to the unbiasedness of the stochastic oracle.

Summing over all $T$ iterations and taking the average we obtain for the average iterate, $\bar{w},\bar{\theta}$ , and any $[w,\theta]\in\mathcal{W}\times\Theta$ that

[TABLE]

∎

C.2 Private algorithm for the empirical gap (Noisy SGDA)

Here we provide an implementation of SGDA (see Appendix C.1 above) which is differentially private and yields convergence guarantees for the empirical gap. Let $M_{1},\dots,M_{T}$ each be a batch of $m=\max\left\{{n\sqrt{\frac{\epsilon}{4T}},1}\right\}$ samples, each sampled uniformly with replacement from $S$. Let $\sigma^{2}=\frac{c_{0}TL^{2}\log(1/\delta)}{n^{2}\epsilon^{2}}$ for some universal constant $c_{0}$ and let $\xi_{1},\dots,\xi_{T}$ each be sampled i.i.d. from $\mathcal{N}(0,\mathbb{I}_{d}\sigma^{2})$. We define

[TABLE]

Notice that $\nabla_{t}$ as defined above satisfies the assumptions for Lemma 9 with respect to the empirical saddle operator, $G_{S}$ , for some finite $\tau$ .
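The noisy oracle described above can be sketched as follows. This is a hedged illustration, not the paper's code: it assumes the oracle averages per-sample saddle operators over the batch $M_{t}$ and adds the Gaussian vector $\xi_{t}$, it rounds the batch size up to an integer, and it treats the universal constant $c_{0}$ as a placeholder argument whose value the text does not specify. The function names and the `per_sample_op` callback are hypothetical.

```python
import math
import random


def noise_variance(n, T, L, eps, delta, c0=1.0):
    """sigma^2 = c0 * T * L^2 * log(1/delta) / (n^2 * eps^2).

    c0 stands in for the unspecified universal constant (assumption).
    """
    return c0 * T * L ** 2 * math.log(1.0 / delta) / (n ** 2 * eps ** 2)


def batch_size(n, T, eps):
    """m = max{ n * sqrt(eps / (4T)), 1 }, rounded up to an integer (assumption)."""
    return max(math.ceil(n * math.sqrt(eps / (4.0 * T))), 1)


def noisy_oracle(per_sample_op, data, w, theta, m, sigma, rng):
    """Minibatch estimate of the empirical saddle operator plus Gaussian noise.

    per_sample_op(w, theta, x) returns the flat list
    [grad_w f(w, theta; x), -grad_theta f(w, theta; x)].
    """
    batch = [rng.choice(data) for _ in range(m)]  # uniform, with replacement
    avg = [0.0] * (len(w) + len(theta))
    for x in batch:
        g = per_sample_op(w, theta, x)
        avg = [a + gi / m for a, gi in zip(avg, g)]
    # Independent N(0, sigma^2) noise on every coordinate.
    return [a + rng.gauss(0.0, sigma) for a in avg]
```

Passing an explicit `random.Random` instance keeps the sampling reproducible in experiments; a privacy-critical deployment would need a cryptographically secure noise source, which this sketch does not attempt.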

We have the following result for SGDA run with this stochastic oracle.

Theorem 6.

Let $[w,\theta]\in\mathcal{W}\times\Theta$ be such that $\mathbb{E}\left[\left\|[w_{0},\theta_{0}]-[w,\theta]\right\|\right]\leq\hat{D}$. Let $\mathcal{A}$ be the algorithm SGDA run with $\nabla_{1},\dots,\nabla_{T}$ as described above, $T=\min\left\{{\frac{n}{8},\frac{n^{2}\epsilon^{2}}{32d\log(1/\delta)}}\right\}$, and $\eta=\frac{\hat{D}}{L\sqrt{T}}$. Algorithm $\mathcal{A}$ is $(\epsilon,\delta)$-DP, has gradient complexity $O\left({\min\left\{{\frac{n^{2}\epsilon^{1.5}}{\sqrt{d\log(1/\delta)}},n^{3/2}}\right\}}\right)$, and satisfies

[TABLE]

The proof of the utility guarantee follows directly from applying Lemma 9 with $\tau=O(L+\sqrt{d}\sigma)=O(L)$ . The proof of the privacy guarantee relies on the moments accountant analysis, for which we provide the following restatement.
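The two quantitative claims in Theorem 6 can be checked directly from the parameter settings. The following is a sketch of the arithmetic, assuming $\epsilon\leq 1$ and the regime in which the batch size is $m=n\sqrt{\frac{\epsilon}{4T}}\geq 1$. For the variance, using $T\leq\frac{n^{2}\epsilon^{2}}{32d\log(1/\delta)}$,

$$d\sigma^{2}=\frac{c_{0}\,d\,T\,L^{2}\log(1/\delta)}{n^{2}\epsilon^{2}}\leq\frac{c_{0}L^{2}}{32},\qquad\text{so}\qquad\sqrt{d}\,\sigma=O(L).$$

For the gradient complexity, the total number of per-sample gradients is

$$Tm=Tn\sqrt{\frac{\epsilon}{4T}}=\frac{n}{2}\sqrt{\epsilon T}\leq\min\left\{\frac{n^{2}\epsilon^{1.5}}{2\sqrt{32\,d\log(1/\delta)}},\;\frac{n^{3/2}\sqrt{\epsilon}}{2\sqrt{8}}\right\}=O\left(\min\left\{\frac{n^{2}\epsilon^{1.5}}{\sqrt{d\log(1/\delta)}},\;n^{3/2}\right\}\right),$$

where the two terms in the minimum use $T\leq\frac{n^{2}\epsilon^{2}}{32d\log(1/\delta)}$ and $T\leq\frac{n}{8}$ respectively.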

Theorem 7 ([2, 22]).

Let $\epsilon,\delta\in(0,1]$ and let $c$ be a universal constant. Let $D\in\mathcal{Y}^{n}$ be a dataset over some domain $\mathcal{Y}$, and let $h_{1},\dots,h_{T}:\mathcal{Y}\mapsto\mathbb{R}^{d}$ be a series of (possibly adaptive) queries such that for any $y\in\mathcal{Y}$, $t\in[T]$, $\left\|h_{t}(y)\right\|_{2}\leq L$. Let $\sigma\geq\frac{cL\sqrt{T\log(1/\delta)}}{n\epsilon}$ and $T\geq\frac{n^{2}\epsilon}{b^{2}}$. Then the algorithm which samples batches $B_{1},\dots,B_{T}$ of size $b$ uniformly at random and outputs $\frac{1}{b}\sum_{y\in B_{t}}h_{t}(y)+g_{t}$ for all $t\in[T]$, where $g_{t}\sim\mathcal{N}(0,\mathbb{I}_{d}\sigma^{2})$, is $(\epsilon,\delta)$-DP.

It can be verified for the described noisy SGDA implementation that $\sigma\geq\frac{c_{1}L\sqrt{T\log(1/\delta)}}{n\epsilon}$ and $T\geq\frac{n^{2}\epsilon}{m^{2}}$ and thus the algorithm is $(\epsilon,\delta)$ -DP.

Appendix D Missing Result from Section 5

D.1 Low variance and weak gap implies strong gap

Proof of Proposition 2.

Consider the virtual algorithm, $\mathcal{B}(\mathcal{A},\mathcal{D})=\underset{\hat{S}\sim\mathcal{D}^{n},\mathcal{A}}{\mathbb{E}}\left[\mathcal{A}(\hat{S})\right]=[\widetilde{w},\widetilde{\theta}]$. Note this algorithm is deterministic and does not depend on any specific dataset drawn from $\mathcal{D}$. We first show that the gap function at the output of $\mathcal{B}$ is bounded by the weak gap of $\mathcal{A}$. We have

[TABLE]

where the second equality follows from the definition of $\mathcal{B}$ and the inequality follows from Jensen’s inequality.

Now by the assumption that $\mathcal{A}$ is low variance, we have

[TABLE]

Thus using the Lipschitzness of $\widehat{\operatorname{Gap}}$ we obtain

[TABLE]

The first inequality comes from Eqn. (15). The second inequality comes from the Lipschitzness of the gap function. The third inequality comes from Eqn. (16). Thus we ultimately have

[TABLE]

∎

D.2 Stability-Risk Tradeoff

Proof of Theorem 4.

Let $f(w;x)=\left\langle w,x\right\rangle$. Let $0<K<\min\left\{{n,d}\right\}$ be a parameter to be chosen later and define $U=\left\{{\pm 1}\right\}^{K}$. For any $\boldsymbol{\sigma}\in U$ define $S_{\boldsymbol{\sigma}}=\left\{{L\boldsymbol{\sigma}_{1}e_{1},\dots,L\boldsymbol{\sigma}_{K}e_{K},0,\dots,0}\right\}$, where $e_{j}$ is the $j$-th standard basis vector. We will denote $F(w;S_{\boldsymbol{\sigma}})=\frac{1}{n}\sum_{x\in S_{\boldsymbol{\sigma}}}f(w;x)$. Note that

[TABLE]

Further, for any $\boldsymbol{\sigma}\in U$ , $F(w_{\boldsymbol{\sigma}}^{*};S_{\boldsymbol{\sigma}})=-\frac{BL\sqrt{K}}{n}$ .

By Yao’s minimax principle, it suffices to consider deterministic algorithms and lower bound the expected risk w.r.t. some distribution over the packing. Considering the uniform distribution over the packing and setting $K=\frac{B^{2}}{\Delta^{2}}$ we have

[TABLE]

where $(i)$ comes from the definition of the loss function and the fact that the dataset consists of $K$ standard basis vectors (up to sign) and $n-K$ zero vectors, and $(ii)$ comes from the $\Delta=\frac{B}{\sqrt{K}}$ stability property of $\mathcal{A}$ (i.e. $\mathcal{A}(S_{\boldsymbol{\sigma}_{-j}})_{j}-\mathcal{A}(S_{\boldsymbol{\sigma}})_{j}\leq\Delta\implies\mathcal{A}(S_{\boldsymbol{\sigma}})_{j}-\mathcal{A}(S_{\boldsymbol{\sigma}_{-j}})_{j}\geq-\Delta$). Finally, note that by the setting of $K$ we have $\frac{BL\sqrt{K}}{n}=\frac{B^{2}L}{\Delta n}$. ∎
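The optimal value $-\frac{BL\sqrt{K}}{n}$ used in this construction can be checked numerically. The sketch below assumes the parameter domain is the Euclidean ball of radius $B$ (suggested by the appearance of $B$ in the bound but not restated in this passage), in which case a linear loss is minimized by the point of norm $B$ opposite to its gradient. The helper names are hypothetical.

```python
import math


def empirical_loss(w, sigma, L, n):
    """F(w; S_sigma) = (1/n) * sum_j <w, L * sigma_j * e_j>.

    The n - K zero samples contribute nothing to the sum.
    """
    return sum(L * s * w[j] for j, s in enumerate(sigma)) / n


def ball_minimizer(sigma, B, d):
    """Minimizer of F(.; S_sigma) over the radius-B Euclidean ball (assumed domain)."""
    K = len(sigma)
    return [-B * s / math.sqrt(K) for s in sigma] + [0.0] * (d - K)
```

With $K=4$, $B=2$, $L=3$, $n=10$, the minimizer has norm $B$ and loss $-\frac{BL\sqrt{K}}{n}=-1.2$; setting $\Delta=\frac{B}{\sqrt{K}}=1$ gives the same magnitude through $\frac{B^{2}L}{\Delta n}=1.2$.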

Bibliography


[1] [ABG+22] Raman Arora, Raef Bassily, Cristóbal Guzmán, Michael Menart, and Enayat Ullah. Differentially private generalized linear models revisited. In Advances in Neural Information Processing Systems, volume 35. Curran Associates, Inc., 2022.
[2] [ACG+16] Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. CCS ’16, pages 308–318, New York, NY, USA, 2016. Association for Computing Machinery.
[3] [AFKT21] Hilal Asi, Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimization: Optimal rates in $\ell_{1}$ geometry. In International Conference on Machine Learning, 2021.
[4] [AZ18] Zeyuan Allen-Zhu. How to make the gradients small stochastically: Even faster convex and nonconvex SGD. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.
[5] [BE02] Olivier Bousquet and André Elisseeff. Stability and generalization. The Journal of Machine Learning Research, 2:499–526, 2002.
[6] [BFTT19] Raef Bassily, Vitaly Feldman, Kunal Talwar, and Abhradeep Guha Thakurta. Private stochastic convex optimization with optimal rates. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché-Buc, Emily B. Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 11279–11288, 2019.
[7] [BG23] Digvijay Boob and Cristóbal Guzmán. Optimal algorithms for differentially private stochastic monotone variational inequalities and saddle-point problems. Mathematical Programming, pages 1–43, 2023.
[8] [BGN21] Raef Bassily, Cristóbal Guzmán, and Anupama Nandi. Non-Euclidean differentially private stochastic convex optimization. In Mikhail Belkin and Samory Kpotufe, editors, Proceedings of Thirty Fourth Conference on Learning Theory, volume 134 of Proceedings of Machine Learning Research, pages 474–499. PMLR, 15–19 Aug 2021.
