Differentially Private Algorithms for the Stochastic Saddle Point Problem with Optimal Rates for the Strong Gap
Raef Bassily, Crist\'obal Guzm\'an, Michael Menart

TL;DR
This paper develops differentially private algorithms for stochastic saddle point problems, achieving near-optimal convergence rates and analyzing the tradeoff between stability and accuracy in such settings.
Contribution
It introduces a novel recursive regularization technique for saddle point problems under differential privacy constraints, achieving optimal rates and providing a general algorithm framework.
Findings
Achieves nearly optimal strong gap rates of $\tilde{O}\big(\frac{1}{\sqrt{n}}+\frac{\sqrt{d}}{n\epsilon}\big)$
Develops a general algorithm with $O\big(\min\big\{\frac{n^{2}\epsilon^{1.5}}{\sqrt{d}},n^{3/2}\big\}\big)$ gradient complexity
Establishes a fundamental tradeoff between stability and accuracy in differentially private algorithms.
Raef Bassily, Department of Computer Science & Engineering and the Translational Data Analytics Institute (TDAI), The Ohio State University
Cristóbal Guzmán, Institute for Mathematical and Computational Engineering, Faculty of Mathematics and School of Engineering, Pontificia Universidad Católica de Chile
Michael Menart, Department of Computer Science & Engineering, The Ohio State University
Abstract
We show that convex-concave Lipschitz stochastic saddle point problems (also known as stochastic minimax optimization) can be solved under the constraint of $(\epsilon,\delta)$-differential privacy with a strong (primal-dual) gap rate of $\tilde{O}\big(\frac{1}{\sqrt{n}}+\frac{\sqrt{d}}{n\epsilon}\big)$, where $n$ is the dataset size and $d$ is the dimension of the problem. This rate is nearly optimal, based on existing lower bounds in differentially private stochastic convex optimization. Specifically, we prove a tight upper bound on the strong gap via a novel implementation and analysis of the recursive regularization technique repurposed for saddle point problems. We show that this rate can be attained with $O\big(\min\big\{\frac{n^{2}\epsilon^{1.5}}{\sqrt{d}},n^{3/2}\big\}\big)$ gradient complexity, and with nearly linear gradient complexity if the loss function is smooth. As a byproduct of our method, we develop a general algorithm that, given black-box access to a subroutine satisfying a certain $\alpha$ primal-dual accuracy guarantee with respect to the empirical objective, gives a solution to the stochastic saddle point problem with a strong gap of $\tilde{O}\big(\alpha+\frac{1}{\sqrt{n}}\big)$. We show that this $\alpha$-accuracy condition is satisfied by standard algorithms for the empirical saddle point problem such as the proximal point method and the stochastic gradient descent ascent algorithm. Finally, to emphasize the importance of the strong gap as a convergence criterion compared to the weaker notion of primal-dual gap, commonly known as the weak gap, we show that even for simple problems it is possible for an algorithm to have zero weak gap and suffer from $\Omega(1)$ strong gap. We also show that there exists a fundamental tradeoff between stability and accuracy. Specifically, we show that any $\Delta$-stable algorithm has empirical gap $\Omega\big(\frac{1}{\Delta n}\big)$, and that this bound is tight. This result also holds more specifically for empirical risk minimization problems and may be of independent interest.
1 Introduction
Stochastic (convex-concave) saddle point problems (SSP) (also referred to in the literature as stochastic minimax optimization problems) are an increasingly important model for modern machine learning, arising in areas such as stochastic optimization [27, 19, 39], robust statistics [37], and algorithmic fairness [25, 35]. (Footnote 1: In this work, we focus exclusively on the case where the function of interest for the stochastic saddle-point problem is convex-concave, and therefore we omit this qualifier from the problem denomination.)
On the other hand, the reliance of modern machine learning on large datasets has led to concerns of user privacy. These concerns in turn have led to a variety of privacy standards, of which differential privacy (DP) has become the premier standard. However, for a variety of machine learning problems it is known that their differentially-private counterparts have provably worse rates. As such, characterizing the fundamental cost of differential privacy has become an important problem.
Currently, the theory of solving SSPs under differential privacy has major limitations, compared to its non-private counterpart. To illustrate this point, we need to discuss the notions of accuracy used in the literature. In SSPs, the goal is to find an approximate solution of the problem
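$$\min_{w\in\mathcal{W}}\max_{\theta\in\Theta}\Big\{F_{\mathcal{D}}(w,\theta):=\mathbb{E}_{x\sim\mathcal{D}}\big[f(w,\theta;x)\big]\Big\},\qquad(1)$$
where $\mathcal{D}$ is an unknown distribution for which we have access to an i.i.d. sample $S\sim\mathcal{D}^{n}$. Given a (randomized) algorithm $\mathcal{A}$ with output $[\mathcal{A}_{w}(S),\mathcal{A}_{\theta}(S)]$, two studied measures of performance are the strong and weak gap, defined respectively as
$$\mathrm{Gap}(\mathcal{A})=\mathbb{E}_{\mathcal{A},S}\Big[\max_{\theta\in\Theta}F_{\mathcal{D}}(\mathcal{A}_{w}(S),\theta)-\min_{w\in\mathcal{W}}F_{\mathcal{D}}(w,\mathcal{A}_{\theta}(S))\Big],\qquad(2)$$
$$\mathrm{Gap}_{\mathrm{weak}}(\mathcal{A})=\mathbb{E}_{\mathcal{A}}\Big[\max_{\theta\in\Theta}\mathbb{E}_{S}\big[F_{\mathcal{D}}(\mathcal{A}_{w}(S),\theta)\big]-\min_{w\in\mathcal{W}}\mathbb{E}_{S}\big[F_{\mathcal{D}}(w,\mathcal{A}_{\theta}(S))\big]\Big].\qquad(3)$$
(Footnote 2: The weak gap is sometimes stated with $\mathbb{E}_{\mathcal{A}}$ taken inside the max. However [7] showed this was not necessary to obtain the stability implies generalization result used in various works.)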
It is easy to see that the strong gap upper bounds the weak gap, and thus it is a stronger accuracy measure. On the other hand, even for simple problems, the difference between these measures can be $\Omega(1)$; a fact we elaborate on in Section 5. We also note that the strong gap has a clear game-theoretic interpretation: if we consider $w$ and $\theta$ as the actions of two players in a (stochastic) zero-sum game, the strong gap upper bounds the most profitable unilateral deviation for either of the two players. In game theory this is known as an approximate Nash equilibrium. By contrast, there is no general guarantee associated with the weak gap.
Non-privately, it is known how to achieve optimal rates w.r.t. the strong gap, and those rates are similar to those established for stochastic convex optimization (SCO) [27, 19]. However, for DP methods optimal rates are only known for the weak gap [7, 36, 40]. In a nutshell, the main limitation of these approaches is that, in order to amplify privacy, they make multiple passes over the data (e.g., by sampling with replacement stochastic gradients from the dataset), and the existing theory of generalization for SSPs is much more limited than it is for SCO [38, 23, 30]. Our approach largely circumvents the current limitations of generalization theory for SSPs, providing the first nearly-optimal rates for the strong gap in DP-SSP.
1.1 Contributions
In this work, we establish the optimal rates on the strong gap for DP-SSP. In the following, we let $n$ be the number of samples, $d$ be the dimension, and $\epsilon,\delta$ be the privacy parameters. Our main result is an $(\epsilon,\delta)$-DP algorithm for SSP whose strong gap is $\tilde{O}\big(\frac{1}{\sqrt{n}}+\frac{\sqrt{d}}{n\epsilon}\big)$. This rate is nearly optimal, due to matching lower bounds for differentially private SCO [9, 6]. These minimization lower bounds hold for saddle point problems since minimization problems are a special case of saddle point problems when $\Theta$ is constrained to be a singleton. For non-smooth loss functions, we show this rate can be obtained in gradient complexity $O\big(\min\big\{\frac{n^{2}\epsilon^{1.5}}{\sqrt{d}},n^{3/2}\big\}\big)$. This improves even upon the previous best known running time for achieving analogous rates on the weak gap [36]. Furthermore, we show that if the loss function is smooth, this rate can be achieved in nearly linear gradient complexity.
In order to obtain an upper bound for this problem, we present a novel analysis of the recursive regularization algorithm of [4]. Our work is the first to show how the sequential regularization approach can be repurposed to provide an algorithmic framework for attaining optimal strong gap guarantees for DP-SSP. As a byproduct of our analysis, we show that empirical saddle point solvers which satisfy a certain relative accuracy guarantee can be used as a black box to obtain a guarantee on the strong (population) gap. This class of algorithms includes common techniques such as the proximal point method, the extragradient method, and stochastic gradient descent ascent (SGDA) [24, 26, 19]. This fact may be of interest independent of differential privacy since, to the best of our knowledge, existing algorithms which achieve the optimal rate on the strong population gap rely crucially on a one-pass structure which optimizes the population gap directly [27].
Under the additional assumption that the loss function is smooth, we show that it is possible to use recursive regularization to obtain the optimal strong gap rate in nearly linear time. We here leverage accelerated algorithms for smooth and strongly convex/strongly concave loss functions [31, 20].
Our results stand in contrast to previous work on DP-SSPs, which has achieved optimal rates only for the weak gap and has crucially relied on “stability implies generalization” results for the weak gap. In this vein, we prove that even for simple problems, the strong and weak gap may differ by $\Omega(1)$. We also elucidate the challenges of extending existing techniques to strong gap guarantees by showing a fundamental tradeoff between stability and empirical accuracy. Specifically, we show that even for the more specific case of empirical risk minimization, any algorithm which is $\Delta$-uniform argument stable must have empirical risk $\Omega\big(\frac{1}{\Delta n}\big)$. We also show this bound is tight, and note that it may be of independent interest. Such a tradeoff was also investigated by [10], but their result only implies such a tradeoff in one specific case, and their proof technique is unrelated to ours.
1.2 Related Work
Differentially private stochastic optimization has been extensively studied for over a decade [18, 9, 21, 34, 6, 13, 3, 8]. Among such problems, stochastic convex minimization (where problem parameters are measured in the $\ell_{2}$-norm) is perhaps the most widely studied, where it is known the optimal rate is $\tilde{O}\big(\frac{1}{\sqrt{n}}+\frac{\sqrt{d}}{n\epsilon}\big)$ [6, 9]. Further, under smoothness assumptions such rates can be obtained in linear (in the sample size) gradient complexity [14]. Without smoothness, no linear time algorithms which achieve the optimal rates are known [22].
The study of stochastic saddle point problems under differential privacy is comparatively newer. In the non-private setting, optimal guarantees on the strong gap have been known as far back as [29]. Under privacy (without strong convexity/strong concavity), optimal rates are known only for the weak gap. These rates have been obtained by several works [7, 36, 40]. The work of [40] additionally showed that under smoothness assumptions such a result could be obtained in near linear gradient complexity by leveraging accelerated methods [20, 31]. All of these results are for the weak gap and they rely crucially on the fact that, for the weak gap, $\Delta$-stability implies $\Delta$-generalization [38].
By contrast, for the strong gap (without strong convexity/strong concavity assumptions), the best stability implies generalization result is a bound obtained by [30], provided the loss is smooth. As a result of this discrepancy, known bounds on the strong gap under privacy are worse. The best known rates for the strong gap are those of [7]. This rate was obtained through a mixture of noisy stochastic extragradient and noisy inexact proximal point methods, avoiding stability arguments altogether and instead relying on one-pass algorithms which optimize the population loss directly. Without smoothness, we are not aware of any work which provides bounds on the strong gap under privacy, but one may note that a straightforward implementation of one-pass noisy SGDA leads to a rate of $O\big(\frac{\sqrt{d}}{\sqrt{n}\epsilon}\big)$ in this setting. We give these details in Appendix A.2 and note this same algorithm establishes the optimal rate for SSPs under local differential privacy.
Finally, under the stringent assumptions of $\mu$-strong convexity/strong concavity ($\mu$-SC/SC) and smoothness with constant condition number, optimal rates on the strong gap have been obtained [40]. Under these assumptions, the optimal rate of $O\big(\frac{1}{\mu n}+\frac{d}{\mu n^{2}\epsilon^{2}}\big)$ was achieved by leveraging the fact that stability implies generalization [38]. The lower bound for this rate comes from lower bounds for the minimization setting [17, 6].
2 Preliminaries
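Throughout, we consider the space $\mathbb{R}^{d}$ endowed with the standard $\ell_{2}$ norm $\|\cdot\|$. Let the primal parameter space $\mathcal{W}$ and the dual parameter space $\Theta$ be compact convex sets such that $\mathcal{W}\times\Theta\subset\mathbb{R}^{d}$ for some $d>0$. Let $\mathcal{D}$ be some distribution over data domain $\mathcal{X}$. Consider the stochastic saddle-point problem given in equation (1) for some loss function $f:\mathcal{W}\times\Theta\times\mathcal{X}\to\mathbb{R}$ that is convex w.r.t. $w$ and concave w.r.t. $\theta$. We define the corresponding population loss and empirical loss functions as $F_{\mathcal{D}}(w,\theta)=\mathbb{E}_{x\sim\mathcal{D}}[f(w,\theta;x)]$ and $F_{S}(w,\theta)=\frac{1}{n}\sum_{x\in S}f(w,\theta;x)$ respectively. For some $B>0$ we assume that $\max_{z,z'\in\mathcal{W}\times\Theta}\|z-z'\|\leq B$. To simplify notation, for vectors $w\in\mathcal{W}$ and $\theta\in\Theta$, we will use $[w,\theta]$ to denote their concatenation, noting $[w,\theta]$ is a vector in $\mathbb{R}^{d}$. We primarily consider the case where $f$ is $L$-Lipschitz, but will also consider the additional assumption of $\beta$-smoothness for certain results. (Footnote 3: Throughout, any properties for $f$ are considered as a function of $[w,\theta]$. No assumptions about $f$ w.r.t. $x$ are made.) Specifically, these assumptions are that for all $x\in\mathcal{X}$ and $z_{1},z_{2}\in\mathcal{W}\times\Theta$:
$$|f(z_{1};x)-f(z_{2};x)|\leq L\|z_{1}-z_{2}\|\qquad\text{and}\qquad\|\nabla f(z_{1};x)-\nabla f(z_{2};x)\|\leq\beta\|z_{1}-z_{2}\|.$$
Under such assumptions (in fact, smoothness is not necessary), a solution for problem (1) always exists [33], which we will henceforth call a saddle point. Further, given an SSP (1), we will denote a saddle point as $[w^{*},\theta^{*}]$.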
Gap functions
In addition to the strong and weak gap functions defined in equations (2) and (3), it will be useful to define the following gap function expressed as a function of the parameter vector instead of the algorithm,
We have the following useful fact regarding (see Appendix A for a proof).
Fact 1.
If is -Lipschitz then is -Lipschitz.
Note the strong gap can be written as an expectation of the gap function. Further, since the gap function is zero if and only if is a solution for problem (1), the strong gap is considered the most suitable measure of accuracy for SSPs [28, 19]. We also define the empirical gap as, We will consider at various points the notion of generalization error with respect to the strong/weak gap, which refers to difference between the strong/weak gap and the empirical gap. Note that because the empirical gap treats the dataset as a fixed quantity, there are not differing strong and weak versions of the empirical gap.
Saddle Operator
Define the saddle operator as Similarly define and . Note that the assumption on the smoothness of implies the Lipschitzness of . We note that since the saddle operator can be computed using one computation of the gradient, we refer indistinctly to saddle operator complexity or gradient complexity when discussing the running time of our algorithms.
Stability
We will also use the notion of uniform argument stability frequently in our analysis [5].
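Definition 1.
A randomized algorithm $\mathcal{A}$ satisfies $\Delta$-uniform argument stability if for any pair of adjacent datasets $S,S'$ it holds that $\mathbb{E}_{\mathcal{A}}\big[\|\mathcal{A}(S)-\mathcal{A}(S')\|\big]\leq\Delta$.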
A fact we will use is that the (constrained) regularized saddle point is stable. Specifically, for some $\lambda>0$, $w'\in\mathcal{W}$, and $\theta'\in\Theta$, consider the regularized objective function
$$(w,\theta)\mapsto F_{S}(w,\theta)+\frac{\lambda}{2}\|w-w'\|^{2}-\frac{\lambda}{2}\|\theta-\theta'\|^{2}.\qquad(4)$$
It is easy to see that this problem has a unique saddle point. The mapping which selects its output according to the unique solution of (4) has the following stability property.
Lemma 1.
[38, Lemma 1] The algorithm which outputs the regularized saddle point with parameters $\lambda$, $w'$, and $\theta'$ is $\big(\frac{2L}{\lambda n}\big)$-uniform argument stable w.r.t. $\|\cdot\|$.
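The stability bound of Lemma 1 can be checked numerically on a toy instance. The sketch below is an illustration under assumptions that are ours, not the paper's: a scalar linear loss $f(w,\theta;x)=x(w-\theta)$ with $|x|\leq 1$ (so $L=\sqrt{2}$), regularization centers $w'=\theta'=0$, and the constraint set $[-1,1]^{2}$. For this loss the regularized saddle point has a closed form, so no solver is needed.

```python
import numpy as np

def regularized_saddle_point(xs, lam, radius=1.0):
    """Saddle point of mean(x)*(w - theta) + lam/2*w**2 - lam/2*theta**2 on [-radius, radius]^2."""
    m = float(np.mean(xs))
    # First-order conditions: m + lam*w = 0 and -m - lam*theta = 0.
    # In one dimension the constrained solution is the clipped unconstrained one.
    w = float(np.clip(-m / lam, -radius, radius))
    theta = float(np.clip(-m / lam, -radius, radius))
    return np.array([w, theta])

def stability_gap(xs, xs_neighbor, lam):
    """Return (observed argument change, the 2L/(lam*n) bound) for two adjacent datasets."""
    lipschitz = np.sqrt(2.0)  # ||(x, -x)|| <= sqrt(2) when |x| <= 1
    z = regularized_saddle_point(xs, lam)
    z_neighbor = regularized_saddle_point(xs_neighbor, lam)
    diff = float(np.linalg.norm(z - z_neighbor))
    bound = float(2.0 * lipschitz / (lam * len(xs)))
    return diff, bound

# Adjacent datasets: one point moved from -1 to +1 (the worst case for this toy loss).
xs = np.linspace(-1.0, 1.0, 50)
xs_neighbor = xs.copy()
xs_neighbor[0] = 1.0
diff, bound = stability_gap(xs, xs_neighbor, lam=2.0)
print(diff, bound)
```

On this instance the worst-case replacement attains the bound, which suggests that the $\frac{2L}{\lambda n}$ dependence cannot be improved for this toy loss.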
In addition to the stability of the regularized saddle point, we will also frequently use the following fact.
Lemma 2.
[38, Theorem 1]
Let be -SC/SC with saddle point and gap function . For any it holds that .*
Differential Privacy (DP) [12]:
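An algorithm $\mathcal{A}$ is $(\epsilon,\delta)$-differentially private if for all datasets $S$ and $S'$ differing in one data point and all events $\mathcal{E}$ in the range of $\mathcal{A}$, we have $\mathbb{P}[\mathcal{A}(S)\in\mathcal{E}]\leq e^{\epsilon}\,\mathbb{P}[\mathcal{A}(S')\in\mathcal{E}]+\delta$.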
3 From Empirical Saddle Point to Strong Gap Guarantee via Recursive Regularization
Our approach for obtaining near optimal rates on the strong gap leverages the recursive regularization technique of [4]. In addition to adapting this algorithm to fit SSP problems, we also provide a novel analysis which differs substantially from the analysis presented in previous work [16, 1].
Our recursive regularization algorithm works by solving a series of regularized objectives, , with increasingly large regularization parameters. Specifically, after solving the ’th objective to obtain , the algorithm creates a new objective which is for the subsequent round. Notice that each subsequent objective is easier in the sense that the strong convexity parameter is larger.
Our analysis will leverage the fact that approximate solutions to intermediate objectives do not need to obtain good bounds on the strong gap for the regularization parameter to be increased. This is in contrast to, for example, the iterative regularization technique of [40], which finds that satisfies a near optimal (weak) gap bound before adding noise.
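The round structure described above can be sketched as follows. This is a minimal illustration, not Algorithm 1 itself: the doubling weights $2^{t}\lambda$, the accumulation of regularizers across rounds, the equal-size disjoint batches, and the toy loss $f(w,\theta;x)=(w-x)\theta$ used by the hypothetical `gda_solver` are all assumptions made for the example.

```python
import numpy as np

def recursive_regularization(data, solver, lam, num_rounds, z0):
    """Solve a sequence of increasingly regularized problems, one disjoint batch per round."""
    batches = np.array_split(np.asarray(data, dtype=float), num_rounds)
    z = np.asarray(z0, dtype=float)
    reg_terms = []  # (weight, center) pairs; each round adds one more regularizer
    for t, batch in enumerate(batches, start=1):
        # New regularizer centered at the previous output, with doubled weight.
        reg_terms.append((2.0 ** t * lam, z.copy()))
        z = solver(batch, list(reg_terms), z)
    return z

def gda_solver(batch, reg_terms, z_init, steps=500, radius=1.0):
    """Hypothetical subroutine: projected gradient descent ascent on the toy loss
    f(w, theta; x) = (w - x) * theta plus sum_k c_k * (|w - w_k|^2 - |theta - theta_k|^2)."""
    x_bar = float(np.mean(batch))
    mu = 2.0 * sum(c for c, _ in reg_terms)   # strong monotonicity of the regularized operator
    eta = mu / (1.0 + mu) ** 2                # step mu / ell^2 with Lipschitz constant ell <= 1 + mu
    z = np.array(z_init, dtype=float)
    for _ in range(steps):
        w, theta = z
        g = np.array([theta, x_bar - w])      # saddle operator (df/dw, -df/dtheta)
        for c, center in reg_terms:
            g += 2.0 * c * (z - center)       # contribution of each regularizer
        z = np.clip(z - eta * g, -radius, radius)
    return z

z_out = recursive_regularization(np.full(12, 0.25), gda_solver, lam=0.5, num_rounds=3, z0=[0.0, 0.0])
print(z_out)
```

Passing the solver as a callable mirrors the black-box role of the subroutine: any empirical saddle point solver with the same signature could be substituted.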
Empirical Subroutine
Recursive regularization utilizes a subroutine, , which is roughly an approximate empirical saddle point solver. In addition to a dataset and Lipschitz loss function, takes as input an initial point and a bound, , on the expected distance between the initial point and the saddle point of the empirical loss defined over the input dataset. At round this distance is bounded by , allowing the algorithm to obtain increasingly strong accuracy guarantees for each subproblem. Note also it can be verified that for all , is -Lipschitz due the scaling of the regularization. Specifically, the accuracy guarantee of interest is the following.
Definition 2** (-relative accuracy).**
Given a dataset , loss function , and an initial point we say that satisfies -relative accuracy w.r.t. the empirical saddle point of if, , whenever , the output of satisfies .
The relative accuracy guarantee for the subroutine differs from the more standard gap guarantee, and is not necessarily implied by a bound on the empirical gap. The motivation for this notion of accuracy is twofold. First, when the loss function is additionally SC/SC, this guarantee is sufficient to provide a bound on the distance between the output of the subroutine and the saddle point, which will play a crucial role in our convergence proof for Algorithm 1. Second, while it is certainly true that a bound on the empirical gap implies the same bound on , for any given , it is not necessarily the case that the gap itself may enjoy a bound that is proportional to the initial distance to the saddle point. (Footnote 4: [15, Theorem 4] claims such a bound on the primal risk, but this is due to a misapplication of [24, Lemma 2].) The reason is that the gap function is defined by a supremum that is taken w.r.t. the whole feasible set , and thus the information of the evaluation of the objective w.r.t. particular points is lost. However, it is usually the case that saddle point solvers provide a bound of the form , for all , and some initial point . Algorithms such as the proximal point method, extragradient method, and SGDA (with appropriately tuned learning rate) satisfy this condition, and thus satisfy the condition for relative accuracy [24, 26, 19].
Guarantees of Recursive Regularization
Given such an algorithm, recursive regularization achieves the following guarantee.
Theorem 1.
Let satisfy -relative accuracy for any -Lipschitz loss function and dataset of size . Then Algorithm 1, run with as a subroutine and , satisfies
[TABLE]
Recall that is a bound on the diameter of the constraint set. In the following, we will sketch the proof of this theorem and highlight key lemmas. We defer the full proof to Appendix B.2. For simplicity, let us here consider the case where . A crucial aspect of our proof is that we avoid the need to bound the strong gap of the actual iterates, . Instead, we bound the strong gap of the expected iterates, where the expectation is taken with respect to . More concretely, consider some and let be the algorithm which on input outputs . Note is deterministic and data independent. As a result, it is possible to prove bounds on the strong gap of .
Lemma 3.
Let . For any -uniform argument stable algorithm , it holds that
[TABLE]
The proof follows straightforwardly from an application of Jensen’s inequality and the “stability implies generalization” result for the weak gap [23, Theorem 1]. We give full details in Appendix B.1. Note that, for this discussion, the LHS of the above is equal to when we apply this lemma to the data batch and subroutine .
In fact, running is infeasible. Instead, we show that the output is close to the output of . This in turn can be accomplished using the fact that bounded stability implies bounded variance. Concretely, we use the vector valued version of McDiarmid’s inequality.
Lemma 4.
[32, Lemma 6] Let $\mathcal{A}$ be deterministic and $\Delta$-uniform argument stable with respect to $\|\cdot\|$. Then its output satisfies $\mathbb{E}\Big[\big\|\mathcal{A}(S)-\mathbb{E}_{\hat{S}\sim\mathcal{D}^{n}}\big[\mathcal{A}(\hat{S})\big]\big\|^{2}\Big]\leq n\Delta^{2}$. (Footnote 5: Although stated therein for the distance, the last step of their proof shows a squared distance bound can be obtained.)
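As a sanity check on the form of this bound (an illustration of ours, not part of the paper's argument), consider the hypothetical algorithm $\mathcal{A}(S)=\frac{1}{n}\sum_{i=1}^{n}x_{i}$ with scalar data $x_{i}\in[-1,1]$. Replacing one data point moves the output by at most $\frac{2}{n}$, so $\Delta=\frac{2}{n}$, and the variance of the sample mean can be compared with $n\Delta^{2}$ directly:

```latex
\mathbb{E}\Big[\big\|\mathcal{A}(S)-\mathbb{E}_{\hat{S}\sim\mathcal{D}^{n}}[\mathcal{A}(\hat{S})]\big\|^{2}\Big]
  = \frac{\operatorname{Var}(x)}{n}
  \le \frac{1}{n}
  \le \frac{4}{n}
  = n\Big(\frac{2}{n}\Big)^{2}
  = n\Delta^{2}.
```

The sample mean therefore satisfies the lemma with room to spare, and both sides scale as $\frac{1}{n}$, so the $n\Delta^{2}$ rate is of the right order for this example.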
Observe that the exact empirical saddle point is a deterministic quantity conditioned on the randomness of the ’th empirical objective. Using the fact that -regularization implies \big{(}\frac{L}{2^{t}\lambda n^{\prime}}\big{)}-stability of the empirical saddle point in conjunction with the above lemma, we obtain a (conditional) variance bound of . Under the setting of , we can ultimately prove that the distance between the output of and (at round ) is . Since the strong gap of with respect to is at most by Lemma 3 (recall we here assume for simplicity) and is -SC/SC, the output of must in turn be close to the population saddle point. Specifically, this distance is also bounded as \big{(}\frac{\Delta L}{2^{t}\lambda}\big{)}^{1/2}=\frac{L}{\sqrt{2^{t}\lambda n^{\prime}}}\frac{1}{\sqrt{2^{t}\lambda}}=O(\frac{B}{2^{t}}). Thus we ultimately have that the distance between and the population saddle point of , , satisfies . These ideas also lead to a bound , although the argument in this case is more technical and thus deferred to the full proof.
The upshot of this analysis is that as the level of regularization increases, the distance of the iterates to the their respective population minimizers decreases in kind. One consequence of this fact is that , and thus by the Lipschitzness of the gap function, the output of recursive regularization has a gap bound close to that of ]. Turning now towards the utility of , using the fact that is convex-concave we have
[TABLE]
Further, an expression for be obtained using the definition of :
[TABLE]
where is the saddle operator of . Plugging the latter into the former and using Cauchy-Schwarz inequality, the triangle inequality, and the fact that is the exact saddle point of , one can obtain a bound on the gap in terms of the distances discussed previously.
[TABLE]
where step comes from a triangle inequality and step is obtained from a series of algebraic manipulations which are expanded upon in the full proof. Finally, in the case where , extra steps are required to bound the distance of output of to the exact saddle point of . This is accomplished using the SC/SC property of and the -relative accuracy guarantee of .
4 Optimal Strong Gap Rate for DP-SSP
With the guarantees of recursive regularization established, what remains is to show there exist $(\epsilon,\delta)$-DP algorithms which achieve a sufficient accuracy on the empirical objective. Note this suffices to make the entire recursive regularization algorithm private.
Theorem 2.
Let the subroutine used in Algorithm 1 be $(\epsilon,\delta)$-DP. Then Algorithm 1 is $(\epsilon,\delta)$-DP.
This follows simply from post-processing and the parallel composition theorem for differential privacy, since each run of the subroutine uses a disjoint partition of the dataset.
4.1 Efficient algorithm for the non-smooth setting
In the non-smooth setting, one can obtain optimal rates on the empirical gap using noisy stochastic gradient descent ascent (noisy SGDA). We give this algorithm in detail in Appendix C.2. More briefly, noisy SGDA starts at and takes parameters , where is the number of iterations and is the learning rate. New iterates are obtained via the update rule , where are i.i.d. Gaussian noise vectors and is a minibatch sampled uniformly with replacement from . The algorithm then returns the average iterate, . Noisy SGDA can be used to obtain the following result.
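The noisy SGDA loop described above can be sketched as follows. This is a minimal illustration under assumptions of ours: the constraint set is the box $[-\text{radius},\text{radius}]^{d}$, the average is taken over the iterates before each update, and the noise scale `sigma` is left as a free parameter rather than calibrated to a privacy budget (the paper's exact parameter settings are in its Appendix C.2 and are not reproduced here). The callable `saddle_op` and the toy operator `quad_op` are hypothetical names.

```python
import numpy as np

def noisy_sgda(data, saddle_op, z0, num_iters, eta, sigma, batch_size, radius=1.0, seed=0):
    """Projected stochastic gradient descent ascent with Gaussian noise; returns the average iterate."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    z = np.array(z0, dtype=float)
    iterates = []
    for _ in range(num_iters):
        iterates.append(z.copy())
        # Minibatch sampled uniformly with replacement from the dataset.
        idx = rng.integers(0, len(data), size=batch_size)
        g = np.mean([saddle_op(z, x) for x in data[idx]], axis=0)
        noise = rng.normal(0.0, sigma, size=z.shape)  # i.i.d. Gaussian noise vector
        z = np.clip(z - eta * (g + noise), -radius, radius)
    return np.mean(iterates, axis=0)

def quad_op(z, x):
    """Saddle operator (df/dw, -df/dtheta) of the toy loss 0.5*(w - x)**2 - 0.5*(theta - x)**2."""
    return z - x

z_avg = noisy_sgda(np.full(20, 0.3), quad_op, [0.0, 0.0],
                   num_iters=200, eta=0.5, sigma=0.0, batch_size=4)
print(z_avg)
```

With `sigma=0.0` and constant data the run is deterministic, which makes the averaging behaviour easy to inspect; a private run would set `sigma` from the sensitivity of the minibatch gradient and the target $(\epsilon,\delta)$.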
Lemma 5.
There exists an -DP algorithm which satisfies -relative accuracy with and runs in gradient evaluations.
Applying Theorem 1 then yields a near optimal rate on the strong gap.
Corollary 1.
There exists an algorithm which is $(\epsilon,\delta)$-DP, has gradient evaluations bounded by $O\big(\min\big\{\frac{n^{2}\epsilon^{1.5}}{\log(n)\sqrt{d\log(1/\delta)}},\frac{n^{3/2}}{\sqrt{\log(n)}}\big\}\big)$, and satisfies
[TABLE]
4.2 Near linear time algorithm for the smooth setting
In the smooth setting, we can achieve the optimal rate in nearly linear time. Our result leverages accelerated algorithms for smooth and strongly convex-strongly concave saddle point problems [20, 31].
Lemma 6.
(JST [20, Theorem 3, Corollary 41]) Let be -smooth and . Let both and be -strongly convex and -smooth functions for some and constants . Consider the objective . Then there exists an algorithm which finds an approximate saddle point of with empirical gap at most in gradient evaluations, where .
Given this, we consider the following implementation of . Define to be the saddle point of for all . At round , find a point such that \underset{}{\mathbb{E}}\left[\|[\hat{w}_{t},\hat{\theta}_{t}]-[w_{S,t}^{*},\theta_{S,t}^{*}]\|^{2}\right]\leq\big{(}\frac{\delta}{5}\cdot\frac{L}{2^{t}\lambda n^{\prime}}\big{)}^{2}. We can find this point efficiently using the algorithm from [20] referenced above. Then output where and . This implementation gives us the following result.
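The final step of this implementation, adding noise to an approximate saddle point, follows the output perturbation pattern. The sketch below uses the classical Gaussian mechanism calibration $\sigma=\Delta\sqrt{2\ln(1.25/\delta)}/\epsilon$, which is a standard textbook choice valid for $\epsilon<1$ and is an assumption here; the paper's exact noise scale and constants are not reproduced. The function names and the numeric values in the usage lines are hypothetical.

```python
import math
import numpy as np

def gaussian_sigma(sensitivity, eps, delta):
    """Classical Gaussian-mechanism noise scale (assumes 0 < eps < 1 and 0 < delta < 1)."""
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / eps

def privatize(z, sensitivity, eps, delta, rng):
    """Add isotropic Gaussian noise to a vector whose L2 sensitivity is at most `sensitivity`."""
    sigma = gaussian_sigma(sensitivity, eps, delta)
    z = np.asarray(z, dtype=float)
    return z + rng.normal(0.0, sigma, size=z.shape), sigma

# Hypothetical numbers: sensitivity taken from a stability bound of the form 2L / (lam * n).
rng = np.random.default_rng(0)
lipschitz, lam, n_batch = 1.0, 0.5, 1000
sensitivity = 2.0 * lipschitz / (lam * n_batch)
z_private, sigma = privatize([0.2, -0.1], sensitivity, eps=0.5, delta=1e-5, rng=rng)
print(z_private, sigma)
```

Because the sensitivity shrinks as the regularization weight grows, the noise needed in later rounds is smaller, which matches the intuition that later subproblems are easier.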
Theorem 3.
Let be as described above. Then Algorithm 1 is -DP and when run with satisfies
[TABLE]
and runs in at most gradient evaluations with .
Proof of Theorem 3.
In the following, we start by proving the privacy guarantee. Then, we prove the utility guarantee, and finish by verifying the running time of the algorithm.
Privacy Guarantee: Consider any and fix . The stability of the regularized saddle point at round , , is then by Lemma 1. Since guarantees that , we have by Markov’s inequality that with probability at least that . Thus with probability at least , generating satisfies uniform argument stability. Thus Gaussian noise of scale ensures the round is -DP. Parallel composition then ensures the entire algorithm is -DP since each phase acts on a disjoint partition of the dataset.
Utility Guarantee: We now turn to the accuracy guarantee. Specifically, we leverage the generalized convergence guarantee of Algorithm 1 given by Theorem 5 in Appendix B. This theorem guarantees that so long as the distance condition is satisfied for all , one obtains convergence guarantee . That is, after the distance guarantee is established, the rest of the analysis (i.e. the proof of Theorem 5) follows the same lines as in the non-smooth case. Note under the setting of in Theorem 3 we have
[TABLE]
Thus all that remains is to show that the distance condition, , is satisfied for all . In this regard we have,
[TABLE]
For the first inequality, observe that the noise vector is uncorrelated with the vectors, and . For the second inequality note . Further, is bounded due to the chosen implementation of . The third inequality comes from the settings of and the fact that . The last inequality uses the fact that .
Running Time: One can ensure that overall algorithm runs in nearly linear time by leveraging accelerated methods to find the point . The description of requires that at each phase , one has \underset{}{\mathbb{E}}\left[\|[\hat{w}_{t},\hat{\theta}_{t}]-[w_{S,t}^{*},\theta_{S,t}^{*}]\|^{2}\right]\leq\big{(}\frac{\delta}{5}\cdot\frac{L}{2^{t}\lambda n^{\prime}}\big{)}^{2}, which by Lemma 2 is satisfied if the empirical gap is at most \lambda\big{(}\frac{\delta}{5}\cdot\frac{L}{2^{t}\lambda n^{\prime}}\big{)}^{2}=\frac{\delta^{2}}{25}\cdot\frac{L^{2}}{2^{2t}\lambda(n^{\prime})^{2}}. For simplicity, we observe that
[TABLE]
We now apply Lemma 6 with , , and for some sufficiently small constant . This gives that the running time of phase is , where . Running this implementation of each phase incurs an extra factor of , giving the claimed running time bound of , where . ∎
5 On the Limitations of Previous Approaches
Prior work on DP-SSPs has largely focused on the weak gap criterion. In this section, we provide further investigation into both the importance and the challenges of bounding the strong gap rather than the weak gap. We start by considering a natural question: do there exist cases where the strong and weak gap differ substantially? We answer this question affirmatively in the following.
Proposition 1.
There exists a convex-concave function with range and algorithm such that .
Our construction shows that this result holds even for a simple one dimensional bilinear problem.
Proof.
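Consider the loss function $f(w,\theta;x)=w\theta$, where $\mathcal{W}=\Theta=[-1,1]$. Let $\mathcal{D}$ be the uniform distribution over $\{\pm 1\}$. For $S\sim\mathcal{D}^{n}$ consider the algorithm $\mathcal{A}$ which outputs $\mathcal{A}_{w}(S)$ as the mode of the first half of the samples in $S$ and similarly $\mathcal{A}_{\theta}(S)$ is set as the mode of the second half of the samples in $S$. (Footnote 6: Without much loss of generality, we assume that $n$ is divisible by 2 but not by 4, so that the mode of each half of the data is well-defined and belongs to $\{\pm 1\}$.) Note $\mathcal{A}_{w}(S)$ and $\mathcal{A}_{\theta}(S)$ are independent and distributed uniformly over $\{\pm 1\}$ (under the randomness from $S$).
Now, since $\mathcal{A}$ is a deterministic function of the dataset, the randomness in $\mathcal{A}$ comes only from $S$. Thus for the weak gap we have $\mathrm{Gap}_{\mathrm{weak}}(\mathcal{A})=\max_{\theta\in[-1,1]}\mathbb{E}_{S}[\mathcal{A}_{w}(S)]\,\theta-\min_{w\in[-1,1]}w\,\mathbb{E}_{S}[\mathcal{A}_{\theta}(S)]$, which evaluates to $0$ because both expectations are zero. However, one can see for the strong gap we have $\mathrm{Gap}(\mathcal{A})=\mathbb{E}_{S}\big[|\mathcal{A}_{w}(S)|+|\mathcal{A}_{\theta}(S)|\big]=2$, where the first equality comes from evaluating $\theta=\mathcal{A}_{w}(S)$ and $w=-\mathcal{A}_{\theta}(S)$ in the maximization and minimization operators. ∎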
Observe that the generalization error w.r.t. the strong gap of this algorithm is always $0$ because the loss function does not depend on the random sample from $\mathcal{D}$. The discrepancy between the gaps instead comes from the fact that having the expectation w.r.t. $S$ inside the max/min changes the function over which the dual/primal adversary is maximizing/minimizing. Specifically, note here that the weak gap measures the ability of $\theta$ to maximize the function $\theta\mapsto\mathbb{E}_{S}[\mathcal{A}_{w}(S)]\cdot\theta$ with $\mathbb{E}_{S}[\mathcal{A}_{w}(S)]=0$, but note $\mathcal{A}_{w}(S)=0$ does not occur for any realization of the dataset $S$.
One might further observe that a key attribute of this construction is the high variance of the parameter vectors. One can show such behavior is in fact necessary to see such a separation; the full proof of the following statement is given in Appendix D.1.
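The separation between the two gaps can be verified by direct enumeration. The sketch below assumes a one-dimensional bilinear instance: the loss $f(w,\theta)=w\theta$ on $[-1,1]^{2}$ and an algorithm whose two outputs are independent and uniform over $\{-1,+1\}$; these specifics are assumptions of the example. Since $f$ is linear in each argument, the inner max and min are attained at the endpoints $\pm 1$, so a two-point grid suffices.

```python
import itertools

def gaps_for_bilinear_example():
    """Return (weak gap, strong gap) for f(w, theta) = w * theta with outputs uniform on {-1, 1}^2."""
    outputs = list(itertools.product([-1.0, 1.0], repeat=2))  # equally likely (w, theta) pairs
    endpoints = [-1.0, 1.0]                                   # extreme points of [-1, 1]
    prob = 1.0 / len(outputs)

    def f(w, theta):
        return w * theta

    # Weak gap: the expectation over the algorithm's output sits inside the max and the min.
    weak = (max(sum(prob * f(w, th) for w, _ in outputs) for th in endpoints)
            - min(sum(prob * f(v, th) for _, th in outputs) for v in endpoints))
    # Strong gap: the max and min are taken for each realized output, then averaged.
    strong = sum(prob * (max(f(w, t) for t in endpoints) - min(f(v, th) for v in endpoints))
                 for w, th in outputs)
    return weak, strong

weak_gap, strong_gap = gaps_for_bilinear_example()
print(weak_gap, strong_gap)
```

The weak gap vanishes because the adversary only sees the averaged output, which is zero, while every realized output leaves each player a deviation worth 1.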
Proposition 2**.**
Let be an algorithm such that then if is -Lipschitz it holds that
Tradeoff between Accuracy and Stability
An additional consequence of Proposition 2 (in conjunction with Lemma 5) is that -uniform argument stability implies a generalization bound w.r.t. the strong gap that does not rely on smoothness (in contrast to the bound of [30], which does). We leave determining tight stability-implies-generalization bounds for the strong gap as an interesting direction for future work. In this section, however, we show that stronger upper bounds are likely necessary to obtain a more direct algorithm for DP-SSPs. In fact, our key result holds even for empirical risk minimization (ERM) problems. That is, for and , consider the problem of minimizing the excess empirical risk , where . We have the following.
Theorem 4.
For any (possibly randomized) algorithm which is -uniform argument stable, there exists a [math]-smooth -Lipschitz loss function, , and dataset such that provided .
The proof can be found in Appendix D.2. Lemma 1 shows this bound is tight for both ERM and empirical saddle point problems. Generalization bounds are only useful when it is possible to obtain good empirical performance. Thus, the implication of this bound is that generalization error which is is necessary to obtain the optimal statistical rate. To elaborate, let characterize some (potentially suboptimal) generalization bound for stable algorithms and assume . To then bound the sum of empirical risk and generalization error, Theorem 4 implies Note the RHS is asymptotically larger than (i.e. not optimal) for any .
Acknowledgements
RB’s and MM’s research is supported by NSF CAREER Award 2144532 and NSF Award AF-1908281. CG’s research was partially supported by INRIA Associate Teams project, FONDECYT 1210362 grant, ANID Anillo ACT210005 grant, and National Center for Artificial Intelligence CENIA FB210017, Basal ANID.
Appendix A Supporting Proofs from Preliminaries
A.1 Lipschitzness of the Gap Function
Proof of Fact 1.
For any we have
[TABLE]
where we used in the last inequality that . ∎
A.2 Local Privacy
In the case of local differential privacy (LDP), a simple implementation of noisy SGDA (see Appendix C.1) suffices to obtain the optimal rate. We refer the reader to DJW [11] for a discussion of LDP and the matching lower bound. Consider the implementation of SGDA which defines the saddle estimator as
[TABLE]
where and and is sampled without replacement from . By Lemma 9 we have the following.
Corollary 2.
Let . Then the algorithm described above, denoted as , is -LDP and if the average iterate, , satisfies
Appendix B Missing Results from Section 3
B.1 Proof of Lemma 3
The first inequality follows from an application of Jensen’s inequality.
[TABLE]
The second inequality in the lemma statement then follows from the stability-implies-generalization result for the weak gap, which we restate below.
Lemma 7 ([23, Theorem 1], [7, Proposition 2.1]).
Let the loss function be -Lipschitz and the algorithm be -uniform argument stable. Then
B.2 Convergence of Recursive Regularization
In this section we prove the following more general statement of Theorem 1, which will be useful later.
Theorem 5.
Let and be such that for all it holds that . Then Recursive Regularization satisfies
[TABLE]
To prove this result, it will be helpful to first show several intermediate results. We start by defining several useful quantities. Define as the filtration where is the sigma algebra induced by all randomness up to . For every we define
- saddle point of ;
- saddle point of ;
- $[\widetilde{w}_{t},\widetilde{\theta}_{t}] := \mathbb{E}\left[[w^{*}_{S,t},\theta^{*}_{S,t}] \,\Big|\, \mathcal{F}_{t-1}\right]$;
- the gap function w.r.t. ; and,
- the empirical gap function.
We now establish two distance inequalities which will be used when analyzing the final gap bound in Theorem 5. The first inequality bounds the distance of the output of the -th round to the minimizer of . The second inequality bounds the distance of the minimizer of to the most recent regularization point.
Lemma 8.
Assume the conditions of Theorem 5 hold. Then for every , the following holds
- P.1 ; and,
- P.2 .
Proof.
We will prove both properties via induction on . Specifically, for each we will introduce three terms , and show that these terms are bounded if the bound on holds and that holds if are bounded. Property P.1 is then established as a result of the fact that . Note that holds as the base case because .
Property P.1:
We here prove that if is sufficiently bounded, then are bounded where for we define
[TABLE]
Additionally, this will establish property P.1 because for any it holds that,
[TABLE]
The second inequality comes from the strong convexity-strong concavity of the loss.
Bounding : We have that is bounded by the assumption made in the statement of Theorem 5.
Bounding :
[TABLE]
The first inequality comes from the stability of the regularized minimizer and Lemma 5. The second inequality comes from the setting of .
Bounding : We have
[TABLE]
The first equality comes from the definition of . The first inequality comes from Lemma 3, where we consider the algorithm stated in the lemma to be the algorithm which outputs the exact regularized minimizer. Note this algorithm is stable. The second equality comes from the fact that is the exact empirical saddle point. The final inequality uses the same analysis as in Eqn. (7).
We thus have a final bound .
Property P.2:
Now assume holds. We have
[TABLE]
Above and are as defined in (5). We bound the remaining squared distance term in the following. First, note that the primal function is strongly convex and it holds that . Similar facts hold for . Thus we have
[TABLE]
The second inequality comes from removing the negative norm terms. The third inequality comes from the definition of and . The second to last inequality comes from the definition of , as given in Eqn. (5). Plugging this result into (8) and using the previously established bounds on (which hold under the assumed bound on ) we have
[TABLE]
∎
We now turn to analyzing the utility of the algorithm to complete the proof.
Proof of Theorem 5.
Using the fact that is -Lipschitz and property P.1, we have
[TABLE]
What remains is to show that is . Let and . Using the fact that is convex-concave we have
[TABLE]
where is the population loss saddle operator. Further by the definition of and denoting as the saddle operator for we have
[TABLE]
Thus plugging the above into Eqn. (10) we have
[TABLE]
Above, the second inequality comes from the first-order optimality conditions for , the third from Cauchy-Schwarz and the triangle inequality. The final equality uses the definition of the Euclidean norm and the fact that for any , .
Taking the expectation on both sides of the above we have the following derivation,
[TABLE]
Above, and the following inequality both come from the triangle inequality. Equality is obtained by rearranging the sums. Inequality comes from applying properties P.1 and P.2 proved above. The last equality comes from the setting of and .
Now using this result in conjunction with Eqn. (9) we have
[TABLE]
Above we use the fact that and , and thus . ∎
Finally, we prove Theorem 1 leveraging the relative accuracy assumption.
Proof of Theorem 1.
First, observe that under the setting of used in the theorem statement that . Thus what remains is to show that the distance condition required by Theorem 5 holds. That is, we now show that if satisfies -relative accuracy, then for all it holds that .
To prove this property, we must leverage the induction argument made in Lemma 8. Specifically, to prove the condition holds for some , assume (recall the base case for trivially holds). As shown in the proof of Lemma 8, this implies that the quantities (as defined in Eqn. (5)) are bounded by . We thus have
[TABLE]
where is as defined in property P.2. Inequality comes from Lemma 2. Inequality comes from the -relative accuracy assumption on , and the fact that each is -Lipschitz. That is, observe
[TABLE]
Inequality comes from a triangle inequality and the definition of and . Inequality comes from the induction hypothesis (specifically property P.2) and the bounds on and established above. The last inequality in Eqn. (B.2) comes from the setting . ∎
Appendix C Missing Results from Section 4
C.1 Stochastic Gradient Descent Ascent (SGDA)
Let have saddle operator and associated strong gap . We define the SGDA algorithm in the following manner. Let . Let be any vector in . SGDA uses the following update rule. For let be a random vector (which may depend on and ) that is an unbiased estimate of conditional on and has bounded variance. We define
[TABLE]
where is the orthogonal projection onto . The output of SGDA is defined to be
[TABLE]
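The update rule and averaged output described above can be sketched as follows. This is a minimal illustration under assumptions rather than the authors' implementation: the feasible sets are taken to be Euclidean balls, the output is the average of the iterates at which the oracle is queried, and the demo uses the bilinear loss f(w, θ) = wθ with Gaussian oracle noise. The names `project_ball`, `sgda`, `make_bilinear_oracle`, and `run_demo` are hypothetical.

```python
import math
import random


def project_ball(v, radius):
    """Euclidean projection of the vector v onto the ball of the given radius."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm <= radius:
        return list(v)
    return [x * radius / norm for x in v]


def sgda(oracle, w0, theta0, steps, eta, radius=1.0):
    """Projected SGDA; returns the average of the queried iterates.

    `oracle(w, theta)` must return an estimate of the saddle operator
    (grad_w f, -grad_theta f), so both blocks are updated by subtraction.
    """
    w, theta = list(w0), list(theta0)
    avg_w = [0.0] * len(w)
    avg_theta = [0.0] * len(theta)
    for _ in range(steps):
        # Accumulate the running average of the current iterate.
        avg_w = [a + x / steps for a, x in zip(avg_w, w)]
        avg_theta = [a + x / steps for a, x in zip(avg_theta, theta)]
        # Query the oracle once, then update both blocks simultaneously.
        g_w, g_theta = oracle(w, theta)
        w = project_ball([x - eta * g for x, g in zip(w, g_w)], radius)
        theta = project_ball([x - eta * g for x, g in zip(theta, g_theta)], radius)
    return avg_w, avg_theta


def make_bilinear_oracle(rng, noise_std):
    """Noisy saddle operator of the assumed loss f(w, theta) = w * theta."""
    def oracle(w, theta):
        g_w = [theta[0] + rng.gauss(0.0, noise_std)]   # grad_w f = theta
        g_theta = [-w[0] + rng.gauss(0.0, noise_std)]  # -grad_theta f = -w
        return g_w, g_theta
    return oracle


def run_demo(steps=4000, seed=0):
    """Run SGDA on the bilinear example and return the strong gap of the average."""
    rng = random.Random(seed)
    oracle = make_bilinear_oracle(rng, noise_std=0.1)
    avg_w, avg_theta = sgda(oracle, [1.0], [1.0], steps, eta=1.0 / math.sqrt(steps))
    # For w * theta on [-1, 1]^2 the gap at (w, theta) is |w| + |theta|.
    return abs(avg_w[0]) + abs(avg_theta[0])


if __name__ == "__main__":
    print(f"gap of the average iterate: {run_demo():.4f}")
```

The demo reports the gap of the averaged iterate rather than the last iterate: on bilinear problems the individual iterates of gradient descent ascent rotate around the saddle point, and it is the average that approaches it.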
We have the following result for the convergence of SGDA.
Lemma 9.
Assume that and , then the algorithm, , that is SGDA run with parameters satisfies for any and ,
[TABLE]
This result is somewhat implicit in YHL+ [36, Lemma 3], but for completeness we provide a short proof here.
Proof.
By the convexity-concavity of we have for any that
[TABLE]
and thus taking the expectation (conditional on ) and using the fact that each is unbiased we have
[TABLE]
Using and the fact that the projection is nonexpansive, we have
[TABLE]
where in the first equality we use that , due to the unbiasedness of the stochastic oracle.
Summing over all iterations and taking the average we obtain for the average iterate, , and any that
[TABLE]
∎
C.2 Private algorithm for the empirical gap (Noisy SGDA)
We here provide an implementation of SGDA (see Appendix C.1 above) which is differentially private and yields convergence guarantees for the empirical gap. Let each be a batch of samples, each sampled uniformly with replacement from . Let for some universal constant and each be sampled i.i.d. from . We define
[TABLE]
Notice that as defined above satisfies the assumptions for Lemma 9 with respect to the empirical saddle operator, , for some finite .
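A stochastic oracle of this form can be sketched as below. This is a hedged illustration, not the authors' implementation: the noise scale `sigma` is left as a caller-supplied parameter because the paper's formula for it is not legible in this copy, so the sketch makes no privacy claim on its own. The names `make_noisy_oracle` and `saddle_op` are hypothetical; `saddle_op(w, theta, x)` is assumed to return the per-sample saddle operator as a pair of lists.

```python
import random


def make_noisy_oracle(dataset, saddle_op, batch_size, sigma, rng):
    """Build a noisy minibatch estimate of the empirical saddle operator.

    Each call draws `batch_size` points uniformly with replacement from
    `dataset`, averages the per-sample saddle operator over the batch, and
    adds independent Gaussian noise of standard deviation `sigma` to every
    coordinate.
    """
    def oracle(w, theta):
        batch = [rng.choice(dataset) for _ in range(batch_size)]
        g_w = [0.0] * len(w)
        g_theta = [0.0] * len(theta)
        for x in batch:
            gw_x, gt_x = saddle_op(w, theta, x)
            g_w = [a + b / batch_size for a, b in zip(g_w, gw_x)]
            g_theta = [a + b / batch_size for a, b in zip(g_theta, gt_x)]
        # Zero-mean noise keeps the estimate unbiased for the empirical operator.
        g_w = [a + rng.gauss(0.0, sigma) for a in g_w]
        g_theta = [a + rng.gauss(0.0, sigma) for a in g_theta]
        return g_w, g_theta
    return oracle


if __name__ == "__main__":
    # Hypothetical per-sample loss f(w, theta; x) = x * w * theta.
    def op(w, theta, x):
        return [theta[0] * x], [-w[0] * x]

    oracle = make_noisy_oracle([1.0, -1.0, 1.0, 1.0], op, 2, 0.5, random.Random(0))
    print(oracle([1.0], [1.0]))
```

Because both the uniform sampling and the added noise are mean-preserving, averaging many oracle calls at a fixed point recovers the empirical saddle operator, which is the unbiasedness property Lemma 9 requires.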
We have the following result for SGDA run with this stochastic oracle.
Theorem 6.
Let such that . Let be the algorithm SGDA run with as described above, , and . Algorithm is -DP, has gradient complexity , and satisfies
[TABLE]
The proof of the utility guarantee follows directly from applying Lemma 9 with . The proof of the privacy guarantee relies on the moments accountant analysis, for which we provide the following restatement.
Theorem 7 ([2, 22]).
Let and be a universal constant. Let be a dataset over some domain , and let be a series of (possibly adaptive) queries such that for any , , . Let and . Then the algorithm which samples batches of size uniformly at random and outputs for all where , is -DP.
It can be verified for the described noisy SGDA implementation that and and thus the algorithm is -DP.
Appendix D Missing Result from Section 5
D.1 Low variance and weak gap implies strong gap
Proof of Proposition 2.
Consider the virtual algorithm, . Note this algorithm is deterministic and does not depend on any specific dataset drawn from . We first show that the gap function at the output of is bounded by the weak gap of . We have
[TABLE]
where the second equality follows from the definition of and the inequality follows from Jensen’s inequality.
Now by the assumption that is low variance, we have
[TABLE]
Thus using the Lipschitzness of we obtain
[TABLE]
The first inequality comes from Eqn. (15). The second inequality comes from the Lipschitzness of the gap function. The third inequality comes from Eqn. (16). Thus we ultimately have
[TABLE]
∎
D.2 Stability-Risk Tradeoff
Proof of Theorem 4.
Let . Let be a parameter to be chosen later and define . For any define , where is the ’th standard basis vector. We will denote . Note that
[TABLE]
Further, for any , .
By Yao’s minimax principle, it suffices to consider deterministic algorithms and lower bound the expected risk w.r.t. some distribution over the packing. Considering the uniform distribution over the packing and setting we have
[TABLE]
where comes from the definition of the loss function and the fact that the dataset consists of standard basis vectors (up to sign) and zero vectors and comes from the stability property of (i.e. ). Finally, note that by the setting of that . ∎
References
- [1] ABG+22. Raman Arora, Raef Bassily, Cristóbal Guzmán, Michael Menart, and Enayat Ullah. Differentially private generalized linear models revisited. In Advances in Neural Information Processing Systems, volume 35. Curran Associates, Inc., 2022.
- [2] ACG+16. Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. CCS ’16, pages 308–318, New York, NY, USA, 2016. Association for Computing Machinery.
- [3] AFKT21. Hilal Asi, Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimization: Optimal rates in $\ell_1$ geometry. In International Conference on Machine Learning, 2021.
- [4] AZ18. Zeyuan Allen-Zhu. How to make the gradients small stochastically: Even faster convex and nonconvex SGD. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.
- [5] BE02. Olivier Bousquet and André Elisseeff. Stability and generalization. The Journal of Machine Learning Research, 2:499–526, 2002.
- [6] BFTT19. Raef Bassily, Vitaly Feldman, Kunal Talwar, and Abhradeep Guha Thakurta. Private stochastic convex optimization with optimal rates. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché-Buc, Emily B. Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 11279–11288, 2019.
- [7] BG23. Digvijay Boob and Cristóbal Guzmán. Optimal algorithms for differentially private stochastic monotone variational inequalities and saddle-point problems. Mathematical Programming, pages 1–43, 2023.
- [8] BGN21. Raef Bassily, Cristobal Guzman, and Anupama Nandi. Non-Euclidean differentially private stochastic convex optimization. In Mikhail Belkin and Samory Kpotufe, editors, Proceedings of Thirty Fourth Conference on Learning Theory, volume 134 of Proceedings of Machine Learning Research, pages 474–499. PMLR, 15–19 Aug 2021.
