Generalization Error Bounds for Noisy, Iterative Algorithms via Maximal Leakage
Ibrahim Issa, Amedeo Roberto Esposito, Michael Gastpar

TL;DR
This paper introduces an information-theoretic approach using maximal leakage to analyze the generalization error of iterative, noisy algorithms like SGLD, providing explicit bounds and insights on noise and update functions.
Contribution
It develops a semi-closed form bound on maximal leakage for noisy iterative algorithms, linking update function properties and noise to generalization performance.
Findings
Bound on maximal leakage with Gaussian noise and bounded update functions
Explicit tight bounds for various scenarios
Insights on optimal noise choice for minimizing leakage
Ibrahim Issa, American University of Beirut, Lebanon, and École Polytechnique Fédérale de Lausanne, Switzerland
Amedeo Roberto Esposito, Institute of Science and Technology Austria
Michael Gastpar, École Polytechnique Fédérale de Lausanne, Switzerland
Abstract
We adopt an information-theoretic framework to analyze the generalization behavior of the class of iterative, noisy learning algorithms. This class is particularly suitable for study under information-theoretic metrics as the algorithms are inherently randomized, and it includes commonly used algorithms such as Stochastic Gradient Langevin Dynamics (SGLD). Herein, we use the maximal leakage (equivalently, the Sibson mutual information of order infinity) metric, as it is simple to analyze, and it implies both bounds on the probability of having a large generalization error and on its expected value. We show that, if the update function (e.g., gradient) is bounded in $L_2$-norm and the additive noise is isotropic Gaussian noise, then one can obtain an upper bound on maximal leakage in semi-closed form. Furthermore, we demonstrate how the assumptions on the update function affect the optimal (in the sense of minimizing the induced maximal leakage) choice of the noise. Finally, we compute explicit tight upper bounds on the induced maximal leakage for other scenarios of interest.
keywords:
Noisy iterative algorithms, generalization error, maximal leakage, Gaussian noise
1 Introduction
One of the key challenges in machine learning research concerns the "generalization" behavior of learning algorithms. That is: if a learning algorithm performs well on the training set, what guarantees can one provide on its performance on new samples?
While the question of generalization is understood in many settings (Bousquet et al., 2003; Shalev-Shwartz and Ben-David, 2014), existing bounds and techniques provide vacuous expressions when employed to show the generalization capabilities of deep neural networks (DNNs) (Bartlett et al., 2017, 2019; Jiang et al., 2020; Zhang et al., 2021). In general, classical measures of model expressivity (such as Vapnik-Chervonenkis (VC) dimension (Vapnik and Chervonenkis, 1991), Rademacher complexity (Bartlett and Mendelson, 2003), etc.) fail to explain the generalization abilities of DNNs because these are typically over-parameterized models with less training data than model parameters. A novel approach was introduced by Russo and Zou (2016) and Xu and Raginsky (2017) (further developed by Steinke and Zakynthinou (2020); Bu et al. (2020); Esposito et al. (2021); Esposito and Gastpar (2022) and many others), where information-theoretic techniques are used to link the generalization capabilities of a learning algorithm to information measures. These quantities are algorithm-dependent and can be used to analyze the generalization capabilities of general classes of updates and models, e.g., noisy iterative algorithms such as Stochastic Gradient Langevin Dynamics (SGLD) (Pensia et al., 2018; Wang et al., 2021), and can thus be applied to deep learning settings. Moreover, it has been shown that information-theoretic bounds can be non-vacuous and reflect the real generalization behavior even in deep learning settings (Dziugaite and Roy, 2017; Zhou et al., 2018; Negrea et al., 2019; Haghifam et al., 2020).
In this work we adopt and expand the framework introduced by Pensia et al. (2018), but instead of focusing on the mutual information between the input and output of an iterative algorithm, we compute the maximal leakage (Issa et al., 2020). Maximal leakage, together with other information measures of the Sibson/Rényi family (maximal leakage can be shown to be Sibson mutual information of order infinity (Issa et al., 2020)), has been linked to high-probability bounds on the generalization error (Esposito et al., 2021). In particular, given a learning algorithm $\mathcal{A}$ trained on data-set $S$ (made of $n$ samples), one can provide the following guarantee in the case of the $0$-$1$ loss:
$$\mathbb{P}\left(|\text{gen-err}(\mathcal{A},S)| \geq \eta\right) \leq 2\exp\left(-2n\eta^2 + \mathcal{L}(S\to\mathcal{A}(S))\right), \tag{1}$$
where $\mathcal{L}(S\to\mathcal{A}(S))$ is defined in equation (2) below. This deviates from much of the literature in which the focus is on bounding the expected generalization error instead (Xu and Raginsky, 2017; Steinke and Zakynthinou, 2020). Consequently, if one can guarantee that for a class of algorithms, the maximal leakage between the input and the output is bounded, then one can provide an exponentially decaying (in the number of samples $n$) bound on the probability of having a large generalization error. This is in general not true for mutual information, which can typically only guarantee a linearly decaying bound on the probability of the same event (Bassily et al., 2018). Moreover, a bound on maximal leakage implies a bound on mutual information (cf. Equation 6) and, consequently, a bound on the expected generalization error of $\mathcal{A}$ (exploiting the link between mutual information and expected generalization error (Xu and Raginsky, 2017)). The main advantage of maximal leakage lies in the fact that it depends on the distribution of the samples only through its support. It is thus naturally independent from the distribution over the samples and particularly amenable to analysis, especially in additive noise settings.
The contributions of this work can be summarized as follows:
- we derive novel bounds on $\mathcal{L}(S\to\mathcal{A}(S))$ whenever $\mathcal{A}$ is a noisy, iterative algorithm (SGLD-like), which then imply the first bounds showing generalization with high probability for said mechanisms;
- we leverage the analysis to determine the optimal type of noise to add (in the sense of minimizing the induced maximal leakage), based on the assumptions imposed on the algorithm. In particular, if one assumes the $L_\infty$ norm of the gradient to be bounded, then adding uniform noise minimizes the maximal leakage upper bound. Hence, the analysis and computation of maximal leakage can also be used to inform the design of novel noisy, iterative algorithms.
1.1 Related Work
The line of work exploiting information measures to bound the expected generalization started in (Russo and Zou, 2016; Xu and Raginsky, 2017) and was then refined with a variety of approaches considering Conditional Mutual Information (Steinke and Zakynthinou, 2020; Haghifam et al., 2020), the Mutual Information between individual samples and the hypothesis (Bu et al., 2019) or improved versions of the original bounds (Issa et al., 2019; Hafez-Kolahi et al., 2020). Other approaches employed the Kullback-Leibler Divergence with a PAC-Bayesian approach (McAllester, 2013; Zhou et al., 2018). Moreover, said bounds were then characterized for specific SGLD-like algorithms, denoted as "noisy, iterative algorithms" and used to provide novel, non-vacuous bounds for Neural Networks (Pensia et al., 2018; Negrea et al., 2019; Haghifam et al., 2020; Wang et al., 2023) as well as for SGD algorithms (Neu et al., 2021). Recent efforts tried to provide the optimal type of noise to add in said algorithms and reduce the (empirical) gap in performance between SGLD and SGD (Wang et al., 2021). All of these approaches considered the KL-Divergence or (variants of) Shannon's Mutual Information. General bounds on the expected generalization error leveraging arbitrary divergences were given in (Esposito and Gastpar, 2022; Lugosi and Neu, 2022). Another line of work considered instead bounds on the probability of having a large generalization error (Bassily et al., 2018; Esposito et al., 2021; Hellström and Durisi, 2020) and focused on large families of divergences and generalizations of the Mutual Information (in particular of the Sibson/Rényi family, including conditional versions).
2 Preliminaries, Setup, and a General Bound
2.1 Preliminaries
2.1.1 Information Measures
The main building block of the information measures considered in this work is Rényi's $\alpha$-divergence between two measures $P$ and $Q$, $D_\alpha(P\|Q)$ (which can be seen as a parametrized generalization of the Kullback-Leibler divergence) (van Erven and Harremoës, 2014, Definition 2). Starting from Rényi's divergence and the geometric averaging that it involves, Sibson built the notion of Information Radius (Sibson, 1969), which can be seen as a special case of the following quantity (Verdú, 2015):
$$I_\alpha(X;Y) = \min_{Q_Y} D_\alpha(P_{XY}\,\|\,P_X Q_Y).$$
Sibson's $I_\alpha(X;Y)$ represents a generalization of Shannon's mutual information; indeed, one has that $\lim_{\alpha\to 1} I_\alpha(X;Y) = I(X;Y)$. In contrast, when $\alpha\to\infty$, one gets:
$$I_\infty(X;Y) = \log \mathbb{E}_{P_Y}\left[\sup_{x:\,P_X(x)>0} \frac{P_{XY}(\{x,Y\})}{P_X(\{x\})\,P_Y(\{Y\})}\right] = \mathcal{L}(X\to Y), \tag{2}$$
where $\mathcal{L}(X\to Y)$ denotes the maximal leakage from $X$ to $Y$, a recently defined information measure with an operational meaning in the context of privacy and security (Issa et al., 2020). Maximal leakage represents the main quantity of interest for the scope of this paper, as it is amenable to analysis and has been used to bound the generalization error (Esposito et al., 2021). As such, we will bound the maximal leakage between the input and output of generic noisy iterative algorithms.
To that end, we mention a few useful properties of $\mathcal{L}(X\to Y)$. If $X$ and $Y$ are jointly continuous random variables, then (Issa et al., 2020, Corollary 4)
$$\mathcal{L}(X\to Y) = \log \int \sup_{x:\,f_X(x)>0} f_{Y|X}(y|x)\,dy, \tag{3}$$
where $f_{Y|X}$ is the conditional pdf of $Y$ given $X$. Moreover, maximal leakage satisfies the following chain rule (the proof of which is given in Appendix A):
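Lemma 2.1.
Given a triple of random variables $(X,Y,Z)$, then
$$\mathcal{L}(X\to (Y,Z)) \leq \mathcal{L}(X\to Y) + \mathcal{L}(X\to Z\,|\,Y), \tag{4}$$
where the conditional maximal leakage $\mathcal{L}(X\to Z\,|\,Y) = \sup_{y:\,P_Y(y)>0} \mathcal{L}(X\to Z\,|\,Y=y)$, where the latter term is interpreted as the maximal leakage from $X$ to $Z$ with respect to the distribution $P_{XZ|Y=y}$. Consequently, for random variables $(X, Y_1, \ldots, Y_n)$,
$$\mathcal{L}(X\to Y^n) \leq \sum_{i=1}^{n} \mathcal{L}(X\to Y_i\,|\,Y^{i-1}). \tag{5}$$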
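Moreover, one can relate $\mathcal{L}(X\to Y)$ to $I(X;Y)$ through $I_\alpha$. Indeed, an important property of $I_\alpha$ is that it is non-decreasing in $\alpha$, hence for every $\infty > \alpha > 1$:
$$I(X;Y) \leq I_\alpha(X;Y) \leq \mathcal{L}(X\to Y). \tag{6}$$
For more details on Sibson's $\alpha$-MI we refer the reader to (Verdú, 2015); for maximal leakage, the reader is referred to (Issa et al., 2020).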
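The continuous-alphabet characterization of maximal leakage (the log of the integral over $y$ of the supremum over $x$ of the conditional pdf) can be checked numerically on a toy channel. The sketch below is only an illustration under assumptions that are not part of the paper: a scalar channel $Y = X + N$ with $X$ supported on $[-r, r]$ and $N$ Gaussian with standard deviation `sigma`. In that case the supremum over $x$ is attained at the point of $[-r, r]$ closest to $y$, so the integral equals $1 + 2 r f(0)$. The function names are hypothetical.

```python
import numpy as np

def gaussian_pdf(u, sigma):
    """Zero-mean Gaussian density with standard deviation sigma."""
    return np.exp(-0.5 * (u / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def leakage_numeric(r, sigma, num=400_001):
    """log of the integral over y of sup_{|x| <= r} f(y - x), via the trapezoidal rule."""
    half_width = r + 12.0 * sigma              # tails beyond this are negligible
    y = np.linspace(-half_width, half_width, num)
    dist = np.maximum(np.abs(y) - r, 0.0)      # distance from y to the interval [-r, r]
    integrand = gaussian_pdf(dist, sigma)      # sup over x is attained at the closest x
    dy = y[1] - y[0]
    integral = dy * (integrand.sum() - 0.5 * (integrand[0] + integrand[-1]))
    return float(np.log(integral))

def leakage_closed_form(r, sigma):
    """log(1 + 2 r f(0)) for the same toy channel."""
    return float(np.log1p(2.0 * r / (sigma * np.sqrt(2.0 * np.pi))))
```

Note that the result depends on $X$ only through its support $[-r, r]$, which mirrors the property of maximal leakage emphasized in the introduction.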
2.1.2 Learning Setting
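Let $\mathcal{Z}$ be the sample space, $\mathcal{H}$ be the hypothesis space, and $\ell: \mathcal{H}\times\mathcal{Z}\to\mathbb{R}_+$ be a loss function. Say $\mathcal{H}\subseteq\mathbb{R}^d$. Let $S=(Z_1,\ldots,Z_n)$ consist of $n$ i.i.d. samples, where $Z_i\sim P$, with $P$ unknown. A learning algorithm $\mathcal{A}$ is a mapping $\mathcal{A}:\mathcal{Z}^n\to\mathcal{H}$ that, given a sample $S$, provides a hypothesis $h=\mathcal{A}(S)$. $\mathcal{A}$ can be either a deterministic or a randomized mapping and, undertaking a probabilistic (and information-theoretic) approach, one can then equivalently consider $\mathcal{A}$ as a family of conditional probability distributions $P_{H|S=s}$ for $s\in\mathcal{Z}^n$, i.e., an information channel. Given a hypothesis $h\in\mathcal{H}$, the true risk of $h$ is denoted as follows:
$$L_P(h) = \mathbb{E}_{Z\sim P}[\ell(h,Z)], \tag{7}$$
while the empirical risk of $h$ on $S$ is denoted as follows:
$$L_S(h) = \frac{1}{n}\sum_{i=1}^{n}\ell(h,Z_i). \tag{8}$$
Given a learning algorithm $\mathcal{A}$, one can then define its generalization error as follows:
$$\text{gen-err}(\mathcal{A},S) = L_P(\mathcal{A}(S)) - L_S(\mathcal{A}(S)). \tag{9}$$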
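Since both $S$ and $\mathcal{A}$ can be random, $\text{gen-err}(\mathcal{A},S)$ is a random variable and one can then study its expected value or its behavior in probability. Bounds on the expected value of the generalization error in terms of information measures are given in Xu and Raginsky (2017); Issa et al. (2019); Bu et al. (2019); Steinke and Zakynthinou (2020), stating different variants of the following bound (Xu and Raginsky, 2017, Theorem 1): if $\ell(h,Z)$ is $\sigma$-sub-Gaussian (a $0$-mean random variable $X$ is said to be $\sigma$-sub-Gaussian if $\log\mathbb{E}[e^{\lambda X}]\leq \sigma^2\lambda^2/2$ for every $\lambda\in\mathbb{R}$), then
$$\left|\mathbb{E}[\text{gen-err}(\mathcal{A},S)]\right| \leq \sqrt{\frac{2\sigma^2\, I(S;\mathcal{A}(S))}{n}}. \tag{10}$$
Thus, if one can prove that the mutual information between the input and output of a learning algorithm $\mathcal{A}$ trained on $S$ is bounded (ideally, growing less than linearly in $n$), then the expected generalization error of $\mathcal{A}$ will vanish with the number of samples. Alternatively, Esposito et al. (2021) demonstrate high-probability bounds, involving different families of information measures. One such bound, which is relevant to the scope of this paper, is the following (Esposito et al., 2021, Corollary 2): assume $\ell(h,Z)$ is $\sigma$-sub-Gaussian and let $\alpha>1$, then
$$\mathbb{P}\left(|\text{gen-err}(\mathcal{A},S)| \geq \eta\right) \leq 2\exp\left(-\frac{\alpha-1}{\alpha}\left(\frac{n\eta^2}{2\sigma^2} - I_\alpha(S;\mathcal{A}(S))\right)\right). \tag{11}$$
Taking the limit $\alpha\to\infty$ in (11) leads to the following (Esposito et al., 2021, Corollary 4):
$$\mathbb{P}\left(|\text{gen-err}(\mathcal{A},S)| \geq \eta\right) \leq 2\exp\left(-\left(\frac{n\eta^2}{2\sigma^2} - \mathcal{L}(S\to\mathcal{A}(S))\right)\right). \tag{12}$$
Thus, in this case, if one can prove that the maximal leakage between the input and output of a learning algorithm $\mathcal{A}$ trained on $S$ is bounded, then the probability of the generalization error of $\mathcal{A}$ being larger than any constant $\eta$ will decay exponentially fast in the number of samples $n$.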
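To make the difference between the two kinds of guarantees concrete, the following sketch evaluates them numerically. It assumes, as our reading of the cited results rather than a verbatim transcription, that the expected-value bound has the form $\sqrt{2\sigma^2 I/n}$ (Xu and Raginsky, 2017, Theorem 1) and that the high-probability bound has the form $2\exp(\mathcal{L} - n\eta^2/(2\sigma^2))$ (Esposito et al., 2021, Corollary 4). All function names are hypothetical.

```python
import math

def expected_gen_error_bound(mutual_info, n, sigma):
    """Assumed form of the mutual-information bound on |E[gen-err]|."""
    return math.sqrt(2.0 * sigma ** 2 * mutual_info / n)

def tail_bound(leakage, n, eta, sigma):
    """Assumed form of the maximal-leakage bound on P(|gen-err| >= eta), capped at 1."""
    return min(1.0, 2.0 * math.exp(leakage - n * eta ** 2 / (2.0 * sigma ** 2)))

def samples_needed(leakage, eta, sigma, delta):
    """Smallest n for which the tail bound above is at most delta."""
    return math.ceil(2.0 * sigma ** 2 * (leakage + math.log(2.0 / delta)) / eta ** 2)
```

For a loss bounded in $[0,1]$ one may take `sigma = 0.5`. The required sample size grows only linearly in the leakage and logarithmically in `1 / delta`, which is the exponential decay discussed above.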
2.2 Problem Setup
We consider iterative algorithms, where each update is of the following form:
$$W_t = g(W_{t-1}) - \eta_t F(W_{t-1}, Z_t) + \xi_t, \tag{13}$$
where $Z_t\subseteq S$ (sampled according to some distribution), $g:\mathbb{R}^d\to\mathbb{R}^d$ is a deterministic function, $F(W_{t-1},Z_t)$ computes a direction (e.g., gradient), $\eta_t$ is the step-size, and $\xi_t$ is random noise. We will assume for the remainder of this paper that $\xi_t$ has an absolutely continuous distribution. Let $T$ denote the total number of iterations, $W^t=(W_1,\ldots,W_t)$, and $Z^t=(Z_1,\ldots,Z_t)$. The algorithms under consideration further satisfy the following two assumptions:
- Assumption 1 (Sampling): The sampling strategy is agnostic to parameter vectors:
$$P(Z_{t+1}\,|\,Z^t, W^t, S) = P(Z_{t+1}\,|\,Z^t, S). \tag{14}$$
- Assumption 2 ($L_p$-Boundedness): For some $p>0$ and $L>0$, $\sup_{w,z}\|F(w,z)\|_p\leq L$.
As a consequence of the first assumption and the structure of the iterates, we get:
$$P(W_{t+1}\,|\,W^t, Z^T, S) = P(W_{t+1}\,|\,W_t, Z_{t+1}). \tag{15}$$
The above setup was proposed by Pensia et al. (2018), who specifically studied the case $p=2$. Denoting by $W$ the final output of the algorithm (some function of $W^T$), they show that
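Theorem 2.2 (Pensia et al., 2018, Theorem 1).
If the boundedness assumption holds for $p=2$ and $\xi_t\sim\mathcal{N}(0,\sigma_t^2 I_d)$, then
$$I(S;W) \leq \sum_{t=1}^{T}\frac{d}{2}\log\left(1+\frac{\eta_t^2L^2}{d\,\sigma_t^2}\right). \tag{16}$$
By virtue of inequality (10), this yields a bound on the expected generalization error.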
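As a rough numerical companion to the theorem above, the sketch below evaluates a mutual-information bound of the form $\sum_t \frac{d}{2}\log\left(1+\frac{\eta_t^2L^2}{d\sigma_t^2}\right)$. This form is our recollection of Pensia et al. (2018, Theorem 1) and should be treated as an assumption; the function name and arguments are hypothetical.

```python
import math

def pensia_mi_bound(step_sizes, noise_stds, L, d):
    """Assumed per-iteration sum: (d / 2) * log(1 + eta_t^2 L^2 / (d sigma_t^2))."""
    return sum(
        0.5 * d * math.log1p((eta * L) ** 2 / (d * sigma ** 2))
        for eta, sigma in zip(step_sizes, noise_stds)
    )
```

Under this form, each iteration contributes at most $\eta_t^2L^2/(2\sigma_t^2)$ regardless of the dimension, because $\log(1+u)\leq u$.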
In this work, we derive bounds on the maximal leakage $\mathcal{L}(S\to W)$ for iterative noisy algorithms, which leads to high-probability bounds on the generalization error (cf. equation (12)). We consider different scenarios in which $F$ is bounded in $L_1$, $L_2$, or $L_\infty$ norm, and the added noise is Laplace, Gaussian, or Uniform. It is worth noting that the bounds we derive depend on $F$ only through the boundedness assumption (Assumption 2 above). Considering $F$ to be a gradient yields the most (practically) interesting scenario in which our results hold, as it represents a widely used family of learning algorithms. However, we do not leverage any structure that is particular to gradients (beyond the boundedness assumption).
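The setup of this section can be sketched in a few lines. The code below is a minimal illustration, not the authors' implementation: it assumes the identity map for the deterministic function, uniform mini-batch sampling that ignores the current iterate (in the spirit of Assumption 1), $L_2$ clipping of the direction to enforce Assumption 2 with $p=2$, and isotropic Gaussian noise. The names `clip_l2`, `noisy_iterative`, and `grad_fn` are hypothetical.

```python
import numpy as np

def clip_l2(v, L):
    """Rescale v so that its Euclidean norm is at most L."""
    norm = np.linalg.norm(v)
    return v if norm <= L else v * (L / norm)

def noisy_iterative(w0, data, grad_fn, step_sizes, noise_stds, L, rng, batch_size=1):
    """Run W_t = W_{t-1} - eta_t * F(W_{t-1}, Z_t) + xi_t with a clipped direction F."""
    w = np.asarray(w0, dtype=float)
    n = len(data)
    for eta, sigma in zip(step_sizes, noise_stds):
        # The mini-batch is drawn without looking at the current iterate w.
        idx = rng.choice(n, size=batch_size, replace=False)
        direction = np.mean([grad_fn(w, data[i]) for i in idx], axis=0)
        direction = clip_l2(direction, L)        # enforces ||F||_2 <= L
        noise = sigma * rng.standard_normal(w.shape)
        w = w - eta * direction + noise
    return w
```

A typical call passes a per-sample gradient, for example `grad_fn = lambda w, z: w - z` for mean estimation under squared loss, together with `rng = np.random.default_rng(0)`.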
2.3 Notation
Given , , and , let denote the -ball of radius and center , and let denote its corresponding volume. When the dimension is clear from the context, we may drop the superscript and write . Given a set , we denote its complement by . The -th component of will be denoted by .
We denote the pdf of the noise by . The following functional will be useful for our study: given , , a pdf , and an , define
[TABLE]
We denote the "positive octant" by , i.e.,
[TABLE]
Since we will mainly consider pdfs that are symmetric (Gaussian, Laplace, uniform), the functional "restricted" to will be useful:
[TABLE]
2.4 General Bound
Proposition 2.3.
Suppose is maximized for . If Assumptions 1 and 2 hold for some , then
[TABLE]
where is defined in equation (17).
The above bound is appealing as it implicitly poses an optimization problem: given a constraint on the noise pdf (say, a bounded variance), one may choose as to minimize the upper bound in equation (20). Moreover, despite its generality, we show that it is tight in several interesting cases, including when and is the Gaussian pdf.
In the next section, we consider several scenarios for different values of and different noise distributions. As a testament to the tractability of maximal leakage, we derive exact semi-closed form expressions for the bound of Proposition 2.3. Finally, it is worth noting that the form of the bound allows us to choose different noise distributions at different time steps, but these examples are outside the scope of this paper.
Proof 2.4.
We proceed as in the work of Pensia et al. (2018):
[TABLE]
where the first inequality follows from Lemma 2 of Pensia et al. (2018) and the data processing inequality for maximal leakage (Issa et al., 2020, Lemma 1), the second inequality follows from Lemma 2.1, and the equality follows from (15). Now,
[TABLE]
where the last equality follows from a change of a variable . Finally, since by assumption, we can further upper-bound the above by:
[TABLE]
where the last equality follows from the assumptions on .
3 Boundedness in $L_2$-Norm
When $F$ computes a gradient, boundedness in $L_2$-norm is a common assumption. It is commonly enforced, for instance, using gradient clipping (Abadi et al., 2016a, b; Chen et al., 2020).
Theorem 3.1.
If the boundedness assumption holds for and , then
[TABLE]
*where . *
Note that even if the parameter is large (e.g., Lipschitz constant of a neural network (Negrea et al., 2019)), it appears in (29) normalized by so its effect is significantly dampened (as is also typically very large).
Finally, note that the bound in Proposition 2.3 is increasing in : this can be seen from line (26), where the supremum over can be further upper-bounded by a supremum over for . Therefore for , the bound induced by Proposition 2.3 is smaller. The bound in Theorem 3.1 corresponds to and goes to 0 (as grows), hence the bound induced by Proposition 2.3 goes to 0 for all .
Proof 3.2.
The conditions of Proposition 2.3 are satisfied, thus it is sufficient to prove the bound for (cf. discussion above):
[TABLE]
Hence, it remains to show that the second term inside the matches that of equation (29). To that end, note that the point in that minimizes the distance to is given . So we get
[TABLE]
Then,
[TABLE]
To evaluate this integral, we use spherical coordinates (details in Appendix B). Then,
[TABLE]
Combining equations (31) and (35) yields (29).
Remark 3.3.
One could also derive a semi-closed form bound for the case in which the added noise is uniform.
4 Boundedness in $L_\infty$-Norm
The bound in Proposition 2.3 makes minimal assumptions about the pdf . In many practical scenarios we have more structure we could leverage. In particular, we make the following standard assumptions in this section:
- $\xi_t$ is composed of i.i.d. components. Let $f_0$ be the pdf of a component, then $f(x)=\prod_{i=1}^{d}f_0(x_i)$.
- $f_0$ is symmetric around 0 and non-increasing over $[0,\infty)$.
In this setting, Proposition 2.3 reduces to a very simple form for $p=\infty$:
Theorem 4.1.
Suppose $f$ satisfies the above assumptions. If Assumptions 1 and 2 hold for $p=\infty$, then
[TABLE]
Note that the bounded-$L_\infty$ assumption is weaker than the bounded $L_2$-norm assumption. Moreover, the assumption of having a bounded $L_\infty$-norm is satisfied in Pichapati et al. (2019), where the authors clipped the gradient in terms of the $L_\infty$-norm, thus "enforcing" the assumption. On the other hand, the theorem has an intriguing form as, under standard assumptions, the bound depends on $f$ only through $f_0(0)$. This naturally leads to an optimization problem: given a certain constraint on the noise, which distribution minimizes $f_0(0)$? The following theorem shows that, if the noise is required to have a bounded variance, then the minimizing $f_0$ corresponds to the uniform distribution:
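Theorem 4.2.
Let $\mathcal{F}$ be the family of probability densities (over $\mathbb{R}$) satisfying for each $f\in\mathcal{F}$:
1. $f$ is symmetric around 0.
2. $f$ is non-increasing over $[0,\infty)$.
3. $\mathbb{E}_{X\sim f}[X^2]\leq\sigma^2$.
Then, the distribution minimizing $f(0)$ over $\mathcal{F}$ is the uniform distribution $\mathcal{U}(-\sigma\sqrt{3},\sigma\sqrt{3})$.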
That is, uniform noise is optimal in the sense that it minimizes the upper bound in Theorem 4.1 under bounded variance constraints. The proof of Theorem 4.2 is deferred to Appendix D.
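The effect of the noise shape can be illustrated with a small comparison at equal variance. The sketch below relies on our own evaluation of the relevant integral for a product density with symmetric, non-increasing components and an $L_\infty$-ball of radius $r$: the supremum factorizes across coordinates, each coordinate contributes $1+2rf_0(0)$, and the per-step quantity becomes $d\log(1+2rf_0(0))$. This expression is an assumption for illustration and is not claimed to be the paper's exact statement; the function names are hypothetical.

```python
import math

def f0_at_zero(kind, sigma):
    """Value at 0 of a zero-mean density with variance sigma^2."""
    if kind == "uniform":      # uniform on [-sqrt(3) sigma, sqrt(3) sigma]
        return 1.0 / (2.0 * math.sqrt(3.0) * sigma)
    if kind == "gaussian":
        return 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    if kind == "laplace":      # scale b = sigma / sqrt(2), density 1 / (2 b) at 0
        return 1.0 / (math.sqrt(2.0) * sigma)
    raise ValueError(f"unknown noise kind: {kind}")

def per_step_bound(kind, sigma, r, d):
    """Assumed per-step expression d * log(1 + 2 r f0(0)); r plays the role of eta_t * L."""
    return d * math.log1p(2.0 * r * f0_at_zero(kind, sigma))
```

At equal variance the ordering is uniform, then Gaussian, then Laplace, which is consistent with uniform noise minimizing the density at the origin.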
4.1 Proof of Theorem 4.1
Since the assumptions of Proposition 2.3 hold, then
[TABLE]
It remains to show that (i.e., the second term inside the in Equation 17) is equal to . We will derive a recurrence relation for in terms of . To simplify the notation, we drop the subscript and ignore the dependence of on , , and , so that we simply write (and correspondingly, , cf. Equation 19).
By symmetry, . Letting , we will decompose the integral over into two disjoint subsets: 1) , in which case can take any value in , and 2) , in which case must satisfy .
[TABLE]
The innermost integral of line (40) is independent of so that the outer integral is equal to . Similarly, the innermost integral of line (39) is independent of , and the supremum in the outer integral yields for every . Hence, we get
[TABLE]
the detailed proof of which is deferred to Appendix C. Finally, it is straightforward to check that , hence .
5 Boundedness in $L_1$-Norm
In this section, we consider the setting where Assumption 2 holds for $p=1$. By Proposition 2.3, any bound derived for $p=2$ holds for $p=1$ as well; in particular, Theorem 3.1 applies. Nevertheless, it is possible to compute a semi-closed form directly for $p=1$ (cf. Theorem 5.3 below).
We also consider the case in which the additive noise is Laplace, i.e., "matching" the $L_1$ constraint on the update function. Interestingly, we show that in this case the limit of maximal leakage, as $d$ goes to infinity, is finite.
5.1 Bound for Laplace noise
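We say $X$ has a Laplace distribution, denoted by $X\sim\mathrm{Lap}(\mu,b)$, if its pdf is given by $f(x)=\frac{1}{2b}e^{-|x-\mu|/b}$ for $x\in\mathbb{R}$, for some $\mu\in\mathbb{R}$ and $b>0$. The corresponding variance is given by $2b^2$.
Theorem 5.1.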
If the boundedness assumptions holds for and is composed of i.i.d components, each of which is , then
[TABLE]
where . Consequently, for fixed ,
[TABLE]
Proof 5.2.
We give a high-level description of the proof (as similar techniques have been used in proofs of earlier theorems) and defer the details to Appendix E. Since the multivariate Laplace distribution (for i.i.d variables) depends on the -norm of the corresponding vector of variables, we need to solve the following problem: given and , compute
[TABLE]
The closest element in will lie on the hyperplane defining that is in the same octant as , so the problem reduces to projecting a point on a hyperplane in -distance (the proof in the appendix does not follow this argument but arrives at the same conclusion). Then, we need to compute . We use a similar approach as in the proof of Theorem 4.1, that is, we split the integral and derive a recurrence relation.
5.2 Bound for Gaussian noise
Finally, we derive a bound on the induced leakage when the added noise is Gaussian:
Theorem 5.3.
If the boundedness assumption holds for $p=1$ and $\xi_t\sim\mathcal{N}(0,\sigma_t^2 I_d)$, then
[TABLE]
In order to prove Theorem 5.3 one has to solve a problem similar to the one introduced in Theorem 5.1 (cf. equation (44)). However, in this case a different norm is involved: i.e., given and , one has to compute
[TABLE]
Again, one can argue that the point achieving the infimum lies on the hyperplane defining that is in the same octant as . In other words, the minimizer is such that the sign of each component is the same sign as the corresponding component of (and lies on the boundary of ). Thus, we are projecting a point on the corresponding face of the -ball. The length of the projection is then appropriately lower-bounded and the induced integral is solved by an opportune choice of change of variables. The details of the proof are given in Appendix F.
Acknowledgment
The work in this manuscript was supported in part by the Swiss National Science Foundation under Grant 200364 and by the University Research Board at the American University of Beirut (Beirut, Lebanon).
Appendix A Proof of Lemma 2.1
Recall the definition of maximal leakage and conditional maximal leakage:
Definition A.1 (Maximal Leakage (Issa et al., 2020, Definition 1)).
Given two random variables with joint distribution ,
[TABLE]
where takes values in a finite, but arbitrary, alphabet, and is the optimal estimator (i.e., MAP) of given .
Similarly,
Definition A.2 (Conditional Maximal Leakage (Issa et al., 2020, Definition 6)).
Given three random variables with joint distribution ,
[TABLE]
where takes values in a finite, but arbitrary, alphabet, and and are the optimal estimators (i.e., MAP) of given and given , respectively.
It then follows that
[TABLE]
where the last inequality follows from the fact that implies .
The fact that
[TABLE]
has been shown for discrete alphabets in Theorem 6 of (Issa et al., 2020). The extension to continuous alphabets is similar (with integrals replacing sums, and pdfs replacing pmfs, where appropriate).
Finally, it remains to show equation (5). We proceed by induction. The case has already been shown above. Assume the inequality is true up to variables, then
[TABLE]
where the second inequality follows from the induction hypothesis.
Appendix B Proof of equation (35)
To evaluate the integral in line (34), we write it in spherical coordinates:
[TABLE]
Now, note that for any , , and
[TABLE]
where (a) follows from the change of variable , (b) follows from the change of variable , (c) follows from the definition of the Beta function: , and the last equality is a known property of the Beta function (). Consequently,
[TABLE]
To evaluate the innermost integral, the following identity will be useful:
[TABLE]
where the first equality follows from the change of variable . Then,
[TABLE]
where (a) follows from the change of variable , and (b) follows from (61).
Finally, combining equations (58), (60), and (65), we get
[TABLE]
Appendix C Proof of equation (41)
The innermost integral of line (40) evaluates to
[TABLE]
where the first equality follows from the monotonicity assumptions, the second from a change of variable, and the third from the symmetry assumption. Similarly, the innermost integral of line (39) evaluates to
[TABLE]
Combining equations (40), (68), and (71), we get
[TABLE]
where the second equality follows from the fact that is maximized at 0, and is a -dimensional hypercube of side (with volume ). Now,
[TABLE]
Appendix D Proof of Theorem 4.2
Consider any , and let
[TABLE]
Then
[TABLE]
where the second equality follows from the symmetry assumption. Note that is a valid probability density over , and let . Then, by previous equation,
[TABLE]
Hence,
[TABLE]
which is achieved by the uniform distribution .
Appendix E Proof of Theorem 5.1
First, we show that the limit of the right-hand side of equation (42) is given by the right-hand side of equation (43). Note that
[TABLE]
On the other hand,
[TABLE]
Since is finite, the limit and the sum are interchangeable, so that the above two equations yield the desired limit.
We now turn to the proof of inequality (42). For notational convenience, set (so that for all ) and . Since the noise satisfies the assumptions of Proposition 2.3, we get
[TABLE]
Recall (cf. equation (17)) is defined to be the second term inside the . Similarly to the strategy adopted in the proof of Theorem 4.1, we will derive a recurrence relation for in terms of , as such we will again suppress the dependence on , , and in the notation, and write only (and correspondingly ).
Lemma E.1.
Given ( defined in equation (18)),
[TABLE]
Proof E.2.
Since we are minimizing a continuous function over a compact set, then the infimum can be replaced with a minimum.
Claim:* There exists a minimizer such that for all , .*
Proof of Claim:* Consider any such that there exists satisfying . Note that by assumption. Now define . Then so that . Moreover, as desired. *
Now,
[TABLE]
Given the above lemma, we will derive the recurrence relation by decomposing the integral over into two disjoint subsets: 1) , in which case can take any value in , and 2) , in which case must satisfy .
[TABLE]
Hence,
[TABLE]
It is easy to check that , and hence
[TABLE]
satisfies the base case and the recurrence relation. Re-substituting and for and , respectively, yields the desired result in equation (42).
Appendix F Proof of Theorem 5.3
Let . Since the noise satisfies the assumptions of Proposition 2.3, we get
[TABLE]
Consider
[TABLE]
First we solve . If , then the infimum is achieved for as well (one can simply flip the sign of any negative component, which cannot increase the distance). In the subspace , the boundary of the ball is defined by the hyperplane . As such, finding the minimum distance corresponds to projecting the point to the given hyperplane:
[TABLE]
Now,
[TABLE]
For notational convenience, we drop the subscript in the following. We perform a change of variable as follows: . Hence, for , . Since , then . For , define . Then,
[TABLE]
where (a) follows from the fact that the innermost integral corresponds to the volume of a scaled probability simplex (scaled by ), and (b) follows from the same computations as in Equations 62 to 65 (with ). Noting that yields the desired term in Equation 45.
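The geometric step at the start of this proof (Euclidean distance from a point in the positive octant to the $L_1$-ball, compared with the distance to the hyperplane $\sum_i x_i = r$) can be checked numerically. The sketch below is illustrative only: it uses the standard sort-based Euclidean projection onto the $L_1$-ball, which is a swapped-in technique and not part of the paper's argument, and the function names are hypothetical. Since the part of the ball in the positive octant lies in the half-space $\sum_i x_i\leq r$, the hyperplane distance $(\sum_i w_i - r)/\sqrt{d}$ never exceeds the true distance, and the two coincide when the hyperplane projection stays in the octant.

```python
import numpy as np

def project_l1_ball(w, r):
    """Euclidean projection of w onto {x : ||x||_1 <= r}, r > 0 (sort-based method)."""
    w = np.asarray(w, dtype=float)
    if np.abs(w).sum() <= r:
        return w.copy()
    u = np.sort(np.abs(w))[::-1]                  # magnitudes in decreasing order
    css = np.cumsum(u)
    k = np.arange(1, u.size + 1)
    rho = np.nonzero(u - (css - r) / k > 0)[0][-1]
    theta = (css[rho] - r) / (rho + 1)            # soft-threshold level
    return np.sign(w) * np.maximum(np.abs(w) - theta, 0.0)

def hyperplane_distance(w, r):
    """Euclidean distance from w to the hyperplane sum_i x_i = r (for sum(w) >= r)."""
    w = np.asarray(w, dtype=float)
    return (w.sum() - r) / np.sqrt(w.size)
```

For `w = [2, 2]` and `r = 1` both distances agree, whereas for `w = [3, 0.1]` the hyperplane projection leaves the octant and the hyperplane distance is strictly smaller.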
References
- Abadi et al. (2016a) Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016a.
- Abadi et al. (2016b) Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 308–318, 2016b.
- Bartlett and Mendelson (2003) Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. J. Mach. Learn. Res., 3:463–482, March 2003. ISSN 1532-4435.
- Bartlett et al. (2017) Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper/2017/file/b22b257ad0519d4500539da3c8bcf4dd-Paper.pdf.
- Bartlett et al. (2019) Peter L. Bartlett, Nick Harvey, Christopher Liaw, and Abbas Mehrabian. Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. Journal of Machine Learning Research, 20(63):1–17, 2019. URL http://jmlr.org/papers/v20/17-612.html.
- Bassily et al. (2018) Raef Bassily, Shay Moran, Ido Nachum, Jonathan Shafer, and Amir Yehudayoff. Learners that use little information. Volume 83 of Proceedings of Machine Learning Research, pages 25–55. PMLR, 07–09 Apr 2018.
- Bousquet et al. (2003) Olivier Bousquet, Stéphane Boucheron, and Gábor Lugosi. Introduction to statistical learning theory. In Olivier Bousquet, Ulrike von Luxburg, and Gunnar Rätsch, editors, Advanced Lectures on Machine Learning, volume 3176 of Lecture Notes in Computer Science, pages 169–207. Springer, 2003. ISBN 3-540-23122-6.
- Bu et al. (2019) Yuheng Bu, Shaofeng Zou, and Venugopal V. Veeravalli. Tightening mutual information based bounds on generalization error. In 2019 IEEE International Symposium on Information Theory (ISIT), pages 587–591, 2019. doi: 10.1109/ISIT.2019.8849590.
