Pointwise adaptive kernel density estimation under local approximate differential privacy
Martin Kroll

TL;DR
This paper develops a pointwise adaptive kernel density estimator under local approximate differential privacy, achieving near-optimal convergence rates while ensuring data privacy at the individual level.
Contribution
It introduces a privacy-preserving kernel density estimation method with adaptive bandwidth selection, providing theoretical guarantees and optimal convergence rates under local differential privacy.
Findings
Optimal convergence rate under privacy: n^{-(2s-1)/(2s+1)}
Adaptive estimator attains near-optimal rate with logarithmic factors
Method compatible with multiple statistical procedures in privacy-preserving data analysis
Pointwise adaptive kernel density estimation under local approximate differential privacy
Martin Kroll
CREST, ENSAE, Institut Polytechnique de Paris
Synopsis.
We consider non-parametric density estimation in the framework of local approximate differential privacy. In contrast to centralized privacy scenarios with a trusted curator, in the local setup anonymization must be guaranteed already on the individual data owners’ side and therefore must precede any data mining tasks. Thus, the published anonymized data should be compatible with as many statistical procedures as possible. We suggest adding Laplace noise and Gaussian processes (both appropriately scaled) to kernel density estimators to obtain approximate differential private versions of the latter ones. We obtain minimax type results over Sobolev classes indexed by a smoothness parameter $s > 1/2$ for the mean squared error at a fixed point. In particular, we show that taking the average of private kernel density estimators from different data owners attains the optimal rate of convergence if the bandwidth parameter is correctly specified. Notably, the optimal convergence rate in terms of the sample size $n$ is $n^{-(2s-1)/(2s+1)}$ under local differential privacy and is thus deteriorated compared to the rate $n^{-(2s-1)/(2s)}$ which holds without privacy restrictions. Since the optimal choice of the bandwidth parameter depends on the smoothness $s$ and is thus not accessible in practice, adaptive methods for bandwidth selection are necessary and must, in the local privacy framework, be performed directly on the anonymized data. We address this problem by means of a variant of Lepski’s method tailored to the privacy setup and obtain general oracle inequalities for private kernel density estimators. In the Sobolev case, the resulting adaptive estimator attains the optimal rate of convergence at least up to extra logarithmic factors.
Key words and phrases:
Kernel density estimation. Approximate local differential privacy. Reproducing kernel Hilbert space. Adaptive estimation. Lepski’s method.
2010 Mathematics Subject Classification:
62G05 (primary), and 68P25 (secondary)
The author gratefully acknowledges financial support from GENES and from the French National Research Agency (ANR) under the grant Labex Ecodec (ANR-11-LABEX-0047).
1. Introduction
In the modern information era data are routinely collected in all areas of private and public life. Although the availability of massive data sets is essential to answer important scientific and societal questions, the individual data owners (who may be individuals, households, research institutions, companies, …) might refuse to share their possibly sensitive raw data with others. Moreover, in view of regularly reported data leaks, they may not even want to entrust their data to a central curator who stores the data and publishes anonymized summary statistics. In such a dilemma, the question whether and, if so, how data analytics can still be performed is of special importance. To evaluate this question, several aspects have to be taken into account.
Firstly, in the absence of a trusted curator, privacy of the data has to be achieved locally at the individual data owners’ level. The $i$-th data holder takes its datum, say $X_i$, as the input of a privacy mechanism and creates an output $Z_i$ that is considered sufficiently anonymized, for instance, in the sense of any of the privacy definitions discussed below. For the purpose of the present paper, a privacy mechanism is a Markov kernel $Q$ between measurable spaces $(\mathcal{X}, \mathscr{X})$ and $(\mathcal{Z}, \mathscr{Z})$ generating $Z_i$ given $X_i = x$ according to the distribution $Q(\cdot \mid X_i = x)$. This definition of local privacy is in contrast to the framework of centralized or global privacy where the trusted curator can take the whole data set $X_1, \ldots, X_n$ to create an output $Z$. (Footnote 1: Thus, the local privacy model can be seen as a proper submodel of the global one because the trusted curator can also mimic any conceivable procedure in the local model.)
Secondly, for the quantification of privacy, different solutions have been proposed so far (see [BD14], Section 2, for a comprehensive overview of existing privacy definitions).
In this paper, we will exclusively work in the framework of $\alpha$-differential privacy and its generalization $(\alpha,\beta)$-differential privacy as defined in Definition 2.1 below. These two privacy definitions are also referred to as pure and approximate differential privacy, respectively. Originally, these concepts were suggested for the anonymization of microdata tables in a global privacy setup, more precisely in a framework where queries are answered by a server that has direct access to the sensitive data [Dwo06, Dwo+06, Dwo08]. In the statistics community, working under privacy constraints has been popularized in the past decade, amongst others, through the articles [WZ10, HRW13] (in the global setup) and [DJW18] (in the local privacy setup). Another strict relaxation of pure differential privacy is random differential privacy as introduced in [HRW11].
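An alternative quantification of privacy can be given as follows: Let $\phi$ be a convex function from $[0,\infty)$ to $\mathbb{R} \cup \{+\infty\}$ with $\phi(1) = 0$. Then, the associated $\phi$-divergence between two distributions $P_1$ and $P_2$ is
$$D_\phi(P_1 \,\|\, P_2) = \int \phi\Big(\frac{p_1}{p_2}\Big)\, p_2 \, d\mu,$$
where $\mu$ is a measure such that $P_1 \ll \mu$, $P_2 \ll \mu$, and $p_1$, $p_2$ denote the corresponding Radon-Nikodym densities. Then, the mechanism $Q$ is called $\alpha$-$\phi$-divergence private if
$$\sup_{x, x' \in \mathcal{X}} D_\phi\big(Q(\cdot \mid X = x) \,\|\, Q(\cdot \mid X = x')\big) \leq \alpha.$$
The intersection of these two concepts is non-empty: For instance, taking $\phi(u) = |u - 1|/2$, the $\phi$-divergence is the total variation distance, and the resulting $\alpha$-$\phi$-divergence privacy is equivalent to $(0,\alpha)$-differential privacy.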
Thirdly, the published data should ideally be multi-purpose in the sense that they can serve as input data for several types of analyses. Thus, when the unmasked data are for instance a sample from an unknown probability distribution, the anonymized data should contain as much information as possible about the whole distribution and not only about certain characteristics. One main motivation for this work is to introduce novel methodology in the framework of density estimation that aims to address this issue as well by proposing a local approximate differential private version of kernel density estimators, that is, of the whole function $t \mapsto K_h(t - X_i)$ for a bandwidth parameter $h > 0$, along with a study of their theoretical properties. Figure 1 gives a foretaste and provides a graphical representation of the general workflow developed in this paper.
Roadmap of the article
Throughout the article, we consider the paradigmatic example of non-parametric density estimation. For the sake of simplicity, we assume that each of $n$ data holders observes a size-one sample from a (in this paper) univariate target density $f$, but refuses to share this observation. In Section 2, we first introduce two mechanisms to mask the value of a kernel density estimator at a single fixed point $t$. The first approach is based on adding appropriately scaled Laplace noise. The second approach is based on adding Gaussian noise and can be extended, using the ideas introduced in [HRW13], to an anonymized version of the whole kernel density estimator (as a function from $\mathbb{R}$ to $\mathbb{R}$) via perturbation by a suitable Gaussian process. In Section 3, we consider estimation of the unknown density function under approximate differential privacy from a minimax point of view. As the performance measure to evaluate arbitrary estimators, we consider the mean squared error at a fixed point. Both the Laplace and the Gaussian perturbation approach attain the optimal rate $n^{-(2s-1)/(2s+1)}$ in terms of $n$ over Sobolev ellipsoids with smoothness index $s > 1/2$, which is slower than the rate $n^{-(2s-1)/(2s)}$ in the setup without privacy constraints. The Gaussian process approach, however, makes it possible to estimate the value of the density at any point of the observation window and not only at one single point that has to be chosen prior to the anonymization procedure. In addition, this approach enables the statistician to perform any kind of analysis that plugs kernel density estimators into other estimators. Investigating theoretical guarantees of such plug-in procedures, however, is outside the scope of this work and deferred to future research.
As usual for kernel density estimators, the choice of the bandwidth parameter is crucial. In the considered minimax framework over Sobolev classes, the optimal order of the bandwidth that leads to a rate optimal estimator depends on the smoothness index which is typically unknown. In Section 4, we apply a Lepski scheme tailored to the privacy framework to overcome this problem and obtain an adaptive choice of the bandwidth. This issue specifically arises in the local privacy setup since in the global framework the trusted curator can apply the existing plethora of methods for bandwidth selection on the unmasked data, and then only publish the resulting estimator with the adaptively determined bandwidth in its anonymized form. In order to perform the Lepski scheme, any data owner has to publish the kernel density estimator not only for one single bandwidth but for a finite set of potential bandwidths. Such a multiple output still guarantees the desired privacy condition provided that the additive noise is multiplied with a factor proportional to the number of potential bandwidths which is logarithmic in the number of data sources in our case. We derive general oracle type inequalities for the estimator resulting from the Lepski procedure adapted to the privacy framework. For the specific example of Sobolev ellipsoids, the rates of convergence are merely deteriorated by logarithmic factors with respect to the case of a priori known smoothness.
2. Privacy mechanisms
2.1. Definition of approximate differential privacy
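Let $(\mathcal{X}, \mathscr{X})$ and $(\mathcal{Z}, \mathscr{Z})$ be measurable spaces. A privacy mechanism is a Markov kernel $Q$ from $(\mathcal{X}, \mathscr{X})$ to $(\mathcal{Z}, \mathscr{Z})$ with the interpretation that, given original data $X = x$, an anonymized output $Z$ is randomly drawn from the probability measure $Q(\cdot \mid X = x)$. In the non-interactive setup that we are going to consider, we work under the following definition of approximate or $(\alpha,\beta)$-differential privacy.
Definition 2.1.
Let $\alpha \geq 0$ and $\beta \in [0,1]$. We say that $Z$ is a local $(\alpha,\beta)$-differentially private view of $X$ if for all $x, x' \in \mathcal{X}$ and all $A \in \mathscr{Z}$, the estimate
$$Q(A \mid X = x) \leq e^{\alpha}\, Q(A \mid X = x') + \beta \tag{1}$$
holds true.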
Let us emphasize that in Definition 2.1 the spaces $\mathcal{X}$ and $\mathcal{Z}$ do not need to coincide. In fact, in Example 2.9 the space $\mathcal{X}$ will be the real line equipped with its Borel sets and $\mathcal{Z}$ a measurable space of random functions. In the literature, the case $\beta = 0$ is also referred to as $\alpha$-differential privacy or pure differential privacy. Evidently, the privacy condition (1) becomes more restrictive for smaller values of the two parameters $\alpha$ and $\beta$. Although Definition 2.1 smoothly bridges the cases $\beta = 0$ and $\beta > 0$, the classical anonymization techniques used for $\beta = 0$ and $\beta > 0$ are essentially different: In the case $\beta = 0$, Laplace perturbation as well as randomization techniques as considered in [DJW18, RS18] can be used. In the case $\beta > 0$, adding appropriately scaled Gaussian noise has been suggested in [HRW13]. However, as proved in [HLM15], appropriately scaled Laplace noise can also lead to approximately differential private outputs (see Proposition 2.2 below). In the sequel, we discuss how to achieve approximate differential privacy by means of these classical subroutines and how they can be extended to deal with functional data as well.
2.2. Univariate output using Laplace noise
First, we consider the case that both the input and the output of the privacy mechanism are univariate and real-valued, that is, $\mathcal{X} = \mathcal{Z} = \mathbb{R}$. For this case, we consider Laplace perturbation which is also used to derive an upper bound in Section 3. More precisely, let $g(X)$ be a quantity derived from the datum $X$ that should be masked. Define the sensitivity of $g$ as
$$\Delta(g) = \sup_{x, x' \in \mathcal{X}} |g(x) - g(x')|.$$
Recall that the univariate Laplace distribution, denoted by $\mathcal{L}(b)$, is given by the probability density function $x \mapsto \frac{1}{2b} e^{-|x|/b}$ (we include also the case $b = 0$; then the Laplace distribution is, by convention, the Dirac measure concentrated at $0$). In particular, the variance of an $\mathcal{L}(b)$ distributed random variable is $2b^2$. The following proposition establishes approximate differential privacy by Laplace perturbation.
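Proposition 2.2 (See [HLM15], Example 5).
Let $\alpha \geq 0$, $\beta \in [0,1]$. Then
$$Z = g(X) + \xi$$
with $\xi \sim \mathcal{L}(b)$ for $b \geq \Delta(g) / (\alpha - \log(1-\beta))$ provides an $(\alpha,\beta)$-differential private view of $g(X)$ (and of $X$ as well).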
A benefit of Proposition 2.2 in contrast to the often proposed perturbation by Gaussian noise to establish approximate differential privacy is that it allows one to deal with the cases $\beta = 0$ and $\beta > 0$ by the same approach. Moreover, letting the parameter $\beta$ vary permits natural interpretations: If $\beta = 0$, the variance of $\xi$ corresponds to the one that is usually encountered in the case of pure differential privacy. When $\beta$ tends to one, the privacy constraint gets weaker and the variance of the centred noise tends to $0$. In the extreme case $\beta = 1$ it is even allowed to publish $g(X)$ directly.
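As a minimal sketch of the Laplace perturbation discussed above, the following Python snippet computes a noise scale and perturbs a value. The names `alpha` and `beta` for the two privacy parameters, the helper names, and the scale formula sensitivity / (alpha - log(1 - beta)) are assumptions based on our reading of Proposition 2.2 and Example 5 in [HLM15]; this is an illustration, not the author's implementation.

```python
import math
import random


def laplace_scale(sensitivity, alpha, beta):
    """Smallest scale b assumed admissible: sensitivity / (alpha - log(1 - beta))."""
    if beta >= 1.0:
        return 0.0  # extreme case beta = 1: the value may be published directly
    denom = alpha - math.log(1.0 - beta)
    if denom <= 0.0:
        raise ValueError("alpha = 0 and beta = 0 admit no finite noise level")
    return sensitivity / denom


def sample_laplace(b, rng):
    """Draw from the centred Laplace distribution with scale b (variance 2 * b**2)."""
    if b == 0.0:
        return 0.0  # convention: Dirac measure at 0
    # The difference of two independent Exp(1) variables is standard Laplace.
    return b * (rng.expovariate(1.0) - rng.expovariate(1.0))


def privatize(value, sensitivity, alpha, beta, rng):
    """Return value plus Laplace noise calibrated to the sensitivity."""
    return value + sample_laplace(laplace_scale(sensitivity, alpha, beta), rng)
```

For beta = 0 the scale reduces to sensitivity / alpha, the usual calibration under pure differential privacy, and it shrinks to zero as beta approaches one.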
We now introduce kernel density estimators as the guiding example that we have in mind for the function $g$ for the rest of the paper.
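Example 2.3.
Let $X_1, \ldots, X_n$ be i.i.d. according to an unknown probability density function $f$. Let $t \in \mathbb{R}$ be fixed. Then the $i$-th data holder, who observes $X_i$, can compute
$$g(X_i) = K_h(t - X_i) = \frac{1}{h} K\Big(\frac{t - X_i}{h}\Big)$$
for a bounded kernel function $K$, that is, $K$ is integrable with $\int K(x)\,dx = 1$ and $\|K\|_\infty < \infty$. The quantity $K_h(t - X_i)$ will play the role of $g(X)$ in Proposition 2.2. By the triangle inequality $\Delta(g) \leq 2\|K\|_\infty / h$, and one can take any $b \geq 2\|K\|_\infty / (h(\alpha - \log(1-\beta)))$ to obtain an approximate differential private view of $K_h(t - X_i)$. Note that $t$ has been fixed in advance before the anonymization procedure.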
2.3. Multivariate output
In principle, multivariate output could also be dealt with by adding independent Laplace noise to each of the components of the vector to be published. In this case, both $\alpha$ and $\beta$ for each component have to be appropriately scaled in order to obtain the desired level of approximate differential privacy for the whole vector (the scaling can be carried out, for instance, as described in Lemma 2.16 below). This approach, however, results in an increase of the Laplace noise added at any single point where the kernel density estimator is evaluated, and thus might deteriorate the performance of subsequent analyses more than necessary. We do not further pursue this course here, since we will introduce a method for the anonymization of functional data that does not inflate the noise at single points in the next subsection. Having stated this general method, we can, for instance, anonymize the whole function $t \mapsto K_h(t - X_i)$ in Example 2.3, and as a by-product we obtain $(\alpha,\beta)$-differential privacy for all pointwise evaluations, without any extra cost on the noise to be added. To achieve anonymization of functional data, adding Gaussian processes with appropriately chosen covariance structure turns out to be convenient. This idea has been originally suggested in [HRW13], but we state the essential steps here again for a clear exposition, and refer to [HRW13] only for the proofs. The first stopover on our way along the results from [HRW13] is the following proposition that provides a condition under which approximate differential privacy of a vector is obtained by adding multivariate Gaussian noise with not necessarily uncorrelated components.
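Proposition 2.4.
Let $\alpha > 0$, $\beta \in (0,1)$. Let further $M \in \mathbb{R}^{d \times d}$ be a positive definite matrix and $g \colon \mathcal{X} \to \mathbb{R}^d$ for some $d \in \mathbb{N}$. Assume that
$$\|M^{-1/2}(g(x) - g(x'))\|_2 \leq \Delta \tag{2}$$
for all $x, x' \in \mathcal{X}$. Then, $Z$ defined via
$$Z = g(X) + \frac{c(\beta)\Delta}{\alpha}\, \xi, \qquad \xi \sim \mathcal{N}_d(0, M),$$
is $(\alpha,\beta)$-differential private provided that
$$c(\beta) \geq \sqrt{2 \log(2/\beta)}. \tag{3}$$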
Proposition 2.4 will unfold its full potential in the next subsection where the condition (2) will be reformulated appropriately. For the univariate case (taking $d = 1$), Proposition 2.4 directly provides a result similar to the one in Example 2.3, again with $t$ fixed before anonymization.
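Example 2.5.
We consider $g$ as in Example 2.3 and apply Proposition 2.4 for $d = 1$ and $M = 1$. As in Example 2.3,
$$|g(x) - g(x')| \leq \frac{2\|K\|_\infty}{h},$$
and one can take $\Delta = 2\|K\|_\infty / h$ in (2). Then, Proposition 2.4 guarantees that the $Z_i$, defined through
$$Z_i = K_h(t - X_i) + \frac{2\, c(\beta)\, \|K\|_\infty}{\alpha h}\, \xi_i, \qquad \xi_i \sim \mathcal{N}(0, 1),$$
is an $(\alpha,\beta)$-differential private view for $c(\beta) \geq \sqrt{2\log(2/\beta)}$ of $K_h(t - X_i)$ (and of $X_i$ as well).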
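The Gaussian perturbation of this example can be sketched in a few lines of Python. The calibration constant sqrt(2 log(2 / beta)) is taken from our recollection of [HRW13]; the names `alpha`, `beta`, `c_beta` and `gaussian_private_value` are hypothetical and serve illustration only.

```python
import math
import random


def c_beta(beta):
    """Smallest constant assumed admissible: sqrt(2 * log(2 / beta))."""
    return math.sqrt(2.0 * math.log(2.0 / beta))


def gaussian_private_value(value, delta, alpha, beta, rng):
    """Add centred Gaussian noise with standard deviation c(beta) * delta / alpha.

    delta is an upper bound on |g(x) - g(x')|, e.g. 2 * sup|K| / h for a
    kernel density estimator evaluated at a fixed point.
    """
    sd = c_beta(beta) * delta / alpha
    return value + sd * rng.gauss(0.0, 1.0)
```

Unlike the Laplace scale, this noise level diverges as beta tends to zero, so the sketch covers the case beta > 0 only.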
2.4. From multivariate to functional output
The anonymization techniques used in Examples 2.3 and 2.5 both suffer from the drawback that the output provides information on the kernel density estimator for one single $t$ only. The aim of this section, based on Proposition 2.4 and ideas introduced in [HRW13] in the context of global privacy, is to construct a privatized version of the whole function $t \mapsto K_h(t - X_i)$ by adding a suitable Gaussian process to the kernel density estimator. As a consequence, the kernel density estimator anonymized in this vein can be evaluated at any single $t \in \mathbb{R}$.
For univariate and multivariate real-valued outputs of privacy mechanisms, the role of the $\sigma$-field $\mathscr{Z}$ in Definition 2.1 is canonically taken by the Borel sets on $\mathbb{R}$ or $\mathbb{R}^d$. In the case of functional output in $\mathbb{R}^T$ (where $T$ is an arbitrary set), its role is taken by the $\sigma$-field which is generated by the cylinder sets
$$C_{S,B} = \{ f \in \mathbb{R}^T : (f(t_1), \ldots, f(t_k)) \in B \}$$
where $S = (t_1, \ldots, t_k)$ ranges over all finite subsets of $T$ and $B$ over the Borel sets of $\mathbb{R}^k$. The following result is a reformulation of Proposition 7 in [HRW13] and we omit its proof. See also Example 4 in [HLM15] for an alternative reasoning.
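Proposition 2.6.
Let $G$ be a sample path of a centred Gaussian process with covariance kernel $K$. For $t_1, \ldots, t_k \in T$, consider the Gram matrix
$$M(t_1, \ldots, t_k) = \big(K(t_i, t_j)\big)_{i,j=1}^{k}.$$
Let $g_X$ be a (random) function in a function class $\{g_x : x \in \mathcal{X}\} \subseteq \mathbb{R}^T$. Then, the release of
$$Z = g_X + \frac{c(\beta)\Delta}{\alpha}\, G$$
with $c(\beta)$ fulfilling (3) is $(\alpha,\beta)$-differential private (with respect to the cylinder $\sigma$-field) provided that
$$\sup_{x, x' \in \mathcal{X}} \; \sup_{k \in \mathbb{N}} \; \sup_{t_1, \ldots, t_k \in T} \Big\| M^{-1/2}(t_1, \ldots, t_k) \big( g_x(t_j) - g_{x'}(t_j) \big)_{j=1}^{k} \Big\|_2 \leq \Delta, \tag{4}$$
where $\Delta$ plays the same role as in (2).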
The main question arising from Proposition 2.6 is how condition (4), which looks unwieldy at first sight, can be verified. The solution consists in transferring the problem into a reproducing kernel Hilbert space (RKHS) setup. In fact, Proposition 2.6 can be applied effectively when the random functions to be masked belong to the RKHS which corresponds to the covariance kernel of the Gaussian process $G$.
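In order to formulate this next result from [HRW13], we need to introduce some basic notation concerning the considered RKHS (we refer the reader to [BT04] for a comprehensive introduction to RKHS theory). Let $K \colon T \times T \to \mathbb{R}$ be a positive definite kernel. Recall that a real-valued kernel is positive definite if
$$\sum_{i=1}^{k} \sum_{j=1}^{k} c_i c_j K(t_i, t_j) \geq 0 \tag{5}$$
holds for any $k \in \mathbb{N}$, $t_1, \ldots, t_k \in T$, and $c_1, \ldots, c_k \in \mathbb{R}$. For any $t \in T$, define the function $K_t$ by $K_t(\cdot) = K(t, \cdot)$. Then the set
$$\mathcal{H}_0 = \Big\{ \sum_{i=1}^{k} c_i K_{t_i} : k \in \mathbb{N},\ c_i \in \mathbb{R},\ t_i \in T \Big\}$$
is a pre-Hilbert space with respect to the norm induced by the scalar product
$$\langle u, v \rangle = \sum_{i=1}^{k} \sum_{j=1}^{l} a_i b_j K(s_i, t_j)$$
for $u = \sum_{i=1}^{k} a_i K_{s_i}$, $v = \sum_{j=1}^{l} b_j K_{t_j}$. The RKHS $\mathcal{H}_K$ corresponding to the kernel $K$ is the Hilbert space resulting from the completion of $\mathcal{H}_0$ with respect to the RKHS norm $\|\cdot\|_{\mathcal{H}_K}$. The following two results are again taken from [HRW13].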
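Proposition 2.7 (See [HRW13], Proposition 8).
For $u \in \mathcal{H}_K$, where $\mathcal{H}_K$ is the RKHS corresponding to the kernel $K$, and for any finite sequence $t_1, \ldots, t_k$ of distinct points from $T$, we have
$$\Big\| M^{-1/2}(t_1, \ldots, t_k) \big( u(t_1), \ldots, u(t_k) \big)^\top \Big\|_2 \leq \|u\|_{\mathcal{H}_K}.$$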
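Corollary 2.8 (See [HRW13], Corollary 9).
For $\{g_x : x \in \mathcal{X}\} \subseteq \mathcal{H}_K$, the release of
$$Z = g_X + \frac{c(\beta)\Delta}{\alpha}\, G$$
with $c(\beta)$ as in (3) is $(\alpha,\beta)$-differential private with respect to the cylinder $\sigma$-field provided that
$$\Delta \geq \sup_{x, x' \in \mathcal{X}} \|g_x - g_{x'}\|_{\mathcal{H}_K} \tag{6}$$
and $G$ is the sample path of a centred Gaussian process with covariance kernel $K$ (given by the reproducing kernel of $\mathcal{H}_K$).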
We now apply Corollary 2.8 to kernel density estimators.
Example 2.9.
In the case of univariate density estimation the $i$-th data holder observes $X_i$ drawn from the target density $f$, and we want this data holder to be able to publish an approximately differential private version of the kernel density estimator
$$t \mapsto K_h(t - X_i)$$
based on the single observation $X_i$ only. In order to apply the above theory we have to assume that the kernel $K$ (Footnote 2: We slightly abuse notation by denoting both the kernel of the kernel density estimator and the corresponding kernel given through $(x, y) \mapsto K(x - y)$ by the letter $K$.) is also a positive definite kernel. Under this additional assumption, Corollary 2.8 shows that the perturbed kernel density estimator
$$Z_{i,h}(\cdot) = K_h(\cdot - X_i) + \frac{c(\beta)\Delta}{\alpha}\, G_i(\cdot),$$
where $G_i$ is a centred Gaussian process with covariance kernel $K_h$, ensures $(\alpha,\beta)$-differential privacy provided that (6) is satisfied. For instance, for the Gaussian kernel we have
$$\|K_h(\cdot - x) - K_h(\cdot - x')\|_{\mathcal{H}_{K_h}}^2 = 2\big(K_h(0) - K_h(x - x')\big) \leq 2 K_h(0) = \frac{2}{\sqrt{2\pi}\, h}$$
and we can take $\Delta = \sqrt{2 / (\sqrt{2\pi}\, h)}$ (the same argument working for any non-negative bounded kernel, and with a slight modification for any bounded kernel).
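To illustrate how a whole perturbed kernel density estimator could be released in practice, the sketch below evaluates it on a finite grid, where the Gaussian process reduces to a multivariate normal vector whose covariance is the Gram matrix of the rescaled kernel. It assumes the Gaussian kernel, the bound Delta = sqrt(2 K_h(0)) obtained from the RKHS norm of the difference of two kernel translates, and the constant sqrt(2 log(2 / beta)) recalled from [HRW13]; the names `alpha`, `beta`, the function names and the small diagonal `jitter` (added for numerical stability only) are hypothetical.

```python
import math
import random


def k_h(u, h):
    """Rescaled Gaussian kernel K_h(u) = K(u / h) / h."""
    return math.exp(-0.5 * (u / h) ** 2) / (h * math.sqrt(2.0 * math.pi))


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def private_kde_path(x, grid, h, alpha, beta, rng, jitter=1e-10):
    """Values of K_h(. - x) plus scaled Gaussian process noise on the grid."""
    n = len(grid)
    gram = [[k_h(s - t, h) + (jitter if i == j else 0.0)
             for j, t in enumerate(grid)] for i, s in enumerate(grid)]
    low = cholesky(gram)
    normals = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Correlated Gaussian vector with covariance equal to the Gram matrix.
    path = [sum(low[i][k] * normals[k] for k in range(i + 1)) for i in range(n)]
    delta = math.sqrt(2.0 * k_h(0.0, h))  # assumed RKHS-norm bound
    scale = math.sqrt(2.0 * math.log(2.0 / beta)) * delta / alpha
    return [k_h(t - x, h) + scale * g for t, g in zip(grid, path)]
```

For very fine grids the Gram matrix becomes nearly singular, so the grid spacing should not be much smaller than the bandwidth in this simple sketch.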
Let us emphasize that the property of positive definiteness is not satisfied for all kernels commonly used for kernel density estimators in non-parametric statistics. In the following, we discuss some popular examples.
Example 2.10.
The rectangular kernel given by
[TABLE]
for is not positive definite. In order to see this, set , , , , and . Then
[TABLE]
which contradicts the defining property (5) of positive definite kernels.
Example 2.11.
The triangular kernel given by
[TABLE]
for is positive definite. This follows from the fact that kernels of the form
[TABLE]
for with square integrable are positive definite and
[TABLE]
Example 2.12.
The Gaussian kernel
[TABLE]
and the exponential kernel
[TABLE]
are positive definite. Kernels of the form $K(x, y) = \exp(-|x - y|^p)$ are positive definite if and only if $0 < p \leq 2$. This follows by combination of Theorem 2.2 and Exercise 2.13 (b) in [BCR84].
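Example 2.13.
The $\operatorname{sinc}$-kernel given by
$$K(x, y) = \frac{\sin(x - y)}{\pi (x - y)}$$
is positive semidefinite since the $\operatorname{sinc}$-function is the characteristic function of the uniform distribution on the interval $[-1, 1]$. The $\operatorname{sinc}$-kernel also attains negative values but, thanks to the estimate $|\sin(x)/x| \leq 1$, we have, in analogy to the calculation in Example 2.9,
$$\|K_h(\cdot - x) - K_h(\cdot - x')\|_{\mathcal{H}_{K_h}}^2 = 2\big(K_h(0) - K_h(x - x')\big) \leq 4 K_h(0) = \frac{4}{\pi h},$$
which yields a suitable bound for $\Delta$ in this example.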
Example 2.14.
The Epanechnikov kernel
[TABLE]
is not positive definite. In order to see this, put , , , and . Then,
[TABLE]
in contradiction to the defining property (5) of positive definite kernels.
Example 2.15.
The biweight kernel
[TABLE]
is not positive definite. To see this, put , , , and . Then, consider the matrix . We have
[TABLE]
and the matrix is not positive definite, since for
[TABLE]
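The positive definiteness claims of the preceding examples can be probed numerically by evaluating the quadratic form from the defining property (5) at chosen points and coefficients. The following Python sketch does this for three kernels under their usual normalizations; the points and coefficients are our own illustrative choices and need not coincide with those used in the examples, and a non-negative value for one configuration does not by itself prove positive definiteness.

```python
def rectangular(u):
    """Rectangular kernel: 1/2 on [-1, 1], zero elsewhere."""
    return 0.5 if abs(u) <= 1.0 else 0.0


def triangular(u):
    """Triangular kernel: 1 - |u| on [-1, 1], zero elsewhere."""
    return max(0.0, 1.0 - abs(u))


def epanechnikov(u):
    """Epanechnikov kernel: 3/4 * (1 - u^2) on [-1, 1], zero elsewhere."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def quadratic_form(kernel, points, coeffs):
    """Sum over i, j of c_i * c_j * kernel(x_i - x_j)."""
    return sum(ci * cj * kernel(xi - xj)
               for xi, ci in zip(points, coeffs)
               for xj, cj in zip(points, coeffs))


# A negative value certifies that the kernel is not positive definite.
rect_value = quadratic_form(rectangular, [0.0, 1.0, 2.0], [1.0, -1.0, 1.0])
epan_value = quadratic_form(epanechnikov, [0.0, 0.5, 1.0], [1.0, -1.5, 1.0])
tri_value = quadratic_form(triangular, [0.0, 0.5, 1.0], [1.0, -1.0, 1.0])
```

With these choices the rectangular and Epanechnikov forms are negative, while the triangular form stays positive, in line with the classification given in the examples.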
2.5. A composition lemma for approximate differential privacy
For kernel density estimation, bandwidth selection is usually a delicate issue and so it is in our local privacy setup. Whereas in the centralized setup existing methods can be applied by the trusted curator on the unmasked data, this is not possible in our local setup. Thus the data holders have to publish versions of the kernel density estimator for different bandwidths, and one has to adapt general strategies from the non-private framework to the one with local approximately differential private data. To do this under our privacy constraint it is necessary to understand how multiple outputs influence the defining condition of approximate differential privacy. The following lemma provides a result of this flavour and is known in the research literature on privacy for statistical databases. The setup is the following: Given the unmasked datum $X$, the data owner does not only want to publish $Z_1$ but also $Z_2$, i.e., the vector $(Z_1, Z_2)$. The following result tells us how $\alpha$ and $\beta$ for the single components have to be scaled in order to obtain $(\alpha,\beta)$-differential privacy for multiple outputs.
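Lemma 2.16 (Composition lemma for $(\alpha,\beta)$-differential privacy).
Let $Z_1$, $Z_2$ be $(\alpha_1,\beta_1)$- and $(\alpha_2,\beta_2)$-differential private and conditionally (on $X$) independent views of $X$, respectively. Then $(Z_1, Z_2)$ is an $(\alpha_1 + \alpha_2, \beta_1 + \beta_2)$-differential private view of $X$.
Of course, Lemma 2.16 can be successively applied. For instance, if we want to publish $Z_{i,h}$ from the above examples for different $h$ in a finite set $\mathcal{H}$, then $\alpha$ and $\beta$ should be replaced with $\alpha/|\mathcal{H}|$ and $\beta/|\mathcal{H}|$, respectively, in order to get $(\alpha,\beta)$-differential privacy for $(Z_{i,h})_{h \in \mathcal{H}}$.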
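The bookkeeping behind the successive application of the composition lemma is simple enough to sketch in Python. The snippet assumes that privacy parameters add up under composition of conditionally independent outputs and that an overall budget is split evenly; the names `alpha`, `beta` and both function names are hypothetical.

```python
def split_budget(alpha, beta, num_outputs):
    """Per-output privacy parameters when a total budget is split evenly."""
    return alpha / num_outputs, beta / num_outputs


def composed_budget(budgets):
    """Overall parameters of several conditionally independent private views."""
    return sum(a for a, _ in budgets), sum(b for _, b in budgets)


# Publishing one view per bandwidth in a grid of five candidate bandwidths.
per_view = split_budget(1.0, 0.1, 5)
total = composed_budget([per_view] * 5)
```

Since each view receives only a fraction of the budget, its noise level grows with the number of published bandwidths, which is the price paid for enabling data-driven bandwidth selection on the anonymized data.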
3. Private minimax estimation
Minimax theory provides a standard framework to study convergence properties of estimators in non-parametric statistics [Tsy09]. In this section, we apply this general toolbox to the specific case of density estimation under privacy constraints. For fixed $t \in \mathbb{R}$ and any estimator $\hat f(t)$ of the linear functional $f(t)$ based on the private views $Z_1, \ldots, Z_n$, we study its mean squared error
$$\mathbb{E}\big[(\hat f(t) - f(t))^2\big].$$
The guiding principle of minimax theory is to look for estimators that perform best in a worst-case scenario. However, due to the privacy framework, we have not only the freedom of choosing the estimator but also the privacy mechanism that generates the private outputs. Hence, following [DJW18], classical minimax theory has to be adapted and a natural quantity to consider is the private minimax risk
$$\inf_{Q} \inf_{\hat f} \sup_{f \in \mathcal{P}} \mathbb{E}\big[(\hat f(t) - f(t))^2\big],$$
where $\mathcal{P}$ is some function class containing probability densities and the infimum is taken over all local $(\alpha,\beta)$-differential private Markov kernels $Q$ and all estimators $\hat f(t)$ based on the local approximate differential private views $Z_1, \ldots, Z_n$ of the corresponding original sample $X_1, \ldots, X_n$. We specify the function class $\mathcal{P}$ by so-called Sobolev ellipsoids that we define for $s > 1/2$ and $L > 0$ by means of
$$\mathcal{S}(s, L) = \Big\{ f \text{ probability density} : \int |\mathcal{F}f(\omega)|^2\, |\omega|^{2s}\, d\omega \leq 2\pi L^2 \Big\},$$
which, for $s \in \mathbb{N}$, is equivalent to the definition
$$\mathcal{S}(s, L) = \Big\{ f \text{ probability density} : \int \big(f^{(s)}(x)\big)^2\, dx \leq L^2 \Big\}.$$
In the first definition, $\mathcal{F}f$ denotes the Fourier transform of the density $f$, in the second one $f^{(s)}$ denotes the weak $s$-th derivative of $f$.
3.1. Upper bound
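We first derive an upper bound on the minimax risk by specializing both the privacy mechanism and the estimator of $f(t)$. Concerning the privacy mechanism, we consider the mechanisms mapping $X_i$ to private views $Z_{i,h}$ of $K_h(\cdot - X_i)$ from Section 2 for one single bandwidth $h$. More precisely, we consider the Laplace mechanism given through
$$Z_{i,h} = K_h(t - X_i) + \frac{2\|K\|_\infty}{h(\alpha - \log(1-\beta))}\, \xi_i, \qquad \xi_i \text{ i.i.d. } \sim \mathcal{L}(1), \tag{7}$$
and the Gaussian process mechanism given through
$$Z_{i,h}(\cdot) = K_h(\cdot - X_i) + \frac{c(\beta)\Delta}{\alpha}\, G_{i,h}(\cdot), \tag{8}$$
where $G_{1,h}, \ldots, G_{n,h}$ are i.i.d. centred Gaussian processes with covariance kernel $K_h$ and $\Delta$ is an upper bound on $\|K_h(\cdot - x) - K_h(\cdot - x')\|_{\mathcal{H}_{K_h}}$ for $x, x' \in \mathbb{R}$. Given $Z_{1,h}, \ldots, Z_{n,h}$ as in (7) or (8), a natural estimator of $f(t)$ is given by
$$\hat f_h(t) = \frac{1}{n} \sum_{i=1}^{n} Z_{i,h}(t), \tag{9}$$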
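The averaging of private views can be tried out in a small simulation. The Python sketch below uses the Laplace mechanism with a Gaussian kernel (chosen for simplicity, whereas the risk bound that follows is stated for the sinc-kernel), standard normal data, and the assumed noise scale 2 sup|K| / (h (alpha - log(1 - beta))); the names `alpha`, `beta` and all function names are hypothetical.

```python
import math
import random


def gaussian_kernel(u):
    """Standard Gaussian density, used here as the smoothing kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def private_view(x, t, h, alpha, beta, rng):
    """One data holder's Laplace-perturbed value of K_h(t - x)."""
    sup_k = gaussian_kernel(0.0)
    b = 2.0 * sup_k / (h * (alpha - math.log(1.0 - beta)))  # assumed scale
    noise = b * (rng.expovariate(1.0) - rng.expovariate(1.0))
    return gaussian_kernel((t - x) / h) / h + noise


def private_average_estimate(n, t, h, alpha, beta, rng):
    """Average of n private views; each holder draws one N(0, 1) observation."""
    views = [private_view(rng.gauss(0.0, 1.0), t, h, alpha, beta, rng)
             for _ in range(n)]
    return sum(views) / n


estimate = private_average_estimate(20000, 0.0, 0.5, 1.0, 0.0, random.Random(42))
```

Because the added noise is centred, the average concentrates around the expectation of the non-private kernel estimator, here the N(0, 1 + h^2) density at 0, at the cost of a larger variance.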
The following proposition provides an upper risk bound for this estimator specialized with the $\operatorname{sinc}$-kernel over the Sobolev ellipsoids introduced above.
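Proposition 3.1.
Consider the kernel density estimator $\hat f_h(t)$ for some fixed $t \in \mathbb{R}$ where the kernel used in the anonymization procedure (7) or (8) is the $\operatorname{sinc}$-kernel from Example 2.13. Then, for any $h > 0$,
$$\sup_{f \in \mathcal{S}(s, L)} \mathbb{E}\big[(\hat f_h(t) - f(t))^2\big] \leq C \Big( h^{2s-1} + \frac{1}{nh} + \frac{1}{nh^2} \Big)$$
for some constant $C > 0$ that does not depend on $n$ or $h$. In particular, setting $h = h_n$ with $h_n \asymp n^{-1/(2s+1)}$, we obtain
$$\sup_{f \in \mathcal{S}(s, L)} \mathbb{E}\big[(\hat f_{h_n}(t) - f(t))^2\big] \lesssim n^{-(2s-1)/(2s+1)}.$$
Since the noise added by the privacy mechanisms is centred, the bias term in the proof of Proposition 3.1 remains unchanged in comparison to the standard setup without privacy constraints. However, the variance term changes due to the additional Laplace or Gaussian noise, respectively, and the classical variance term $1/(nh)$ is joined by the additional term $1/(nh^2)$ which is of higher order for $h \to 0$. Consequently, the optimal bandwidth is no longer of order $n^{-1/(2s)}$ as in the standard setup but of the larger order $n^{-1/(2s+1)}$. However, consistency of $\hat f_h(t)$ is already guaranteed if $h \to 0$ and simultaneously $nh^2 \to \infty$ (in the standard density estimation setup one only needs $nh \to \infty$ in addition to $h \to 0$).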
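The effect of the additional variance term on the optimal bandwidth can be checked numerically. The Python sketch below minimizes a risk bound of the assumed shape h^(2s-1) + 1/(n h) + 1/(n h^2) (squared bias, classical variance, privacy noise, all constants set to one) over a logarithmic grid and estimates how the minimizer scales with n; the function names and the grid are hypothetical.

```python
import math


def risk_bound(h, n, s):
    """Assumed shape of the risk bound with all constants set to one."""
    return h ** (2 * s - 1) + 1.0 / (n * h) + 1.0 / (n * h * h)


def best_bandwidth(n, s, grid_size=2000):
    """Minimizer of the risk bound over a log-spaced grid in [exp(-12), 1]."""
    candidates = [math.exp(-12.0 * k / grid_size) for k in range(grid_size + 1)]
    return min(candidates, key=lambda h: risk_bound(h, n, s))


def bandwidth_exponent(s, n_small=1e6, n_large=1e10):
    """Empirical slope of log(best bandwidth) against log(n)."""
    h_small, h_large = best_bandwidth(n_small, s), best_bandwidth(n_large, s)
    return (math.log(h_large) - math.log(h_small)) / (math.log(n_large) - math.log(n_small))
```

For s = 2 the empirical slope is close to -1/5 = -1/(2s+1) rather than the non-private value -1/(2s) = -1/4, reflecting the larger bandwidth that balances bias against the privacy noise.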
3.2. Lower bound
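The following result states a lower bound over Sobolev ellipsoids in the case of pure differential privacy ($\beta = 0$).
Proposition 3.2.
Let $t \in \mathbb{R}$ be arbitrary. Then,
$$\inf_{Q} \inf_{\hat f} \sup_{f \in \mathcal{S}(s, L)} \mathbb{E}\big[(\hat f(t) - f(t))^2\big] \geq c \cdot n^{-(2s-1)/(2s+1)},$$
where $c > 0$ depends on the privacy parameter $\alpha$, and the infimum is taken over all estimators $\hat f$ based on private views $Z_1, \ldots, Z_n$ and privacy mechanisms $Q$ providing $\alpha$-differential privacy.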
Remark 3.3.
The lower bound of Proposition 3.2 still holds true when one allows a slight amount of interaction between the data holders, namely when the distribution of every $Z_i$ is determined by $X_i$ and the previously masked values $Z_1, \ldots, Z_{i-1}$. The proof remains the same because the data processing inequality (14) from [DJW18] still holds true in this more general setup.
Proposition 3.2 shows that, regarding the privacy parameter as an a priori fixed constant, the estimators from Proposition 3.1 attain the optimal rate in terms of $n$ under pure local differential privacy. Recall that without privacy restrictions the optimal rate over Sobolev ellipsoids is $n^{-(2s-1)/(2s)}$ (as mentioned in [But01], this rate can, other than by a reduction scheme as used in our proof, be easily obtained via the theory developed in [DL92], see also [Tsy98]). In this work, we consider the parameters $\alpha$, $\beta$ as fixed and are interested in the behaviour of the rate as a function of $n$ only, but remarks concerning $\alpha$ analogous to the ones made in [But+19] could be made (as in that paper, $\alpha$ and $\beta$ could also be allowed to vary with $n$). The optimal behaviour, however, of the rates in terms of the privacy parameters $\alpha$ and $\beta$, especially if $\beta > 0$, remains an open issue.
4. Adaptation to unknown smoothness
The estimators of the previous section are not completely satisfying since the optimal choice of the bandwidth, as usual in non-parametric statistics, depends on a priori knowledge of the smoothness of the unknown function $f$. Such knowledge is usually not available in practice. At least, using the Gaussian process perturbation approach we relieved ourselves from the drawback of the Laplace method that one can privatize only one functional of the form $K_h(t - X_i)$ for one single $t$ that has to be fixed even before the anonymization. Note that this drawback is, for instance, also present in the mechanisms suggested in [RS18]. From this point of view, anonymization of the whole kernel density estimator via this approach should be preferred.
The purpose of this section is to address the remaining issue of adapting to the unknown smoothness of $f$. In order to tackle this problem, we use a variant of Lepski’s method (see [LS97] for a general account in the Gaussian white noise model, and [Cav01] for an application to a tomography problem whose concise presentation has inspired ours). Recall again that the necessity of novel methodology for adaptive estimation is specific to the setup of local privacy since in the global case the trusted curator can choose the bandwidth in an adaptive way using all the data and, as a consequence, can build on the existing plethora of methods and theoretical results for this standard case; hence bandwidth selection does not provide any additional difficulty for centralized privacy since only the final output is anonymized. In our local setup, where the data owners publish their data prior to any data analysis, adaptation must be addressed separately. Note that the problem of adaptation has, to the best of the author’s knowledge, only been addressed in the recent work [But+19] so far, where the authors use wavelet estimators for density estimation on a compact interval. The approach in that paper is thus conceptually different from the one presented in the sequel.
We will apply Lepski’s method both on observations (7) where $t$ has been fixed a priori and on pathwise observations (8) from the Gaussian process approach that we evaluate at the point $t$ of interest. In order to apply Lepski’s method, the observations (7) and (8) must be available for different values of the bandwidth parameter $h$, say $h \in \mathcal{H}$. This can be realized using Lemma 2.16 provided that the privacy parameters $\alpha$ and $\beta$ are appropriately scaled. Thus, we can assume that $Z_{i,h}$ are accessible for any $i \in \{1, \ldots, n\}$ and $h \in \mathcal{H}$ if we replace $\alpha$ and $\beta$ by $\alpha/|\mathcal{H}|$ and $\beta/|\mathcal{H}|$, respectively. For any $h \in \mathcal{H}$ and $t \in \mathbb{R}$, we can then consider the estimator $\hat f_h(t)$ defined in (9). In our case, we define the set $\mathcal{H}$ of potential bandwidths by a geometric grid,
[TABLE]
where is a fixed constant, is such that , and satisfies . For and some , define333In the sequel, we write for both and .
[TABLE]
where is defined as in Section 3. The proof of Proposition 3.1 shows that
[TABLE]
if . Put with being a sufficiently large constant (an explicit value can be determined from the proof of Theorem 4.3) and define
[TABLE]
If the set in the definition of is empty, we set by convention. However, in the proof of Proposition 4.1 we will show that this set is non-empty for large enough. The bandwidth is an oracle in the sense that it is not accessible to the statistician since it depends on the unknown parameter . The definition of provides some kind of ideal criterion: The bandwidth is increased along the grid as long as the bias term is bounded by the ’rate’ , a procedure that aims at mimicking the classical bias-variance tradeoff. In order to state a risk bound for the pseudo estimator , we further define
[TABLE]
Proposition 4.1.
Consider the pseudo-estimator defined via (9) and (10) where and are replaced with and , respectively. Assume that
[TABLE]
Consider . Then, for sufficiently large,
[TABLE]
uniformly for all with .
Remark 4.2.
Assumption (11) is satisfied in many cases. For instance, if , then (11) is a special case of Bochner’s lemma (see [Tsy04], Lemma 1.1). However, the -kernel is not absolutely integrable and thus Bochner’s lemma cannot be applied. In this case, one can alternatively assume that belongs at least to some Sobolev space for some . Then, the analysis of the bias term as in the proof of Proposition 3.1 guarantees the validity of (11).
The pseudo estimator is a stopover on our road to an adaptive estimator. We now construct a genuine estimator of that aims at mimicking this oracle. For this, we first define
[TABLE]
Then, calculations similar to those in the proof of Proposition 3.1 show that
[TABLE]
if . For , put
[TABLE]
Then, we define an adaptive choice of the bandwidth parameter by
[TABLE]
This choice of the bandwidth is well-defined since the maximum is taken over a non-empty set. The definition of is characteristic of Lepski’s method [Lep90], and the motivation of this procedure is neatly described in [Cav01], p. 67: One chooses the largest bandwidth such that the difference between the two estimators and is not too large (in the sense of (12)) for all . Evidently, the aim of this procedure is to mimic the trade-off between squared bias and variance in a purely data-driven manner. Note also that (12), like the oracle version (10), provides a local choice of the bandwidth in the sense that depends on . Such a local criterion might result in a better adaptation to spatial inhomogeneity of the target density than global selection rules.
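The selection rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's exact rule (12): the function name `lepski_select`, the squared-difference test, and the per-bandwidth `thresholds` array are hypothetical stand-ins for the quantities defined in the displays, and the estimates are assumed to be sorted by increasing bandwidth.

```python
import numpy as np

def lepski_select(estimates, thresholds):
    """Return the index of the largest bandwidth passing a Lepski-type test.

    estimates[j]  : estimate at the point of interest for bandwidth h_j,
                    with bandwidths sorted in increasing order.
    thresholds[j] : hypothetical threshold used when comparing against h_j
                    (stands in for a variance proxy times a log factor).
    """
    estimates = np.asarray(estimates, dtype=float)
    thresholds = np.asarray(thresholds, dtype=float)
    chosen = 0  # the smallest bandwidth is always admissible (no comparisons)
    for j in range(len(estimates)):
        # Compare the candidate h_j with every smaller bandwidth h_i, i < j.
        diffs = (estimates[j] - estimates[:j]) ** 2
        if np.all(diffs <= thresholds[:j]):
            chosen = j  # keep the largest admissible index seen so far
    return chosen

# Toy input: the estimate for the largest bandwidth departs from the others.
selected = lepski_select([1.0, 1.05, 1.1, 2.0], [0.05, 0.05, 0.05, 0.05])
print("selected bandwidth index:", selected)
```

Because the smallest bandwidth involves no comparison, the admissible set in this sketch is never empty, which mirrors the remark that the maximum is taken over a non-empty set.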
Theorem 4.3.
Consider the estimator defined via (9) and (12) where for are defined via (7) or (8) with and replaced with and , respectively. Then, uniformly for all with ,
[TABLE]
As a consequence, taking , we obtain
[TABLE]
Remark 4.4.
By specifying Theorem 4.3 with the -kernel and , one obtains an adaptive estimator attaining the optimal rate of convergence over functions bounded by in a Sobolev ellipsoid up to an extra logarithmic factor. A logarithmic loss for adaptation is commonly accepted and even known to be indispensable for pointwise estimation in the non-private framework [BL96].
5. Discussion
We have suggested an approach to adaptive kernel density estimation via Lepski’s method in the framework of local approximate differential privacy. Although we have studied its theoretical properties only in the prototypical example of univariate density estimation, our methodology should also be transferable to the multivariate case. We also conjecture that it might be possible to extend our results to the case of general linear functionals (different from pointwise evaluation of the density function at a fixed point) as investigated in [GP00] via Lepski’s method in an inverse problem setup. Furthermore, our methodology might be applicable to obtain locally private estimation procedures in functional data analysis. However, many questions remain open: One drawback of our approach is that the perturbation by a Gaussian process provides only approximate differential privacy and cannot be extended to pure differential privacy. The creation of new methods for kernel estimators that overcome this restriction provides a further direction for future research. Moreover, the optimal power of the logarithmic factor in the adaptive rate of convergence deserves further investigation, as does the behaviour of the minimax optimal rates in terms of the privacy parameters and .
Appendix A Proofs of Section 2
A.1. Proof of Proposition 2.2
Let be arbitrary. It has to be shown that
[TABLE]
for any . By the triangle inequality this holds true if
[TABLE]
and the latter holds true as soon as which is equivalent to .
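The triangle-inequality argument can be checked numerically. The sketch below assumes that the mechanism in question adds Laplace noise to a value lying in a bounded interval; the concrete numbers (`alpha`, `sensitivity`, the two inputs `-1.0` and `1.0`) are illustrative choices and do not come from the paper. It relies only on the standard fact that for Laplace noise with scale b, the log-ratio of the output densities for two inputs x and x' is bounded by |x - x'| / b.

```python
import numpy as np

def laplace_density(z, loc, scale):
    """Density of the Laplace distribution with location loc and scale scale."""
    return np.exp(-np.abs(z - loc) / scale) / (2.0 * scale)

def max_abs_log_ratio(x, x_prime, scale, grid):
    """Largest absolute log-ratio of the two output densities over the grid."""
    ratio = laplace_density(grid, x, scale) / laplace_density(grid, x_prime, scale)
    return float(np.max(np.abs(np.log(ratio))))

alpha = 0.5                # illustrative privacy parameter
sensitivity = 2.0          # illustrative: inputs range over [-1, 1]
scale = sensitivity / alpha
grid = np.linspace(-50.0, 50.0, 20001)

# Worst case: the two inputs sit at opposite ends of the admissible range.
worst = max_abs_log_ratio(-1.0, 1.0, scale, grid)
print("largest |log density ratio|:", worst, "privacy parameter:", alpha)
```

With the scale set to sensitivity divided by alpha, the largest log-ratio over the grid should coincide with alpha up to floating-point error, which is the content of the density-ratio bound.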
A.2. Proof of Proposition 2.4
We have to show that
[TABLE]
for all . This condition is satisfied if the set where the ratio exceeds has probability bounded by under . We have
[TABLE]
and the condition is equivalent to
[TABLE]
which in turn can be reformulated as
[TABLE]
Set and let denote a distributed random variable where denotes the -dimensional identity matrix. Then
[TABLE]
where is a univariate standard Gaussian random variable. We now use the standard estimate whose right-hand side is smaller than if . We apply this estimate with , and thus if
[TABLE]
and this holds at least if
[TABLE]
A.3. Proof of Lemma 2.16
Let be a measurable set. Denote which is measurable. By Cavalieri’s principle and the independence assumption
[TABLE]
Now put . Then since otherwise there would be a contradiction to approximate differential privacy. Hence,
[TABLE]
which shows the assertion.
Appendix B Proofs of Section 3
B.1. Proof of Proposition 3.1
The bias-variance decomposition for the estimator is
[TABLE]
where . We begin with the analysis of the bias. First recall that
[TABLE]
and due to centredness of the error added by the privacy mechanism
[TABLE]
Thus, using that , we obtain
[TABLE]
Let us now consider the variance, where we have to distinguish between the case of Laplace mechanism and Gaussian mechanism. We denote
[TABLE]
For the Laplace mechanism, we have by denoting that
[TABLE]
In a similar fashion, for the Gaussian mechanism, now letting , we have
[TABLE]
The statement of the proposition follows now by combining the obtained bounds for squared bias and variance.
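The decomposition into bias and the variance added by the privacy mechanism can be illustrated with a small simulation. This is a sketch under assumptions that are not taken from the paper: a Gaussian kernel, standard normal data, the Laplace mechanism with scale equal to the kernel supremum divided by alpha, and the hypothetical function name `private_kde_at_point`.

```python
import numpy as np

def private_kde_at_point(samples, x, h, alpha, rng):
    """Average of individually privatized Gaussian-kernel evaluations at x."""
    kernel_vals = np.exp(-0.5 * ((x - samples) / h) ** 2) / (h * np.sqrt(2 * np.pi))
    k_sup = 1.0 / (h * np.sqrt(2 * np.pi))  # supremum of the scaled kernel
    # Each owner's value lies in (0, k_sup], so Laplace noise with scale
    # k_sup / alpha makes a single release alpha-differentially private.
    noise = rng.laplace(loc=0.0, scale=k_sup / alpha, size=samples.shape)
    return float(np.mean(kernel_vals + noise))

rng = np.random.default_rng(0)
n, h, alpha, x0, reps = 2000, 0.3, 1.0, 0.0, 200
estimates = np.array([
    private_kde_at_point(rng.standard_normal(n), x0, h, alpha, rng)
    for _ in range(reps)
])

# Expectation of the estimator: N(0, 1) density smoothed by the kernel, at x0 = 0.
smoothed_density = 1.0 / np.sqrt(2 * np.pi * (1 + h**2))
# Variance contributed by the Laplace noise alone: Var(Laplace(b)) = 2 b^2.
k_sup = 1.0 / (h * np.sqrt(2 * np.pi))
noise_variance = 2 * (k_sup / alpha) ** 2 / n

print("mean of estimates:", estimates.mean(), "target:", smoothed_density)
print("empirical variance:", estimates.var(), "noise part:", noise_variance)
```

Since the added noise is centred, the average of the private estimates should match the smoothed density, so the bias is the same as without privacy. In this configuration the empirical variance is dominated by the Laplace term, which grows as the bandwidth shrinks because the kernel supremum scales like 1/h.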
B.2. Proof of Proposition 3.2
Let , be arbitrary as in the statement of the proposition. Define via . Let , be two functions in (to be specified later on) such that . Using a general reduction argument (see [Tsy09], Section 2.2) it can be shown that
[TABLE]
where the infimum is taken over all -valued test functions based on the observations and denotes the distribution of if the true density of is . In view of [Tsy09], Theorem 2.2, Statement (iii), the claim follows if we can choose the functions and such that
- (1) ,
- (2) , and
- (3) for some independent of .
To construct such we use ideas from Section 6 of [But01] and refer to this paper also for some of the computations. First, take a strictly positive probability density on that is infinitely often continuously differentiable. Setting , we can further assume that . Then, for , define the function by
[TABLE]
In order to define the second hypothesis we consider the auxiliary function as introduced on p. 26 of [But01] (its construction in that paper is borrowed from [Tsy98]). In particular, note that is compactly supported and satisfies (thus ) and . Set , and put
[TABLE]
for some constant . Defining , set
[TABLE]
We now check conditions 1–3 from above.
Verification of 1:
The proof follows step by step along the lines of the one in [But01] and we omit the details. We only record the fact that
[TABLE]
which will be used below.
Verification of 2:
We have
[TABLE]
Now, since and , the last expression inside the outer absolute values is greater than for sufficiently large , say . Hence for ,
[TABLE]
which is the desired bound.
Verification of 3:
By Equation (14) in [DJW18] we have
[TABLE]
Now
[TABLE]
for sufficiently large. Thus, by (14), for sufficiently large
[TABLE]
Appendix C Proofs of Section 4
C.1. Proof of Proposition 4.1
Under Assumption (11), we have that converges to zero as . Let . By definition of , and (since ),
[TABLE]
hence , and the set in the definition of is non-empty provided that is sufficiently large. Now, the bias-variance decomposition of the pseudo estimator is
[TABLE]
Let now be the minimizer in the definition of . We distinguish the cases and . First, if , then
[TABLE]
If , then by the very definition of we obtain
[TABLE]
and thus also in this case.
C.2. Proof of Theorem 4.3
We consider the risk decomposition
[TABLE]
and study the two terms on the right-hand side separately.
Analysis of the first term (Case ). Note that the quantities satisfy and for . Thus, using the inequality , we have for that
[TABLE]
By the definition of and , we obtain
[TABLE]
Hence (recall that we denote ),
[TABLE]
where we used the bound for the term and the definition of to bound the term .
Analysis of the second term (Case ). For with , set
[TABLE]
Let in . Then, by definition of ,
[TABLE]
and thus
[TABLE]
We obtain
[TABLE]
By definition of , for all with , it holds
[TABLE]
Now, for ,
[TABLE]
where . Note that and . Now, by the Cauchy-Schwarz inequality,
[TABLE]
For the first term in the sum, we have
[TABLE]
Putting , we have
[TABLE]
On the one hand,
[TABLE]
on the other hand
[TABLE]
Hence,
[TABLE]
Moreover, for ,
[TABLE]
by the very definition of . Thus, altogether,
[TABLE]
and by the monotonicity of and , for
[TABLE]
Write where and with i.i.d. or with i.i.d. for . We have
[TABLE]
Consider first. By Bernstein’s inequality (see Lemma D.1) with ,
[TABLE]
Note that
[TABLE]
For any and large enough, it holds
[TABLE]
Thus
[TABLE]
For the probability in terms of , we consider now the Gaussian case first. Using standard concentration results for the Gaussian distribution, we obtain
[TABLE]
where and denotes the variance of the Gaussian random variable . Then,
[TABLE]
Thus,
[TABLE]
Combining (15) and (16), we obtain for the Gaussian case
[TABLE]
and we denote .
Let us now consider the probability in terms of for the Laplace case, which is slightly more involved since the sum of two Laplace random variables is no longer Laplace distributed. We decompose
[TABLE]
Consider only the first probability on the right-hand side, the bound for the second one following analogously. By Bernstein’s inequality (see Lemma D.1, take the version with control on the moments applied with , and )
[TABLE]
and hence by using ,
[TABLE]
Finally, we obtain with that
[TABLE]
in the Laplace case. Note that
[TABLE]
for both cases with different choices of . Now,
[TABLE]
For sufficiently small [Footnote 4: Our calculations show that also has to satisfy . Such a choice is possible whenever , which holds for large enough.], we have
[TABLE]
Recall that . Thus,
[TABLE]
The sums on the right-hand side converge and the bound for the case is negligible with respect to the upper bound .
Appendix D Bernstein inequality
The following version of the Bernstein inequality is taken from [Com15].
Lemma D.1.
Let be i.i.d. random variables and put . Then, for any ,
[TABLE]
where and (or ).
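As a sanity check on how such a concentration bound behaves, the following sketch compares an empirical tail frequency with the textbook Bernstein bound for bounded, centred i.i.d. variables, P(|mean| >= eps) <= 2 exp(-n eps^2 / (2 (v^2 + b eps / 3))). This textbook form is an assumption made for illustration; the constants in the version taken from [Com15] may differ. The uniform distribution and all numerical values are illustrative choices.

```python
import numpy as np

def bernstein_bound(n, eps, v2, b):
    """Textbook Bernstein bound for the mean of n centred i.i.d. variables
    with |X| <= b and Var(X) <= v2 (not necessarily the form in [Com15])."""
    return 2.0 * np.exp(-n * eps**2 / (2.0 * (v2 + b * eps / 3.0)))

rng = np.random.default_rng(1)
n, reps, eps = 200, 20000, 0.15

# Centred uniform variables on [-1, 1]: |X| <= 1 and Var(X) = 1/3.
x = rng.uniform(-1.0, 1.0, size=(reps, n))
empirical = float(np.mean(np.abs(x.mean(axis=1)) >= eps))
bound = bernstein_bound(n, eps, v2=1.0 / 3.0, b=1.0)

print("empirical tail frequency:", empirical)
print("Bernstein bound:", bound)
```

The bound is not tight in this example, which is typical: it holds for every distribution satisfying the boundedness and variance constraints, not just the uniform one.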
- [BCR84] Christian Berg, Jens Peter Reus Christensen and Paul Ressel “Harmonic analysis on semigroups” Theory of positive definite and related functions, Graduate Texts in Mathematics 100, Springer-Verlag, New York, 1984, pp. x+289. DOI: 10.1007/978-1-4612-1128-0
- [BD14] Rina Foygel Barber and John C. Duchi “Privacy and Statistical Risk: Formalisms and Minimax Bounds” In arXiv-preprint, available at https://arxiv.org/abs/1412.4451v1, 2014
- [BL96] Lawrence D. Brown and Mark G. Low “A constrained risk inequality with applications to nonparametric functional estimation” In Ann. Statist. 24.6, 1996, pp. 2524–2535. DOI: 10.1214/aos/1032181166
- [BT04] Alain Berlinet and Christine Thomas-Agnan “Reproducing kernel Hilbert spaces in probability and statistics” With a preface by Persi Diaconis, Kluwer Academic Publishers, Boston, MA, 2004, pp. xxii+355. DOI: 10.1007/978-1-4419-9096-9
- [But+19] Cristina Butucea, Amandine Dubois, Martin Kroll and Adrien Saumard “Local differential privacy: Elbow effect in optimal density estimation and adaptation over Besov ellipsoids” In arXiv-preprint, available at http://arxiv.org/abs/1903.01927, 2019
- [But01] Cristina Butucea “Exact adaptive pointwise estimation on Sobolev classes of densities” In ESAIM Probab. Statist. 5, 2001, pp. 1–31. DOI: 10.1051/ps:2001100
- [Cav01] Laurent Cavalier “On the problem of local adaptive estimation in tomography” In Bernoulli 7.1, 2001, pp. 63–78. DOI: 10.2307/3318602
- [Com15] Fabienne Comte “Estimation non-paramétrique” Paris: Spartacus, 2015
