This paper introduces a scalable method for high-dimensional nonparametric density estimation under symmetry constraints, achieving optimal risk bounds and strong finite-sample performance.
Contribution
It proposes the $K$-homothetic log-concave maximum likelihood estimator, providing risk bounds and adaptivity results for high-dimensional density estimation with symmetry assumptions.
Findings
01
Risk bound of $O(n^{-4/5})$ independent of dimension
02
Estimator adapts to special density forms for near-parametric rates
03
Algorithms are efficient for large-scale high-dimensional data
Full text
High-dimensional nonparametric density estimation via symmetry and shape constraints
We tackle the problem of high-dimensional nonparametric density estimation by taking the class of log-concave densities on $\mathbb{R}^p$ and incorporating within it symmetry assumptions, which facilitate scalable estimation algorithms and can mitigate the curse of dimensionality. Our main symmetry assumption is that the super-level sets of the density are $K$-homothetic (i.e. scalar multiples of a convex body $K \subseteq \mathbb{R}^p$). When $K$ is known, we prove that the $K$-homothetic log-concave maximum likelihood estimator based on $n$ independent observations from such a density has a worst-case risk bound with respect to, e.g., squared Hellinger loss, of $O(n^{-4/5})$, independent of $p$. Moreover, we show that the estimator is adaptive in the sense that if the data generating density admits a special form, then a nearly parametric rate may be attained. We also provide worst-case and adaptive risk bounds in cases where $K$ is only known up to a positive definite transformation, and where it is completely unknown and must be estimated nonparametrically. Our estimation algorithms are fast even when $n$ and $p$ are on the order of hundreds of thousands, and we illustrate the strong finite-sample performance of our methods on simulated data.
1 Introduction
Density estimation emerged as one of the fundamental challenges in Statistics very soon after its inception as a field. Up until halfway through the last century, approaches based on parametric (often Gaussian) assumptions or histograms/contingency tables were dominant (Fisher, 1922, 1925). However, the restrictions of these techniques have led, since the 1950s, to an enormous research effort devoted to exploring nonparametric methods, primarily based on smoothness assumptions, but also on shape constraints. These include kernel density estimation (Rosenblatt, 1956; Wand and Jones, 1995), wavelets (Donoho et al., 1996) and other orthogonal series methods, splines (Gu and Qiu, 1993), as well as techniques based on monotonicity (Grenander, 1956), log-concavity (Cule et al., 2010) and others. Although highly successful for low-dimensional data, these approaches all encounter two serious difficulties in moderate- or high-dimensional regimes: first, theoretical performance is limited by minimax lower bounds that characterise the ‘curse of dimensionality’ (e.g. Ibragimov and Khasminskii, 1983); and second, computational issues may become a bottleneck, often exacerbated by the need to choose (multiple) smoothing parameters.
In parallel to these developments, modern technology now allows the routine collection of extremely high-dimensional data sets, leading to a great demand for reliable and scalable density estimation algorithms. To emphasise the challenge here, let $\mathcal{F}_p$ denote the class of upper semi-continuous, log-concave densities on $\mathbb{R}^p$, and let $X_1,\ldots,X_n$ be independent and identically distributed random vectors with density $f_0 \in \mathcal{F}_p$. Kim and Samworth (2016) proved that for each $p \in \mathbb{N}$, there exists $c_p > 0$ (in fact, just prior to completion of this work, Dagan and Kur (2019) showed that $c_p$ may be chosen independent of $p$) such that
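$$\inf_{\tilde f_n} \sup_{f_0 \in \mathcal{F}_p} \mathbb{E}\, d_H^2(\tilde f_n, f_0) \geq \begin{cases} c_p\, n^{-4/5} & \text{if } p = 1, \\ c_p\, n^{-2/(p+1)} & \text{if } p \geq 2, \end{cases}$$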
where $d_H^2(f,g) := \int_{\mathbb{R}^p} (f^{1/2} - g^{1/2})^2$ denotes the squared Hellinger distance between densities $f$ and $g$ on $\mathbb{R}^p$, and where the infimum is taken over all estimators $\tilde f_n$ of $f_0$ based on $X_1,\ldots,X_n$. This suggests that very large sample sizes would be required for an adequate approximation to the true density, even for $p = 5$. In view of these fundamental theoretical limitations, it is natural to consider imposing additional structure on the problem, while simultaneously seeking to retain the desirable flexibility of the nonparametric paradigm.
In this paper, we propose a new method for high-dimensional, nonparametric density estimation by incorporating symmetry constraints into the shape-constrained class. We demonstrate that this approach facilitates efficient algorithms that in some cases can even evade the curse of dimensionality in terms of its rate of convergence. The particular type of symmetry constraint that we consider is what we call homotheticity, where the super-level sets of the density are scalar multiples of each other. Thus, any elliptically symmetric density, for instance, is homothetic, but the class is of course much broader than this. We combine homotheticity with the shape constraint of log-concavity, which in particular ensures that the super-level sets are convex, compact sets, to yield a flexible yet practical class that facilitates nonparametric density estimation even in moderate or high dimensional problems.
To introduce our contributions, let $\mathcal{K}$ denote the set of compact, convex sets $K \subseteq \mathbb{R}^p$ containing $0$ as an interior point, and let $\Phi$ denote the set of upper semi-continuous, decreasing, concave functions $\phi : [0,\infty) \to [-\infty,\infty)$. In Section 2, we show that we can write any homothetic log-concave density $f$ as $f(\cdot) = e^{\phi(\|\cdot - \mu\|_K)}$ for some super-level set $K \in \mathcal{K}$, centering vector $\mu \in \mathbb{R}^p$ and generator $\phi \in \Phi$, where $\|\cdot\|_K$ denotes the Minkowski functional with respect to $K$, whose definition we recall in (2) below. We thus write $\mathcal{F}_p^{\mathcal{K}}$ for the class of homothetic log-concave densities and, for fixed $K$ and $\mu$, write $\mathcal{F}_p^{K,\mu}$ for the subclass of $\mathcal{F}_p^{\mathcal{K}}$ consisting of homothetic log-concave densities with super-level set $K$ and centering vector $\mu$. Writing $\mathcal{P}_p$ for the class of probability distributions $P$ on $\mathbb{R}^p$ with finite mean $\mu \in \mathbb{R}^p$ and $P(\{\mu\}) < 1$, we further prove that for fixed $K$ and $\mu$, there exists a well-defined homothetic, log-concave projection $\psi^* \equiv \psi^*_{K,\mu} : \mathcal{P}_p \to \mathcal{F}_p^{K,\mu}$. Thus, if $X_1,\ldots,X_n \overset{\mathrm{iid}}{\sim} P \in \mathcal{P}_p$, with empirical distribution $\mathbb{P}_n$, then a natural estimator of $\psi^*(P)$ is given by $\hat f_n := \psi^*(\mathbb{P}_n)$. In particular, if $P$ has a density $f_0 \in \mathcal{F}_p^{K,\mu}$, then $\psi^*(P) = f_0$, and the first main aim of our theoretical contribution is to study the performance of $\hat f_n$ as an estimator of $f_0$. On the other hand, if $K$ and $\mu$ are unknown, then we investigate a computationally efficient plug-in approach where we first construct estimators $\hat K$ of $K$ and $\hat\mu$ of $\mu$, and then (with a slight abuse of notation) compute $\hat f_n := \psi^*_{\hat K,\hat\mu}(\mathbb{P}_n)$.
To this end, we focus in Section 3 on the case where K and μ are assumed to be known and where an attractive feature of our estimator is that, even in high-dimensional problems, it does not require the choice of any tuning parameters. Our results on the theoretical performance of f^n are presented in terms of the divergence measure
[TABLE]
We show in Proposition 5(iv) that dX2(f^n,f0)≥∫Rpf^nlog(f^n/f0)=:dKL2(f^n,f0), so that our upper bounds on EdX2(f^n,f0) immediately yield the same upper bounds on the expected Kullback–Leibler divergence (as well as the risks in the squared total variation and squared Hellinger distances, for instance). One of our main results in this section (Theorem 7) is that, if X1,…,Xn∼iidf0∈FpK,μ, then there exists a universal constant C>0 such that
$$\mathbb{E}\, d_X^2(\hat f_n, f_0) \leq C n^{-4/5}.$$
Thus, there is no dependence on p (or K) in this worst case risk bound, which is verified by our empirical studies in Section 6. We also elucidate the adaptation behaviour of f^n. More precisely, for k∈N, we let Φ(k) denote the set of ϕ∈Φ that are piecewise linear on dom(ϕ):={r:ϕ(r)>−∞}, with at most k linear pieces. In Theorem 8, we show that if f0∈FpK,μ is of the form f0(⋅)=eϕ0(∥⋅−μ∥K) for some ϕ0∈Φ(k), then
[TABLE]
for a universal constant C>0. This result reveals that f^n adapts to densities f0 whose corresponding ϕ0 is k-affine with k not too large; in particular, an almost parametric rate can be attained for small k. In fact, we prove a stronger, oracle inequality, version of (1), where f0∈FpK,μ need only be close to a density of the form eϕ(∥⋅−μ∥K) for some ϕ∈Φ(k), and the bound incurs an additional approximation error term.
In Section 4, we consider the case where the super-level set $K$ and the centering vector $\mu$ are unknown. We first obtain a general purpose bound on the squared Hellinger risk of $\hat f_n$ for arbitrary estimators $\hat K$ and $\hat\mu$ in terms of deviations between $\hat K$ and $K$ and between $\hat\mu$ and $\mu$. As an initial application, we take the semiparametric setting and suppose that $K = \Sigma_0^{1/2} K_0$ for some known balanced $K_0 \in \mathcal{K}$ and some unknown positive definite matrix $\Sigma_0 \in \mathbb{R}^{p \times p}$; one important example of this setting is elliptical symmetry, where $K_0$ is the unit Euclidean ball. Then we estimate $K$ by $\hat K = \hat\Sigma^{1/2} K_0$, where $\hat\Sigma$ is the sample covariance matrix, and estimate $\mu$ by the sample mean $\hat\mu$. This yields a worst-case squared Hellinger risk bound of order $p^{3/2}/n^{1/2}$ up to polylogarithmic factors, and moreover we obtain adaptation rates of order $n^{-4/5} + p^3/n$ and $p^3/n$ in cases where $f_0(\cdot) = e^{\phi_0(\|\cdot - \mu\|_K)}$ with $\phi_0$ smooth and 1-affine respectively, again up to polylogarithmic factors. In a second application, we consider the nonparametric setting where $K \in \mathcal{K}$ is arbitrary. Here, we propose a new algorithm to estimate $K$ as the convex hull of estimates of its boundary at a set $\{\theta_m : m = 1,\ldots,M\}$ of randomly chosen directions, where these boundary estimates are obtained as the average Euclidean norm of observations lying in a cone around $\theta_m$. The resulting estimator $\hat f_n$ is shown to have a worst-case squared Hellinger risk bound of order $n^{-1/(p+1)}$, which improves to $n^{-2/(p+1)}$ in cases where $f_0(\cdot) = e^{\phi_0(\|\cdot - \mu\|_K)}$ with $\phi_0$ smooth, up to polylogarithmic factors. Importantly, this estimator is computable even in high-dimensional settings because there is no need to enumerate the facets of $\hat K$; in order to evaluate $\hat f_n(x^*)$ for some $x^* \in \mathbb{R}^p$, we need only compute $\|x^*\|_{\hat K}$, which can be achieved by a simple linear programme owing to the representation of $\hat K$ as a convex hull of $M$ points.
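To illustrate the linear programme mentioned above: if $\hat K$ is the convex hull of points $v_1,\ldots,v_M$ with the origin in its interior, then $\|x\|_{\hat K} = \min\{\sum_m \lambda_m : \sum_m \lambda_m v_m = x,\ \lambda \geq 0\}$. The sketch below is not the implementation used in the paper. It is restricted to the plane and, as an assumption made purely to stay dependency-free, replaces a general LP solver by enumeration of basic feasible solutions (pairs of vertices); the function name `gauge_from_hull_2d` is hypothetical.

```python
from itertools import combinations


def gauge_from_hull_2d(x, vertices, tol=1e-12):
    """Minkowski functional of conv(vertices) at x, for points in the plane.

    Solves  min sum(lam)  subject to  sum_m lam_m * v_m = x,  lam >= 0
    by enumerating basic feasible solutions (pairs of vertices).
    Assumes the origin lies in the interior of the hull.
    """
    best = float("inf")
    for (a1, a2), (b1, b2) in combinations(vertices, 2):
        det = a1 * b2 - a2 * b1
        if abs(det) < tol:  # parallel pair: not a basis
            continue
        # Cramer's rule for lam_a * a + lam_b * b = x
        lam_a = (x[0] * b2 - x[1] * b1) / det
        lam_b = (a1 * x[1] - a2 * x[0]) / det
        if lam_a >= -tol and lam_b >= -tol:
            best = min(best, lam_a + lam_b)
    return best


# K = [-1, 1]^2 given by its four vertices; here ||x||_K is the sup-norm.
square = [(1.0, 1.0), (1.0, -1.0), (-1.0, -1.0), (-1.0, 1.0)]
value = gauge_from_hull_2d((0.5, -2.0), square)
```

The enumeration costs $O(M^2)$ in the plane and grows combinatorially with $p$, so in higher dimensions a proper LP solver would be the natural choice; only the $M$ hull points are needed, never the facets.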
Section 6 provides a simulation study that illustrates our theoretical results and confirms the computational feasibility of our estimators. Proofs of some of our main results are given in the Appendix; several other proofs, as well as auxiliary results, are given in the online supplement. Results in the online supplement have an ‘S’ prefixing their label numbers.
Other recent work on estimation over the class Fp of log-concave densities on Rp includes Robeva et al. (2018), Carpenter et al. (2018), Feng et al. (2018) and Dagan and Kur (2019); see Samworth (2018) for a review. Multivariate shape-constrained density estimation has also been considered over other classes, including the set of block decreasing densities on [0,1]p (Polonik, 1995, 1998; Biau and Devroye, 2003; Gao and Wellner, 2007), and the class of s-concave densities on Rp (Doss and Wellner, 2016; Han and Wellner, 2016). In the former case, for uniformly bounded densities, Biau and Devroye (2003) established a minimax lower bound in total variation distance of order n−1/(p+2), while in the latter case, the main interest has been in the classes with s<0, which contain the class Fp, so the same minimax lower bounds apply as for Fp. Various other simplifying structures and methods have also been considered for nonparametric high-dimensional density estimation, including kernel approaches for forest density estimation (Liu et al., 2011) and star-shaped density estimation (Liebscher and Richter, 2016), as well as nonparametric maximum likelihood methods for independent component analysis (Samworth and Yuan, 2012). Perhaps most closely related to this work is the approach of Bhattacharya and Bickel (2012), who consider a maximum likelihood approach (as well as spline approximations) to estimating the generator of an elliptically symmetric distribution with decreasing generator.
Notation: Given $n \in \mathbb{N}$, we write $[n] := \{1,\ldots,n\}$. Given $a, b \in \mathbb{R}$, we write $a \vee b := \max(a,b)$ and $a \wedge b := \min(a,b)$. We also say that $a \lesssim b$ if there exists a universal constant $C > 0$ such that $a \leq Cb$ and, given also some quantity $s$, that $a \lesssim_s b$ if there exists some $C_s > 0$, depending only on $s$, such that $a \leq C_s b$. For a given function $f : A \to \mathbb{R}$ on some domain $A$, let $\|f\|_\infty := \sup_{x \in A} |f(x)|$; for a Borel measurable function $f : A \to \mathbb{R}$, we let $\|f\|_{\mathrm{esssup}}$ denote the (Lebesgue) essential supremum. Additionally, if $A = \mathbb{R}$ and $f$ is a density, then we let $\mu_f := \int_{-\infty}^{\infty} x f(x)\,dx$ and $\sigma_f := \bigl\{\int_{-\infty}^{\infty} (x - \mu_f)^2 f(x)\,dx\bigr\}^{1/2}$ be the mean and standard deviation of $f$ respectively. If $A \subseteq \mathbb{R}^p$, we let $\mathcal{B}(A)$ be the set of Borel measurable subsets of $A$. If $A \in \mathcal{B}(\mathbb{R}^p)$, we let $\lambda_p(A)$ denote the Lebesgue measure of $A$. We let $B_p(0,1)$ denote the unit ball in $\mathbb{R}^p$ and write $\kappa_p := \lambda_p(B_p(0,1))$. If $x \in \mathbb{R}^n$ is a vector, then we let $\|x\|$ denote its $\ell_2$ norm. If $B \in \mathbb{R}^{m \times n}$, then we let $\|B\|_{\mathrm{op}} := \sup_{x \in \mathbb{R}^n : \|x\| = 1} \|Bx\|$ be its operator norm. We let $\mathbb{S}^{p \times p}$ denote the set of positive definite $p \times p$ matrices. For a set $D \subseteq \mathbb{R}^p$, we let $\mathrm{conv}(D)$ denote its convex hull, and if $D$ is convex, we let $\partial D$ denote its boundary.
2 Minkowski functionals, homothetic log-concave densities and projections
2.1 Minkowski functionals
In this section, we introduce notation and basic results that will be used throughout the paper.
Definition:
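Let $\mathcal{K} := \{K \subseteq \mathbb{R}^p : K \text{ convex, compact}, 0 \in \mathrm{int}(K)\}$. For $K \in \mathcal{K}$, we define the Minkowski functional $\|\cdot\|_K : \mathbb{R}^p \to [0,\infty)$ as
$$\|x\|_K := \inf\{t \in [0,\infty) : x \in tK\}. \qquad (2)$$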
The proposition below presents some standard properties of the Minkowski functional. In particular, ∥⋅∥K is not necessarily a norm but it is subadditive and positively homogeneous.
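Proposition 1.
Let $K \in \mathcal{K}$ and $x, y \in \mathbb{R}^p$. Then
(i) $\|x\|_K < \infty$;
(ii) $x \in K$ if and only if $\|x\|_K \leq 1$;
(iii) $x \in \partial K$ if and only if $\|x\|_K = 1$, where $\partial K := K \setminus \mathrm{int}(K)$;
(iv) $\|x + y\|_K \leq \|x\|_K + \|y\|_K$ and, if $\alpha \geq 0$, then $\|\alpha x\|_K = \alpha \|x\|_K$.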
In fact, if K=−K, then ∥⋅∥K is a norm; as a special case, if K is the closed Euclidean unit ball, then ∥⋅∥K coincides with the Euclidean norm. Conversely, a norm is also a Minkowski functional: let ∥⋅∥ be a norm and define a convex body K={x∈Rp:∥x∥≤1}, then we have that ∥x∥=∥x∥K for all x∈Rp.
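As a small numerical illustration of these properties, the sketch below evaluates the Minkowski functional for a polytope written as $K = \{y : \langle a, y\rangle \leq 1 \text{ for every normal } a\}$, for which $\|x\|_K = \max(0, \max_a \langle a, x\rangle)$. The function name `minkowski_functional` and the two example bodies are assumptions chosen for illustration.

```python
def minkowski_functional(x, normals):
    """||x||_K for the polytope K = {y : <a, y> <= 1 for every a in normals}."""
    return max(0.0, max(sum(ai * xi for ai, xi in zip(a, x)) for a in normals))


# Balanced body: the square [-1, 1]^2, whose functional is the sup-norm.
cube = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]

# Non-balanced body: the triangle with vertices (1, 1), (1, -2), (-2, 1).
triangle = [(1.0, 0.0), (0.0, 1.0), (-1.0, -1.0)]

sup_norm = minkowski_functional((0.5, -2.0), cube)    # 2.0
forward = minkowski_functional((1.0, 1.0), triangle)  # 1.0
backward = minkowski_functional((-1.0, -1.0), triangle)  # 2.0
```

For the triangle, $\|x\|_K \neq \|-x\|_K$ at $x = (1,1)$, which is why the functional is only a norm when $K = -K$.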
2.2 Homothetic log-concave densities
We say that a density f on Rp is homothetic if there exist a decreasing function r:(0,∥f∥∞)→[0,∞), a measurable subset A of Rp with 0∈int(A) and μ∈Rp such that {x:f(x)≥t}=r(t)A+μ for every t∈(0,∥f∥∞). Note that any such set A has the property that if 0<t1≤t2<∥f∥∞, then r(t1)A⊇r(t2)A.
In fact, this definition also characterises the level set of f at ∥f∥∞ since, for any sequence tn∈(0,∥f∥∞) such that tn↗∥f∥∞,
[TABLE]
We have that $\bigcap_{n=1}^{\infty} r(t_n) A + \mu \supseteq \bigl(\lim_n r(t_n)\bigr) A + \mu$, with equality in certain cases. For example, if $\lim_n r(t_n) = 0$, then equality holds when $A$ is bounded; if $\lim_n r(t_n) > 0$, then equality holds when $A$ is closed, which occurs when, e.g., $f$ is upper semi-continuous.
Recall that Φ denotes the set of upper semi-continuous, concave, decreasing functions ϕ:[0,∞)→[−∞,∞). The following proposition characterises densities on Rp that are simultaneously homothetic and log-concave.
Proposition 2.
Let f be an upper semi-continuous density on Rp. Then f is homothetic and log-concave if and only if there exist K∈K, μ∈Rp and ϕ∈Φ such that f(⋅)=eϕ(∥⋅−μ∥K). If f has an alternative representation as f(⋅)=eϕ~(∥⋅−μ~∥K~), where K~∈K, μ~∈Rp and ϕ~∈Φ, then there exist σ>0, σ′>0 such that K~=σK+σ′(μ−μ~) and ϕ~(⋅)=ϕ(σ⋅); moreover, if f is not the uniform distribution, then μ~=μ.
Proposition 2 states that any upper semi-continuous, homothetic, and log-concave density may be parametrised by a generator $\phi \in \Phi$, a super-level set $K \in \mathcal{K}$, and a centering vector $\mu \in \mathbb{R}^p$. Moreover, as long as $f$ is not the uniform distribution, the only degree of non-identifiability is that we may scale $K$ and horizontally dilate $\phi$ by the same scalar $\sigma > 0$. This degree of non-identifiability is in fact helpful for density estimation because we need only estimate $K$ up to a scaling factor in order to estimate the density $f$.
We let $\mathcal{F}_p^{\mathcal{K}}$ denote the set of all upper semi-continuous, homothetic, log-concave densities on $\mathbb{R}^p$, and for $K \in \mathcal{K}$ and $\mu \in \mathbb{R}^p$, let $\mathcal{F}_p^{K,\mu}$ denote the set of $K$-homothetic, log-concave densities of the form given in Proposition 2. We also write $\mathcal{F}_p^{K} := \mathcal{F}_p^{K,0}$. The following proposition can be regarded as an analogue of a known characterisation of elliptically symmetric densities (where $K$ is taken to be an ellipsoid) to the general homothetic, log-concave case.
Proposition 3.
Let $f \in \mathcal{F}_p^{K}$ be of the form $f(\cdot) = e^{\phi(\|\cdot\|_K)}$, for some $K \in \mathcal{K}$ and $\phi \in \Phi$. Let $R$ be a random variable taking values in $[0,\infty)$ with density $h$, where $h(r) := p\lambda_p(K) r^{p-1} e^{\phi(r)}$ for $r \in [0,\infty)$, and let $Z$ be a random vector, independent of $R$, uniformly distributed on $K$. Then $\frac{Z}{\|Z\|_K} R$ has density $f$.
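Proposition 3 doubles as a sampling recipe. The sketch below applies it in one concrete case chosen for convenience, not taken from the paper: $K = [-1,1]^p$, so that $\|x\|_K$ is the sup-norm and $\lambda_p(K) = 2^p$, with generator $\phi(r) = -r - \log(p!\,2^p)$. In that case $h(r) = r^{p-1}e^{-r}/\Gamma(p)$ is the $\mathrm{Gamma}(p,1)$ density. The function name `sample_homothetic_cube` is hypothetical.

```python
import random


def sample_homothetic_cube(n, p, rng):
    """Draw n points from f(x) = exp(-||x||_K) / (p! * 2**p), K = [-1, 1]^p.

    Here ||x||_K is the sup-norm, and the radial density
    h(r) = p * lambda_p(K) * r**(p-1) * exp(phi(r)) is Gamma(p, 1).
    """
    out = []
    for _ in range(n):
        z = [rng.uniform(-1.0, 1.0) for _ in range(p)]  # Z uniform on K
        z_norm = max(abs(c) for c in z)                 # ||Z||_K
        r = rng.gammavariate(p, 1.0)                    # R with density h
        out.append([r * c / z_norm for c in z])         # R * Z / ||Z||_K
    return out


rng = random.Random(0)
points = sample_homothetic_cube(20000, 3, rng)
radii = [max(abs(c) for c in x) for x in points]  # ||X_i||_K, equal to the drawn R
mean_radius = sum(radii) / len(radii)             # should be close to p = 3
```

By positive homogeneity, $\|R Z/\|Z\|_K\|_K = R$, so the empirical mean of the radii should be close to $\mathbb{E}R = p$.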
Remark:
The random vector $Z/\|Z\|_K$ is supported on the boundary of $K$. When $K$ is the unit Euclidean ball in $\mathbb{R}^p$, we have that $Z/\|Z\|_K$ is uniformly distributed on the surface of the unit Euclidean sphere. However, when $K$ is an arbitrary convex body, $Z/\|Z\|_K$ is generally not distributed uniformly on the surface $\partial K$. As a simple example in $\mathbb{R}^2$, we may take $K = B_2(0,1) \cap \{(x_1,x_2) : |x_1| \leq 2^{-1/2}\}$. The probability that $Z/\|Z\|_K$ lies on the line segment $\{(2^{-1/2}, x_2) : x_2 \in [-2^{-1/2}, 2^{-1/2}]\}$ is $1/(2+\pi)$, whereas the length of the line segment divided by the perimeter of $K$ is $1/(2 + 2^{-1/2}\pi)$.
2.3 Projections onto the class of homothetic, log-concave densities
In this section, we fix K∈K and consider projections onto FpK. For ϕ∈Φ and a probability measure P on Rp, we define
[TABLE]
and write φ∗(P)≡φK∗(P):=argmaxϕ∈ΦL(ϕ,P). Since L(⋅,P) is strictly concave, any maximiser of L(⋅,P) over Φ is unique.
If L(ϕ,P)∈R, then
[TABLE]
so $L(\phi + c, P)$ is maximised by choosing $c = -\log\bigl(p\lambda_p(K) \int_0^\infty r^{p-1} e^{\phi(r)}\,dr\bigr)$. It follows that if $\varphi^*(P)$ exists, and if $L(\varphi^*(P), P) \in \mathbb{R}$, then we can define the $K$-homothetic log-concave projection $f^*(P) \equiv f_K^*(P) \in \mathcal{F}_p^{K}$ by $f^*(P)(\cdot) := e^{\varphi^*(P)(\|\cdot\|_K)}$. When the centering vector $\mu$ is not the origin, the projection of a probability measure $P$ onto $\mathcal{F}_p^{K,\mu}$ may be reduced to the case where $\mu = 0$ by translating the probability measure $P$ by $-\mu$, projecting the translated distribution onto $\mathcal{F}_p^{K}$, and then translating the resulting log-concave density back by $\mu$.
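The normalising shift $c$ can be checked numerically in a case with a known answer. The sketch below assumes $K$ is the Euclidean unit ball and takes the unnormalised generator $\phi(r) = -r^2/2$, so that $e^{\phi(\|\cdot\|) + c}$ should be the standard Gaussian density and $c = -(p/2)\log(2\pi)$. The helper names, the trapezoid rule and the truncation point `upper` are illustrative choices, not part of the paper.

```python
import math


def unit_ball_volume(p):
    """Lebesgue measure of the Euclidean unit ball in R^p."""
    return math.pi ** (p / 2) / math.gamma(p / 2 + 1)


def normalising_shift(phi, p, vol_K, upper=40.0, steps=40000):
    """c = -log(p * vol_K * int_0^upper r^(p-1) exp(phi(r)) dr), trapezoid rule."""
    h = upper / steps
    total = 0.0
    for i in range(steps + 1):
        r = i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * r ** (p - 1) * math.exp(phi(r))
    return -math.log(p * vol_K * total * h)


p = 3
c = normalising_shift(lambda r: -0.5 * r * r, p, unit_ball_volume(p))
expected = -(p / 2) * math.log(2 * math.pi)
```

Only a one-dimensional integral is needed whatever the value of $p$, which reflects the reduction to the radial density $h$.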
By Proposition 2, for any α>0, it holds that FpK=FpαK; we therefore need to check that fK∗(P) does not depend on the choice of K. To see this, fix α>0 and ϕ∈Φ, and define ϕα:[0,∞)→[−∞,∞) by ϕα(r):=ϕ(αr) for r∈[0,∞). Observe then that LK(ϕ,P)=LαK(ϕα,P) and hence, if we write ϕ∗:=φK∗(P), then ϕα∗=φαK∗(P) and therefore, fK∗(P)=fαK∗(P) as desired.
In fact, in order to study φ∗(P), it will be convenient also to define a related projection of a one-dimensional probability distribution. To this end, for a≥0, let Φa:={ϕ(⋅−a):ϕ∈Φ} and set
[TABLE]
Here, we incorporate the greater generality of the translation by a in order to facilitate our analysis of the adaptivity properties of the K-homothetic log-concave MLE in Section 3.2. We continue to write Φ=Φ0 and also write H=H0 as shorthand. For a probability measure Q on [a,∞) and ϕ∈Φa, let
[TABLE]
Similarly to before, we let ϕ∗(Q)≡ϕa,K∗(Q):=argmaxϕ∈ΦaL(ϕ,Q). Again, any maximiser ϕ∗(Q) of L(⋅,Q) over Φa is unique, and if ϕ∗(Q) exists with L(ϕ∗(Q),Q)∈R, then, writing h∗(Q)(r)≡ha,K∗(Q)(r):=pλp(K)rp−1eϕ∗(Q)(r), we have that h∗(Q)∈Ha, so in particular, h∗(Q) is a (log-concave) density.
The following proposition gives necessary and sufficient conditions for the K-homothetic log-concave projection to be well-defined. We write Pp for the set of probability distributions on Rp with ∫Rp∥x∥KdP(x)<∞ and P({0})<1; the first of these conditions is equivalent to ∫Rp∥x∥dP(x)<∞. We let Qa denote the class of probability measures Q on [a,∞) with ∫a∞rdQ(r)<∞ and Q({a})<1, and let Q:=Q0.
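Proposition 4.
We have
(i) if $\int_a^\infty r\,dQ(r) = \infty$, then $L(\phi, Q) = -\infty$ for all $\phi \in \Phi_a$;
(ii) if $Q(\{a\}) = 1$, then $\sup_{\phi \in \Phi_a} L(\phi, Q) = \infty$;
(iii) if $Q \in \mathcal{Q}_a$, then $\sup_{\phi \in \Phi_a} L(\phi, Q) \in \mathbb{R}$ and $\phi_a^*$ is a well-defined map from $\mathcal{Q}_a$ to $\Phi_a$;
(iv) if $P$ is a probability measure on $\mathbb{R}^p$ and we define a probability measure $Q$ on $[0,\infty)$ by $Q\bigl([0,r)\bigr) := P(\{x : \|x\|_K < r\})$, then $L(\phi, P) = L(\phi, Q)$ for every $\phi \in \Phi$. In particular, if $P \in \mathcal{P}_p$, then $Q \in \mathcal{Q}$ and $\varphi^*(P) = \phi^*(Q)$.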
Remark:
From Proposition 4(iv), we see that the conditions on P required for the K-homothetic log-concave projection to exist, namely ∫Rp∥x∥KdP(x)<∞ and P({0})<1, are weaker than the corresponding conditions for the ordinary log-concave projection to exist, namely ∫Rp∥x∥dP(x)<∞ and P(H)<1 for every hyperplane H; cf. Dümbgen et al. (2011, Theorem 2.2).
The next proposition gives some basic properties of the K-homothetic log-concave projection.
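Proposition 5.
Let $Q \in \mathcal{Q}_a$, let $\phi^* := \phi_a^*(Q)$ and let $h^*(r) := p\lambda_p(K) r^{p-1} e^{\phi^*(r)}$ for $r \in [a,\infty)$.
(i) The projection $\phi_a^*(\cdot)$ is scale equivariant in the sense that if $\alpha > 0$, and $Q_\alpha \in \mathcal{Q}_{\alpha a}$ is defined by $Q_\alpha\bigl([\alpha a, r)\bigr) := Q\bigl([a, r/\alpha)\bigr)$ for all $r \in [\alpha a, \infty)$, then $\phi_{\alpha a}^*(Q_\alpha)(r) = \phi_a^*(Q)(r/\alpha) - p\log\alpha$, and therefore $h_{\alpha a}^*(Q_\alpha)(r) = (1/\alpha) h_a^*(Q)(r/\alpha)$.
(ii) Let $\Delta : [a,\infty) \to [-\infty,\infty)$ be a function satisfying the property that there exists $t > 0$ such that $r \mapsto \phi^*(r) + t\Delta(r) \in \Phi_a$. Then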
[TABLE]
(iii) $\int_a^\infty r h^*(r)\,dr \leq \int_a^\infty r\,dQ(r)$.
(iv) For any $h_0 \in \mathcal{H}_a$, we have $\int_a^\infty h^* \log(h^*/h_0) \leq \int_a^\infty \log(h^*/h_0)\,dQ$.
Remark:
Proposition 5(iii) reveals a difference between the K-homothetic log-concave projection, and the ordinary log-concave projection, which preserves the mean (Dümbgen et al., 2011, Remark 2.3). Lemma S2 provides control on the extent to which the mean is shrunk by the K-homothetic log-concave projection in the special case where Q is an empirical distribution.
In particular, consider X1,…,Xn∼iidP∈Pp with empirical distribution Pn, and for A∈B(R), let Q(A):=P({x:∥x∥K∈A}). Let Zi:=∥Xi∥K for i∈[n] and let Qn denote the empirical distribution of Z1,…,Zn. Writing f^n:=f∗(Pn), f∗:=f∗(P), h^n:=h∗(Qn) and h∗:=h∗(Q), we have by Proposition 5(iv) and Lemma S1 that
[TABLE]
As a final basic property of our projections, we establish continuity with respect to the Wasserstein distance. Recall that if P,P′ are probability measures on a Euclidean space with finite first moments, then the Wasserstein distance between P and P′ is defined as
$$d_W(P, P') := \inf_{(X, X')} \mathbb{E}\|X - X'\|,$$
where the infimum is taken over all pairs of random vectors $X, X'$, defined on the same probability space, with $X \sim P$ and $X' \sim P'$. We also recall that if $P$ has a finite first moment, then $d_W(P_n, P) \to 0$ if and only if both $P_n \overset{d}{\to} P$ and $\int_{\mathbb{R}^p} \|x\|\,dP_n(x) \to \int_{\mathbb{R}^p} \|x\|\,dP(x)$.
Proposition 6.
Suppose that P∈Pp and dW(Pn,P)→0. Then supϕ∈ΦL(ϕ,Pn)→supϕ∈ΦL(ϕ,P), f∗(Pn) is well-defined for large n and
[TABLE]
Moreover, given any compact set D contained in the interior of the support of f∗(P), we have supx∈D∣f∗(Pn)(x)−f∗(P)(x)∣→0.
Remark:
Proposition 6 immediately yields a consistency (and robustness to misspecification) result for the K-homothetic log-concave MLE. In particular, suppose that X1,…,Xn∼iidP∈Pp with empirical distribution Pn, and let f^n:=f∗(Pn), f∗:=f∗(P). Then, by the strong law of large numbers and Varadarajan’s theorem (e.g. Dudley, 2002, Theorem 11.4.1), we have dW(Pn,P)→a.s.0, so
[TABLE]
Remark:
In fact, the conclusion of Proposition 6 also holds for stronger norms than the total variation norm. In particular, taking a0>0 and b0∈R such that f∗(P)(x)≤e−a0∥x∥+b0 for all x∈Rp, we have by, e.g., Cule and Samworth (2010, Proposition 2) that for every a<a0,
[TABLE]
3 Risk bounds when K is known
In this section we continue to consider K∈K as fixed (and known) and μ=0. Let f0∈FpK, and suppose that X1,…,Xn∼iidf0 with empirical distribution Pn. Let f^n:=f∗(Pn) be the K-homothetic log-concave MLE. By Proposition 4(iv), we may compute f^n efficiently by first computing Zi=∥Xi∥K for i∈[n] and then, writing Qn for the empirical distribution of Z1,…,Zn, computing ϕ^n:=ϕ∗(Qn). Our final estimate is f^n:=eϕ^n. We defer algorithmic details to Section 5.
3.1 Worst-case bound
Our first main result below provides a worst-case risk bound for f^n as an estimator of f0 in terms of the dX2 divergence.
Theorem 7.
Let X1,…,Xn∼iidf0∈FpK with empirical distribution Pn. Let f^n:=f∗(Pn) be the K-homothetic log-concave MLE. There exists a universal constant C>0 such that for n≥8,
$$\mathbb{E}\, d_X^2(\hat f_n, f_0) \leq C n^{-4/5}.$$
As mentioned in the introduction, the attractive aspect of this bound is that it does not depend on p. The proof relies heavily on the special moment preservation properties of the K-homothetic log-concave MLE developed in Lemmas S2, S3 and S5.
3.2 Adaptive bounds
We now turn to the adaptation properties of $\hat f_n$. For $k \in \mathbb{N}$ and $a \geq 0$, we say $\phi \in \Phi_a$ is $k$-affine, and write $\phi \in \Phi_a^{(k)}$, if there exist $r_0 \in (a,\infty]$ and intervals $I_1,\ldots,I_k$ with $I_j = [a_{j-1}, a_j]$ for some $a = a_0 < a_1 < \ldots < a_k = r_0$ such that $\phi$ is affine on each $I_j$ for $j \in [k]$, and $\phi(r) = -\infty$ for $r > r_0$. Define $\mathcal{H}_a^{(k)} := \bigl\{h \in \mathcal{H}_a : h(r) = p\lambda_p(K) r^{p-1} e^{\phi(r)} \text{ for some } \phi \in \Phi_a^{(k)}\bigr\}$. Again, we write $\Phi^{(k)} := \Phi_0^{(k)}$ and $\mathcal{H}^{(k)} := \mathcal{H}_0^{(k)}$.
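Theorem 8.
Let $f_0 \in \mathcal{F}_p^{K}$ be given by $f_0(\cdot) = e^{\phi_0(\|\cdot\|_K)}$ for some $\phi_0 \in \Phi$, and let $X_1,\ldots,X_n \overset{\mathrm{iid}}{\sim} f_0$ with empirical distribution $\mathbb{P}_n$. Let $\hat f_n := f^*(\mathbb{P}_n)$ be the $K$-homothetic log-concave MLE. Define $h_0 \in \mathcal{H}$ by $h_0(r) := p\lambda_p(K) r^{p-1} e^{\phi_0(r)}$ for $r \in [0,\infty)$. Then, writing $\nu_k := 2^{1/2} \wedge \inf_{h \in \mathcal{H}^{(k)}} d_{\mathrm{KL}}(h_0, h)$, there exists a universal constant $C > 0$ such that for $n \geq 8$,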
[TABLE]
Remark:
Taking the universal constant $C > 0$ from the conclusion of Theorem 8 and setting $C_* := \max\{(3C/2)^{5/4}, 1\}$, we see that if $k \in [n]$ and if $\nu_k^2 \geq C_* \frac{k}{n} \log^{5/4}\bigl(\frac{en}{k}\bigr)$, then
[TABLE]
On the other hand, if $k \in [n]$ and if $\nu_k^2 \leq C_* \frac{k}{n} \log^{5/4}\bigl(\frac{en}{k}\bigr)$, then
[TABLE]
It follows that Theorem 8 implies the following sharp oracle inequality: there exists a universal constant C>0 such that
[TABLE]
The proof of Theorem 8 proceeds by first considering the case k=1, described in Proposition 9 below, for which we obtain a slightly different approximation error term.
Proposition 9.
Let a∈[0,∞) and suppose that Z1,…,Zn∼iidh0 for some h0∈Ha with empirical distribution function Qn, and let h^n:=ha∗(Qn). Set ν:=inf{dH(h0,h):h∈Ha(1),h0≪h}. Then there exists a universal constant C>0 such that for n≥8,
[TABLE]
Remark:
Since $2^{1/2} \leq n e^{-3/2}$ for $n \geq 8$ and the function $x \mapsto x^{2/5} \log(en/x)$ is increasing for $x \leq n e^{-3/2}$, Proposition 9 remains true if we redefine $\nu := 2^{1/2} \wedge \inf_{h \in \mathcal{H}_a^{(1)}} d_{\mathrm{KL}}(h_0, h)$. Hence, the conclusion of Proposition 9 is stronger than that obtained by specialising Theorem 8 to the case $k = 1$.
Proposition 9 is analogous to Theorem 5 in Kim et al. (2018a). However, the proof does not follow from the local bracketing entropy analysis in Kim et al. (2018a, Theorem 4) because, for $h \in \mathcal{H}_a^{(1)}$, $\log h$ is not an affine function when $p \geq 2$. To prove Proposition 9, then, we show in Lemma S13 that the bracketing entropy of a local Hellinger ball around an arbitrary $g_0 \in \mathcal{F}_1$ is small if we further restrict the local ball to include only $g \in \mathcal{F}_1$ such that $\log(g/g_0)$ is concave. Kim et al. (2018a, Theorem 4) were interested in the case where $g_0$ is log-affine, for which $\log(g/g_0)$ is necessarily concave for every log-concave $g$, so their result can be considered as a special case of Lemma S13.
4 Risk bounds when K is estimated
4.1 General approach
When $K$ is unknown and needs to be estimated, one approach is to attempt to maximise (3) jointly in $K$ and $\phi$; however, this appears to be computationally infeasible. We therefore consider the following plug-in procedure, where for simplicity of exposition we assume an even sample size. Given $p$-dimensional random vectors $X_1,\ldots,X_{2n}$, we use $X_{n+1},\ldots,X_{2n}$ to construct estimators $\hat K$ of $K$ and $\hat\mu$ of $\mu$ (we give specific examples of how to construct $\hat K$ in Sections 4.2 and 4.3 below), where we think of $\mathcal{K}$ as being metrised by the Hausdorff metric, and equip $\mathcal{K}$ with the $\sigma$-algebra induced by this metric. We then form $\tilde Z_i := \|X_i - \hat\mu\|_{\hat K}$ for $i \in [n]$ and, writing $\tilde{\mathbb{Q}}_n$ for the empirical distribution of $\tilde Z_1,\ldots,\tilde Z_n$, compute $\hat\phi_n := \phi_{\hat K}^*(\tilde{\mathbb{Q}}_n)$. Our final density estimate is $\hat f_n(\cdot) := e^{\hat\phi_n(\|\cdot - \hat\mu\|_{\hat K})}$.
Our goal in this section is to analyse the performance of the plug-in procedure without restricting our attention to any specific estimators K^ and μ^. To do this, we assume that X1,…,X2n are generated independently from a density f0∈FpK of the form f0(⋅)=eϕ0(∥⋅−μ∥K) and then bound the Hellinger error dH(f^n,f0) in terms of the deviations between K^ and K and between μ^ and μ. To that end, for K1,K2∈K, define a pseudo-metric
[TABLE]
This notion of distance satisfies all of the axioms for being a metric except for the triangle inequality; it is also scale invariant in the sense that dscale(γK1,γK2)=dscale(K1,K2) for any γ>0. For c1,c2>0, let
[TABLE]
Our main result in this subsection is the following:
Proposition 10.
Let X1,…,X2n∼iidf0∈FpK,μ, and let f^n, K^ and μ^ be as defined above. Then there exist universal constants c1,c2>0 such that
[TABLE]
Moreover, if f0∈FpK,μ is of the form f0(x)=eϕ0(∥x−μ∥K) for some ϕ0∈Φ such that ϕ0′ is absolutely continuous and such that infr∈[0,∞)ϕ0′′(r)≥−D0 for some D0>0, then
[TABLE]
Finally, if f0∈FpK,μ is of the form f0(⋅)=e−a∥⋅−μ∥K+b for some a>0 and b∈R, then
[TABLE]
The first term in the error bounds of Proposition 10 arises from the estimation of the generator ϕ0, the second term arises from the estimation of the centering vector μ, and the third term arises from the estimation of the super-level set K.
Remark:
When f0 is not a uniform density, the bounds in Proposition 10 do not depend on the choice of K in the representation of f0. More precisely, if f0(⋅)=eϕ~(∥⋅−μ∥K~) for K~∈K and ϕ~∈Φ, then, by Proposition 2, there exists γ>0 such that K~=γK. We observe then that the quantities on the right-hand sides of the inequalities in Proposition 10 do not change if K is replaced with K~. If f0 is the uniform distribution, then the centering vector μ is not unique either, and Proposition 10 applies to any choice of the centering vector μ.
The main difficulty in the proof of Proposition 10 is that we cannot make any assumptions about the density of the data points Z~1,…,Z~n since these are constructed from μ^ and K^ instead of the true μ and K. We overcome this problem through Lemma S21, where we apply empirical process theory in the presence of model misspecification.
4.2 Risk bounds when K is known up to a positive definite transformation
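In this section, we let $K_0 \in \mathcal{K}$ be a balanced convex body in an isotropic position, so that $K_0 = -K_0$ and $\frac{1}{\lambda_p(K_0)} \int_{K_0} x x^\top\,dx = I_p$. Let $r_1, r_2 > 0$ be such that $r_1 B_p(0,1) \subseteq K_0 \subseteq r_2 B_p(0,1)$ and let $r_0 := r_2/r_1$. Let $\mu \in \mathbb{R}^p$, $\Sigma_0 \in \mathbb{S}^{p \times p}$, $K = \Sigma_0^{1/2} K_0$, $\phi_0 \in \Phi$, and let $f_0 \in \mathcal{F}_p^{K,\mu}$ be such that $f_0(\cdot) = e^{\phi_0(\|\cdot - \mu\|_K)}$. We assume that $K_0$ is known but that $\Sigma_0$ is unknown. Let $\Sigma := \int_{\mathbb{R}^p} (x - \mu)(x - \mu)^\top f_0(x)\,dx$, so that $\Sigma \propto \Sigma_0$.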
Throughout this subsection, we assume that $X_1,\ldots,X_n \overset{\mathrm{iid}}{\sim} f_0$, and denote the sample covariance matrix by $\hat\Sigma := n^{-1} \sum_{i=1}^n (X_i - \hat\mu)(X_i - \hat\mu)^\top$, where $\hat\mu := n^{-1} \sum_{i=1}^n X_i$. We let $\hat K := \hat\Sigma^{1/2} K_0$. The following proposition controls $d_{\mathrm{scale}}(\hat K, K)$ and $\|\hat\mu - \mu\|_K$; the first part relies heavily on Adamczak et al. (2010, Theorem 4.1), which provides a tail bound for the operator norm of the difference between the sample covariance matrix and the identity matrix, uniformly over all isotropic log-concave densities.
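In the elliptically symmetric case, where $K_0$ is the Euclidean unit ball, the plug-in radial statistics have a closed form: $\|x - \hat\mu\|_{\hat K} = \{(x - \hat\mu)^\top \hat\Sigma^{-1} (x - \hat\mu)\}^{1/2}$. The sketch below computes them for planar data using only the standard library. The function name `radial_statistics_2d` and the toy data are assumptions for illustration, and the sample splitting of Section 4.1 is omitted for brevity.

```python
import math


def radial_statistics_2d(points):
    """Z_i = ||X_i - mu_hat||_{K_hat}, with K_hat = Sigma_hat^{1/2} B_2(0, 1)."""
    n = len(points)
    m1 = sum(x for x, _ in points) / n
    m2 = sum(y for _, y in points) / n
    # Sample covariance with divisor n, as in the text.
    s11 = sum((x - m1) ** 2 for x, _ in points) / n
    s22 = sum((y - m2) ** 2 for _, y in points) / n
    s12 = sum((x - m1) * (y - m2) for x, y in points) / n
    det = s11 * s22 - s12 * s12
    # Entries of Sigma_hat^{-1} for a 2 x 2 matrix.
    i11, i22, i12 = s22 / det, s11 / det, -s12 / det
    zs = []
    for x, y in points:
        u, v = x - m1, y - m2
        quad = i11 * u * u + 2.0 * i12 * u * v + i22 * v * v
        zs.append(math.sqrt(max(0.0, quad)))
    return zs


data = [(0.0, 0.0), (2.0, 0.0), (0.0, 1.0), (3.0, 4.0), (-1.0, 2.0)]
z = radial_statistics_2d(data)
mean_sq = sum(t * t for t in z) / len(z)  # equals p = 2 up to rounding
```

The identity $n^{-1}\sum_i \tilde Z_i^2 = \mathrm{tr}(\hat\Sigma^{-1}\hat\Sigma) = p$ gives a quick sanity check on the computation. The values `z` are the inputs that would then be passed to the one-dimensional projection.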
Proposition 11.
There exists a universal constant C≥1 such that, if Cnplog3(en)≤1/2, then with probability at least 1−2/n,
[TABLE]
and
[TABLE]
Remark:
Since K0 is balanced and in an isotropic position, we know from John’s Ellipsoid Theorem that r0≤p1/2, see, e.g., Ball (1997, Theorem 3.1) or John (1948, Section 3). The extreme case where r0=p1/2 is realised when K0=[−1,1]p. For a general K0 however, r0 may be much smaller.
The following corollary is therefore an immediate consequence of Propositions 10 and 11.
Corollary 12.
Let K^,μ^ be as in Proposition 11 and let f^n be as in Proposition 10. Then
[TABLE]
Moreover, if f0∈FpK,μ is of the form f0(⋅)=eϕ0(∥⋅−μ∥K) for some ϕ0∈Φ such that ϕ0′ is absolutely continuous and that infr∈[0,∞)ϕ0′′(r)≥−D0 for some D0>0, then
[TABLE]
Finally, if f0∈FpK,μ is of the form f0(⋅)=e−a∥⋅−μ∥K+b for some a>0 and b∈R, then
[TABLE]
Thus, in particular, Corollary 12 provides risk bounds for estimating elliptically symmetric log-concave densities, where we may take r0=1.
4.3 Risk bounds for general K
For simplicity, we assume that μ=0 in this section. In the case where μ is unknown and K is balanced, we may estimate μ with the empirical mean n−1∑i=1nXi. Our algorithm for estimating K proceeds by estimating the boundary of K at a set of randomly chosen directions and then outputting the convex hull of the estimated boundary points.
The set {θm/∥θm∥K}m=1,…,M contains random points on the boundary of K. It is shown in Lemma S36 that $\max_{m\in[M]}\bigl|t_m-1/\|\theta_m\|_K\bigr|$ is small, which allows us to control the error of the approximation of K by K^. Figure 1 illustrates the behaviour of the algorithm when p=2. Recall the definition of dscale from (7); the next proposition bounds the deviation infα>0dscale(αK^,K).
Proposition 13.
Suppose that p≥2, K∈K, that there exist r2≥r1>0 such that r1Bp(0,1)⊆K⊆r2Bp(0,1), and write r0:=r2/r1. Suppose further that $r_0^2 n^{-1/(p+1)}\log^3(en)\le 1/64$ and let $M:=\bigl\lceil n^{(p-1)/(p+1)}\bigr\rceil$. Let X1,…,Xn,Xn+1,…,Xn+M∼iidf0∈FpK and let K^ be the output of Algorithm 1. Then, there exists a constant Cp,r0>0 depending only on r0 and p such that with probability at least $1-C_{p,r_0}n^{-p/(p+1)}$,
[TABLE]
Remark:
If K is in an isotropic position, then we have that r0≤p by John’s Ellipsoid Theorem (John, 1948, Section 3). If K is additionally balanced, then we know from the same theorem that r0≤p1/2 (see the remark after Proposition 11). Therefore, we recommend that Algorithm 1 be applied to whitened data X~i=Σ^−1/2(Xi−μ^) where Σ^ and μ^ are the sample covariance matrix and the sample mean respectively. This transformation brings K into an approximately isotropic position, and the error in this approximation is given in Proposition 11.
Remark:
Proposition 13 has connections with an extensive line of research on the estimation of a convex body K from observations supported on K; see, e.g., Bronstein (2008) or Brunel (2018a) for introductions to the field.
Corollary 14.
Let K^ be defined as in Proposition 13 and let f^n be defined as in Proposition 10. Then
[TABLE]
Moreover, if f0∈FpK,μ is of the form f0(⋅)=eϕ0(∥⋅−μ∥K) for some ϕ0∈Φ such that ϕ0′ is absolutely continuous and such that infr∈[0,∞)ϕ0′′(r)≥−D0 for some D0>0, then
[TABLE]
Remark:
It is important to observe that the computation of the estimator f^n is scalable for large n and p. Computing K^ requires at most O(pn2logn) operations, since we represent K^ implicitly in terms of its hull vertices and have no need to enumerate its facets. For any x∈Rp, we may then compute ∥x∥K^ through a straightforward linear programme of at most n+1 variables; see (11). Thus, it is also fast to compute ϕ^n and to evaluate f^n(⋅)=eϕ^n(∥⋅∥K^) at any x∈Rp.
5 Algorithm
In this section, we assume K^∈K, μ^∈Rp and data X1,…,Xn∈Rp with empirical distribution Pn are given, and describe an efficient algorithm for computing the K^-homothetic log-concave projection of Pn. Fixing x∈Rp, we first note that in many cases of interest, the Minkowski functional ∥x−μ^∥K^ is easy to compute when K^ is constructed using the estimation schemes described in Section 4. In particular, if K^ is of the form Σ^1/2K0, where K0 is a known convex body whose Minkowski functional is simple to compute, and where Σ^∈Sp×p, as is the case in Section 4.2, then ∥x−μ^∥K^=∥Σ^−1/2(x−μ^)∥K0, so it may also be computed easily. As another example, if K^ is the convex hull of a set of points {Y1,…,YM} in Rp, as is the case in Section 4.3, then ∥x−μ^∥K^ is the solution to the following linear programme:
[TABLE]
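As an illustration of the convex-hull case, the following sketch computes the Minkowski functional with `scipy.optimize.linprog`. It assumes that the programme takes the standard form "minimise $u_{M+1}$ subject to $\sum_{m=1}^M u_m Y_m = x-\hat\mu$, $\sum_{m=1}^M u_m = u_{M+1}$ and $u\ge 0$", and that 0 lies in the interior of the hull; the function name `gauge_convex_hull` is hypothetical.

```python
import numpy as np
from scipy.optimize import linprog


def gauge_convex_hull(Y, x, mu=None):
    """Minkowski functional of x - mu with respect to conv{Y_1, ..., Y_M}.

    Y  : (M, p) array of hull points (0 assumed to be interior to the hull)
    x  : (p,) query point
    mu : (p,) centring vector (defaults to 0)
    """
    Y = np.asarray(Y, dtype=float)
    x = np.asarray(x, dtype=float)
    if mu is not None:
        x = x - np.asarray(mu, dtype=float)
    M, p = Y.shape

    # Variables (u_1, ..., u_M, t); the objective is t.
    c = np.zeros(M + 1)
    c[-1] = 1.0

    # Equalities: sum_m u_m Y_m = x  and  sum_m u_m - t = 0.
    A_eq = np.zeros((p + 1, M + 1))
    A_eq[:p, :M] = Y.T
    A_eq[p, :M] = 1.0
    A_eq[p, M] = -1.0
    b_eq = np.append(x, 0.0)

    # All variables are constrained to be non-negative.
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    if not res.success:
        raise ValueError("LP failed: " + res.message)
    return res.fun
```

With the four vertices of $[-1,1]^2$ as hull points, the returned value should agree with the sup-norm of $x-\hat\mu$; a membership query $x\in\hat K$ then amounts to checking that the value is at most 1.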
Let Zi:=∥Xi−μ^∥K^ for i∈[n], and let Qn denote the empirical distribution of Z1,…,Zn. Proposition 4 shows that, provided at least one of Z1,…,Zn is non-zero, the function ϕ^n:=ϕK^∗(Qn) is well-defined, and we can then set f^n(⋅):=eϕ^n(∥⋅−μ^∥K^). Our aim is therefore to provide an algorithm for computing ϕ^n.
Let Φˉ denote the set of ϕ∈Φ with the property that ϕ is constant on the interval [0,Z(1)] and affine on the intervals [Z(i−1),Z(i)] for i=2,…,n, with ϕ(r)=−∞ for r>Z(n). Fix ϕ∈Φ, and let ϕˉ∈Φˉ be such that ϕˉ(0)=ϕ(0) and ϕˉ(Zi)=ϕ(Zi) for all i=1,…,n. Then by concavity of ϕ, we have ϕ(r)≥ϕˉ(r) for all r∈[0,∞). Hence
[TABLE]
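To make the reduction to Φˉ concrete, the sketch below evaluates the criterion $n^{-1}\sum_{i}\phi(Z_i)-p\lambda_p(\hat K)\int_0^\infty r^{p-1}e^{\phi(r)}\,dr$ (the form written out in Section 6) for a function that is constant on $[0,Z_{(1)}]$, affine between consecutive order statistics and $-\infty$ beyond $Z_{(n)}$. The function name, the argument names and the use of a composite Simpson rule are illustrative assumptions; for large p one would work on the log scale to avoid overflow in $r^{p-1}$.

```python
import math


def criterion(Z_sorted, phi, p, vol_K, n_grid=200):
    """Evaluate mean_i phi_i - p * vol_K * int_0^inf r^{p-1} e^{phi(r)} dr.

    Z_sorted : strictly increasing order statistics Z_(1) < ... < Z_(n)
    phi      : values of phi at those points
    n_grid   : even number of Simpson sub-intervals per affine piece
    """
    n = len(Z_sorted)
    mean_term = sum(phi) / n

    # The constant piece on [0, Z_(1)] has a closed form.
    integral = math.exp(phi[0]) * Z_sorted[0] ** p / p

    # Affine pieces: composite Simpson rule on each [Z_(i), Z_(i+1)].
    for i in range(n - 1):
        a, b = Z_sorted[i], Z_sorted[i + 1]
        slope = (phi[i + 1] - phi[i]) / (b - a)
        h = (b - a) / n_grid

        def g(r):
            return r ** (p - 1) * math.exp(phi[i] + slope * (r - a))

        s = g(a) + g(b)
        for k in range(1, n_grid):
            s += (4 if k % 2 else 2) * g(a + k * h)
        integral += s * h / 3

    return mean_term - p * vol_K * integral
```

As a sanity check, when $e^{\phi(\|\cdot\|_K)}$ is itself a density the integral term equals 1, so the criterion reduces to the average of the $\phi(Z_i)$ minus 1.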
The volume λp(K^), in the case where K^=Σ^1/2K0 where the volume of K0 is known and Σ^∈Sp×p, takes the simple form λp(K0)det(Σ^)1/2. More generally, λp(K^) can be computed efficiently if, for any x∈Rp, the query of whether or not x∈K^ can be evaluated efficiently. When K^ is the convex hull of M points in Rp for example, we may evaluate a query by solving the linear programme (11) and then checking whether the solution uM+1 is less than or equal to 1. If we let q denote the number of queries made by an algorithm, then Kannan et al. (1997) give a Markov chain Monte Carlo algorithm whose query complexity q is bounded by $O(p^5)$ up to polylogarithmic factors. In fact, the computation of the volume of a convex body is a deep and beautiful problem that has been studied intensively by theoretical computer scientists since the seminal paper of Dyer et al. (1991), who first gave a polynomial time algorithm for the problem. It is one of the few instances in computer science where all deterministic algorithms are provably intractable but efficient randomised algorithms exist. We refer readers to Simonovits (2003) for an accessible tutorial.
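For the case K^=Σ^1/2K0, both the Minkowski functional and the volume reduce to a few lines of linear algebra. The sketch below assumes NumPy, takes the Minkowski functional of K0 as a user-supplied function, and works with the logarithm of the volume since $\lambda_p(\hat K)$ itself can underflow or overflow when p is large; all function names are hypothetical.

```python
import math
import numpy as np


def inv_sqrt_psd(Sigma):
    """Symmetric inverse square root of a positive definite matrix."""
    w, V = np.linalg.eigh(Sigma)
    return (V / np.sqrt(w)) @ V.T


def gauge_linear_image(x, mu, Sigma, gauge_K0):
    """||x - mu||_{K_hat} = ||Sigma^{-1/2}(x - mu)||_{K0} for K_hat = Sigma^{1/2} K0."""
    diff = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    return gauge_K0(inv_sqrt_psd(Sigma) @ diff)


def log_volume_linear_image(log_vol_K0, Sigma):
    """log lambda_p(Sigma^{1/2} K0) = log lambda_p(K0) + 0.5 * log det(Sigma)."""
    sign, logdet = np.linalg.slogdet(Sigma)
    if sign <= 0:
        raise ValueError("Sigma must be positive definite")
    return log_vol_K0 + 0.5 * logdet


def log_volume_unit_ball(p):
    """log of pi^{p/2} / Gamma(p/2 + 1), the volume of B_p(0, 1)."""
    return 0.5 * p * math.log(math.pi) - math.lgamma(0.5 * p + 1.0)
```

For instance, `gauge_K0=np.linalg.norm` corresponds to K0=Bp(0,1), and `gauge_K0=lambda v: np.max(np.abs(v))` corresponds to K0=[−1,1]p.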
We now assume for simplicity of exposition that Z1,…,Zn are distinct. The more general case can be treated similarly by assigning appropriate weights to duplicated points. Any ϕ∈Φˉ can be identified with ϕ=(ϕ1,…,ϕn)⊤∈Rn given by ϕi:=ϕ(Zi) for i∈[n]. For i∈[n−1], let δi:=Z(i+1)−Z(i). Define v1=(v1,j)j=1n∈Rn to have two non-zero entries, namely v1,1:=−1, v1,2:=1. Further, for i=2,…,n−1, let vi=(vi,j)j=1n∈Rn have three non-zero entries, namely
[TABLE]
Finally, let $\bar{\Phi}_n:=\bigl\{\phi\in\mathbb{R}^n:v_i^{\top}\phi\le 0\text{ for }i=1,\ldots,n-1\bigr\}$. By (5), we see that it suffices to compute ϕ∗=(ϕ1∗,…,ϕn∗)⊤∈argmaxϕ∈ΦˉnF(ϕ), where
[TABLE]
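The constraints $v_i^{\top}\phi\le 0$ can be checked numerically as follows. This is a minimal sketch under an assumed normalisation: $v_1^{\top}\phi=\phi_2-\phi_1$ as stated above, and, for $i\ge 2$, $v_i^{\top}\phi$ is taken to be the slope of ϕ on $[Z_{(i)},Z_{(i+1)}]$ minus its slope on $[Z_{(i-1)},Z_{(i)}]$, which has the required three non-zero entries but may differ from the paper's $v_i$ by a positive scaling. The function names are hypothetical.

```python
def constraint_values(Z_sorted, phi):
    """Return [v_1^T phi, ..., v_{n-1}^T phi] under the assumed normalisation.

    v_1^T phi = phi_2 - phi_1
    v_i^T phi = (slope on [Z_(i), Z_(i+1)]) - (slope on [Z_(i-1), Z_(i)]), i >= 2
    """
    n = len(Z_sorted)
    slopes = [(phi[i + 1] - phi[i]) / (Z_sorted[i + 1] - Z_sorted[i])
              for i in range(n - 1)]
    values = [phi[1] - phi[0]]
    values += [slopes[i] - slopes[i - 1] for i in range(1, n - 1)]
    return values


def is_feasible(Z_sorted, phi, tol=1e-12):
    """True if phi is decreasing and concave at the data points."""
    return all(v <= tol for v in constraint_values(Z_sorted, phi))


def active_set(Z_sorted, phi, tol=1e-12):
    """1-based indices i in [n-1] with v_i^T phi = 0 (up to tol)."""
    return {i + 1 for i, v in enumerate(constraint_values(Z_sorted, phi))
            if abs(v) <= tol}
```

Since only the sign and the zero set of each $v_i^{\top}\phi$ are used here, the feasibility check and the active set do not depend on the choice of positive scaling.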
This is a finite-dimensional convex optimisation problem with linear inequality constraints. We propose an active set algorithm for the optimisation of (13), a variant of the algorithm used in Dümbgen et al. (2007) to compute the ordinary univariate log-concave MLE. For ϕ∈Φˉn, we define $A(\phi):=\bigl\{i\in[n-1]:v_i^{\top}\phi=0\bigr\}$ to be the set of ‘active’ constraints. Note that this is the complement in {1,…,n} of the set of ‘knots’ of ϕ.
Given a set A⊆[n−1], we define $V(A):=\bigl\{\phi\in\mathbb{R}^n:v_i^{\top}\phi=0,\ \forall\,i\in A\bigr\}$, and
[TABLE]
Here, the maximiser is unique because F(⋅) is strictly concave on Rn with F(ϕ)→−∞ as ∥ϕ∥→∞. It is convenient to define, for i∈[n−1], vectors bi=(bi,j)j=1n∈Rn by
[TABLE]
where, as usual, we interpret an empty sum as 0, and also define bn:=1n∈Rn, the all-one vector. It follows from this definition that bi⊤vi=−1 for i∈[n−1] and bi⊤vj=0 for all i∈[n] and j∈[n−1] with i≠j. Finally, given ϕ∈Φˉn and ϕ′∈Rn, we define
[TABLE]
We are now in a position to present the full algorithm; see Algorithm 2. It is guaranteed to terminate in finitely many steps with the exact solution.
We complete this section by providing further detail on how to solve the optimisation problem in (14). Given the active set A⊆[n−1], let us define I:=[n]∖A. We index the elements of I by i1<…<iT where T:=∣I∣. Given v∈R(n−1)×n, we also write vA for the matrix in R∣A∣×n obtained by extracting the rows of v with indices in A. Observe that the set {ϕ=(ϕ1,…,ϕn)⊤∈Rn:vAϕ=0} is the subspace of Rn where for j<i1, we have ϕj=ϕi1, and for $j\in\{i_t+1,\ldots,i_{t+1}-1\}$, we have
[TABLE]
It follows that we can solve the optimisation problem (14) by solving instead an unconstrained optimisation over T variables, i.e. by computing
[TABLE]
We solve this latter problem via Newton’s method.
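The reduction from n to T variables can be sketched as below. The code assumes, consistently with ϕ being affine between consecutive knots, that the values at non-knot indices are obtained by linear interpolation in Z; indices are 1-based as in the text, and the function name is hypothetical.

```python
def expand_from_knots(Z_sorted, knots, knot_values):
    """Fill in phi_1, ..., phi_n from its values at the knots.

    Z_sorted    : strictly increasing order statistics
    knots       : increasing 1-based indices i_1 < ... < i_T, with i_T = n
    knot_values : phi at those indices
    """
    n = len(Z_sorted)
    if knots[-1] != n:
        raise ValueError("index n is never active, so it must be a knot")
    phi = [None] * n
    # phi is constant up to and including the first knot.
    for j in range(knots[0]):
        phi[j] = knot_values[0]
    # Linear interpolation in Z between consecutive knots.
    for t in range(len(knots) - 1):
        lo, hi = knots[t] - 1, knots[t + 1] - 1      # 0-based positions
        for j in range(lo, hi + 1):
            w = (Z_sorted[j] - Z_sorted[lo]) / (Z_sorted[hi] - Z_sorted[lo])
            phi[j] = (1 - w) * knot_values[t] + w * knot_values[t + 1]
    return phi
```

A Newton iteration over the T knot values would call such a map at each step to evaluate the objective on the full vector.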
6 Empirical performance
We perform three sets of simulation studies. In the first set, reported in Figure 2, we choose p=100 and suppose that K=Bp(0,1) and μ=0 are known. We generate X1,…,Xn∼iidf0 where we take f0(⋅)∝e−∥⋅∥K2/2, $f_0(\cdot)\propto\mathbf{1}\{\|\cdot\|_K\le p\}$ and f0(⋅)∝e−∥⋅∥K in settings (a), (b) and (c) respectively. We then compute the homothetic log-concave MLE f^n and report the average squared Hellinger errors dH2(f^n,f0) over 50 repetitions and with n∈{2000,4000,6000,8000} in the curve labelled “HLC” in Figure 2. For comparison, we also present the corresponding results with p=100,000 in the curves labelled “HLC(p=100k)”. The simulation results are in line with Theorem 7, which gives a bound on dX2(f^n,f0) that is independent of p.
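Data from settings (a), (b) and (c) can be generated through the representation X=d(Z/∥Z∥K)R of Proposition 3, which for K=Bp(0,1) means multiplying a uniform direction on the unit sphere by an independent radius. The following is a minimal sketch using only the Python standard library; the function names and the seeding scheme are assumptions, not the code used for the reported experiments.

```python
import math
import random


def sample_direction(p, rng):
    """Uniform draw from the unit sphere in R^p (the case K = B_p(0, 1))."""
    g = [rng.gauss(0.0, 1.0) for _ in range(p)]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g]


def sample_radius(setting, p, rng):
    """Draw ||X||_K, whose density is proportional to r^{p-1} e^{phi_0(r)}."""
    if setting == "a":      # f0 prop. to exp(-||x||^2 / 2): R^2 is chi-squared(p)
        return math.sqrt(rng.gammavariate(p / 2.0, 2.0))
    if setting == "b":      # f0 uniform on {||x|| <= p}: R = p * U^{1/p}
        return p * rng.random() ** (1.0 / p)
    if setting == "c":      # f0 prop. to exp(-||x||): R is Gamma(p, 1)
        return rng.gammavariate(p, 1.0)
    raise ValueError("setting must be 'a', 'b' or 'c'")


def sample(setting, n, p, seed=0):
    """Return n independent draws from f0 as a list of length-p lists."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        r = sample_radius(setting, p, rng)
        out.append([r * u for u in sample_direction(p, rng)])
    return out
```

Only the radii ∥Xi∥K enter the estimator when K and μ are known, so for very large p one could draw the radii alone and skip the directions.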
We also compare the K-homothetic log-concave MLE against two alternative methods. In the first of these methods, we write Zi=∥Xi∥K for i∈[n], apply the ordinary univariate log-concave MLE to Z1,…,Zn to obtain a density h^nLC and then estimate f0 by f^nLC, where
[TABLE]
We compute the squared Hellinger errors dH2(f^nLC,f0) and report them in the curve labelled LC in Figure 2. In fact, besides the improved empirical performance of f^n observed in Figure 2, we argue that f^n has several advantages over f^nLC in this context, and list these in roughly decreasing order of importance:
1. The estimator f^nLC is inconsistent at x=0. Indeed h^nLC(x)=0 whenever ∥x∥K<miniZi, and the division by ∥x∥Kp−1 in (15) means that the estimator behaves poorly for small ∥x∥K; see Figure 3. By contrast, f^n is uniformly consistent over compact sets contained in the interior of the support of f0 (Proposition 6);
2. As mentioned in Section 3.2, the estimator f^n attains faster rates of convergence when the true density has a simple structure;
3. The estimator f^n takes values in the relevant class FpK, whereas f^nLC does not;
4. The estimator f^n exists in slightly greater generality than f^nLC (cf. the remark following Proposition 4(iv)).
In the second competing method, we apply a kernel density estimator (with the default settings of the density function in R) to Z1,…,Zn to obtain a density h^nker, and then estimate f0 by f^nker, where $\hat{f}^{\mathrm{ker}}_n(x):=\hat{h}^{\mathrm{ker}}_n(\|x\|_K)/\bigl\{p\lambda_p(K)\|x\|_K^{p-1}\bigr\}$. We compute the squared Hellinger errors dH2(f^nker,f0) and report them in the curve labelled ‘ker’ in Figure 2. The squared Hellinger errors for the kernel density estimator f^nker do not appear in Figure 2(b) because the errors are greater than 0.007 and therefore much larger than those of f^n and f^nLC.
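Both competing methods convert a univariate density estimate h of ∥X∥K into a p-dimensional estimate by dividing by $p\lambda_p(K)\|x\|_K^{p-1}$. A minimal sketch of this conversion, with hypothetical function names and with the estimate expressed as a function of r=∥x∥K, is:

```python
import math


def unit_ball_volume(p):
    """Volume of B_p(0, 1)."""
    return math.pi ** (p / 2.0) / math.gamma(p / 2.0 + 1.0)


def density_from_radial(h, p, vol_K):
    """Map a density h of ||X||_K to r -> h(r) / (p * vol_K * r^{p-1})."""
    def f(r):
        if r <= 0.0:
            raise ValueError("the plug-in estimate is not defined at r = 0")
        return h(r) / (p * vol_K * r ** (p - 1))
    return f
```

For example, with p=2 and the Rayleigh density h(r)=r e^{−r²/2}, the conversion recovers the standard bivariate Gaussian density. If instead h stays bounded away from zero near the origin, the output grows like $r^{-(p-1)}$ as r decreases to 0, which is the poor behaviour for small ∥x∥K noted above.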
In the second set of simulations, reported in Figure 4, we consider the semiparametric setting where K=Σ01/2K0 for some known K0∈K and unknown Σ0∈Sp×p. We estimate Σ0 up to a scaling factor by the empirical covariance matrix Σ^, take K^ to be Σ^1/2K0, and estimate the centering vector μ (taken to be 0) by the empirical mean vector μ^. We then construct Z~i:=∥Xi−μ^∥K^ for i∈[n], compute ϕ^n:=argmaxϕ∈Φn−1∑i=1nϕ(Z~i)−pλp(K^)∫0∞rp−1eϕ(r)dr, and construct the density estimate f^n(⋅)=eϕ^n(⋅). In all cases, we generate Σ0 as UDU⊤, where U is generated according to Haar measure on the set of orthogonal p×p real matrices, and where D∈Rp×p is a diagonal matrix whose jth diagonal entry is 1.2j. In Figures 4(a) and 4(c), we take K0=Bp(0,1), while in Figures 4(b) and 4(d), we take K0=[−1,1]p. In Figures 4(a) and 4(b), we fix p=40 and report the squared Hellinger errors dH2(f^n,f0) for n∈{2000,5000,10000,20000}, while in Figures 4(c) and 4(d), we fix n=10000 and report the corresponding squared errors with p∈{10,20,40,60}. In the settings of Figures 4(a) and 4(c), we see the advantage conferred by the smoothness of ϕ0, in line with our theoretical guarantees from Corollary 12. On the other hand, when K0=[−1,1]p in Figures 4(b) and 4(d), r0 is much larger than in the K0=Bp(0,1) case (it is equal to p1/2 instead of 1), and this makes the problem significantly harder, which is again in agreement with Corollary 12.
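A covariance matrix of the form UDU⊤ with U drawn from Haar measure can be generated by a QR decomposition of a Gaussian matrix followed by a sign correction. The sketch below assumes NumPy; the function names are hypothetical and the diagonal entries of D are supplied by the caller.

```python
import numpy as np


def haar_orthogonal(p, rng):
    """Draw U from Haar measure on the orthogonal p x p matrices via QR."""
    G = rng.standard_normal((p, p))
    Q, R = np.linalg.qr(G)
    # Remove the sign ambiguity of the QR factorisation: multiply column j
    # of Q by the sign of R[j, j], which makes the law of Q exactly Haar.
    return Q * np.sign(np.diag(R))


def random_covariance(diag_entries, seed=0):
    """Sigma_0 = U D U^T with U Haar-distributed and D = diag(diag_entries)."""
    rng = np.random.default_rng(seed)
    d = np.asarray(diag_entries, dtype=float)
    U = haar_orthogonal(len(d), rng)
    return (U * d) @ U.T
```

The eigenvalues of the resulting matrix are exactly the supplied diagonal entries, so the conditioning of Σ0 is controlled through D alone.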
Finally, in the third set of simulations, given in Figure 5, we take μ=0 to be known and estimate K nonparametrically using Algorithm 1, with $M=\lceil n^{(p-1)/(p+1)}\rceil$. As with the previous set of simulations, once we obtain K^, we construct Z~i:=∥Xi∥K^ for i∈[n], compute ϕ^n:=argmaxϕ∈Φn−1∑i=1nϕ(Z~i)−pλp(K^)∫0∞rp−1eϕ(r)dr, and set f^n=eϕ^n. The choices of K were the same as those for K0 in the corresponding panels of Figure 4. In Figures 5(a) and 5(b), we take p=6 and n∈{6000,12000,24000,36000}, while in Figures 5(c) and 5(d), we fix n=24000 and take p∈{2,4,6,8}. We observe similar phenomena to those seen in the case where K is known up to a positive definite transformation.
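When two densities are homothetic with respect to the same K and μ, Lemma S1 reduces their squared Hellinger distance to a one-dimensional integral over the corresponding densities of ∥X−μ∥K. The sketch below approximates that integral with the trapezoidal rule. It assumes the convention $d_{\mathrm{H}}^2(f,g)=\int(\sqrt{f}-\sqrt{g})^2$, without a factor 1/2, which is consistent with $d_{\mathrm{H}}$ taking values in $[0,2^{1/2}]$ in the supplementary material; the function name and the truncation point `upper` are hypothetical.

```python
import math


def squared_hellinger_radial(h1, h2, upper, n_grid=20000):
    """Trapezoidal approximation of int_0^upper (sqrt(h1) - sqrt(h2))^2 dr."""
    step = upper / n_grid
    total = 0.0
    for k in range(n_grid + 1):
        r = k * step
        d = math.sqrt(h1(r)) - math.sqrt(h2(r))
        weight = 0.5 if k in (0, n_grid) else 1.0   # half weight at the endpoints
        total += weight * d * d
    return total * step
```

The truncation point should be chosen large enough that both radial densities have negligible mass beyond it.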
Supplementary material to ‘High-dimensional nonparametric density estimation via symmetry and shape constraints’
(i) This follows from the assumption that 0∈int(K).
(ii) This follows from, e.g., Rockafellar (1997, Corollary 9.7.1).
(iii) Let x∈Rp and suppose that ∥x∥K<1. Then there exists ϵ>0 such that x∈(1−ϵ)K by the second claim. Since K is convex, we have that x+ϵK⊆K. Moreover, since K contains an open neighbourhood of 0, we see that x is an interior point of K. Conversely, if x is an interior point of K, then there exists ϵ>0 such that Bp(x,ϵ)⊆K. Hence there exists x′∈K with ∥x′∥K>∥x∥K, so the conclusion follows from (ii).
(iv) See Royden and Fitzpatrick (2010, Proposition 14.24).
∎
Any density of the form f(⋅)=eϕ(∥⋅−μ∥K) for some K∈K, μ∈Rp and ϕ∈Φ is upper semi-continuous, and is log-concave by Proposition 1(iv). Moreover, writing ϕ−1(s):=sup{r∈[0,∞):ϕ(r)≥s} for s∈(−∞,log∥f∥∞), we have
[TABLE]
for all t∈(0,∥f∥∞) by Proposition 1(ii) and Proposition 1(iv). Hence f is homothetic, as required.
Conversely, suppose that f is an upper semi-continuous, homothetic and log-concave density on Rp, so there exist a decreasing function r:(0,∥f∥∞)→[0,∞), a set A∈B(Rp) with 0∈int(A) and μ∈Rp such that {x:f(x)≥t}=r(t)A+μ for every t∈(0,∥f∥∞). Then in particular, A∈K. Since 0<λp({x:f(x)>0})=limn→∞λp({x:f(x)≥1/n}), there exists t∗∈(0,∥f∥∞) such that r(t∗)>0. Thus, λp(r(t∗)A)=r(t∗)pλp(A)>0 and λp(r(t∗)A)≤(1/t∗)∫r(t∗)A+μf(x)dx≤1/t∗<∞. By replacing A with r(t∗)A and r(⋅) with r(⋅)/r(t∗), we may therefore assume without loss of generality that r(t∗)=1.
We now claim that r is left continuous. To see this, let t∈(0,∥f∥∞) and let (tn∈(0,∥f∥∞))n∈N be a sequence such that tn↗t. Then, 0≤r(t)≤limtn↗tr(tn)≤r(tn0) for any n0∈N. Since A∈K, we have
[TABLE]
Since λp(A)<∞ and A∈K, we have limtn↗tr(tn)≤r(t), so r(t)=limtn↗tr(tn). We have thus shown that r is left continuous and may define r−1:[0,∞)→[0,∥f∥∞] by r−1(u):=sup{t∈[0,∥f∥∞):r(t)≥u} for u∈[0,∥r∥∞) and, if ∥r∥∞<∞, we define r−1(∥r∥∞):=sup{t∈[0,∥f∥∞):r(t)≥∥r∥∞} and r−1(u)=0 for any u∈(∥r∥∞,∞). Notice that for any u∈[0,∞) and any t∈(0,∥f∥∞), we have r(t)≥u if and only if r−1(u)≥t.
We now set K:=A and ϕ(s):=logr−1(s) for s∈[0,∞) (with the convention that log0:=−∞). Then the function x↦eϕ(∥x−μ∥K) is well-defined on Rp and moreover for any u∈[0,∞) and any t∈(0,∥f∥∞), we have
[TABLE]
We therefore conclude that f(⋅)=eϕ(∥⋅−μ∥K), and hence ϕ∈Φ, as desired.
Now suppose that f(⋅)=eϕ(∥⋅−μ∥K)=eϕ~(∥⋅−μ~∥K~) for some K~∈K, μ~∈Rp and ϕ~∈Φ. Suppose further that f is not a uniform density and that ∥f∥∞=eM for some M∈R. Then, by the log-concavity of f, there exist c<c′<M such that f−1({ec})≠∅ and f−1({ec′})≠∅. Let a:=sup{r≥0:ϕ(r)≥c}, a~:=sup{r≥0:ϕ~(r)≥c}, and note that a,a~>0. If x∈Rp satisfies ϕ(∥x−μ∥K)≥c, then ∥x−μ∥K≤a and thus x∈aK+μ. If on the other hand x∈aK+μ, then ϕ(∥x−μ∥K)≥ϕ(a)≥c since ϕ is upper semi-continuous. Thus,
[TABLE]
By the same reasoning, {x:f(x)≥ec}=a~K~+μ~ and we therefore have that K~=(a/a~)K+(μ−μ~)/a~. But, writing a′:=sup{r≥0:ϕ(r)≥c′} and a~′:=sup{r≥0:ϕ~(r)≥c′}, we also have K~=(a′/a~′)K+(μ−μ~)/a~′, and moreover, a~′≠a~. We deduce that μ~=μ (and aa~′=a′a~), that K~=(a/a~)K and that ϕ~(r)=ϕ((a/a~)r) for all r∈[0,∞), as required.
If f is a uniform density, then there must exist r0>0 and s0∈R such that ϕ(r)=s0 for r∈[0,r0] and ϕ(r)=−∞ for r>r0. Similarly, there exist r~0>0 and s~0∈R such that ϕ~(r)=s~0 for r∈[0,r~0] and ϕ~(r)=−∞ for r>r~0. It follows that ∥x−μ~∥K~≤r~0 if and only if ∥x−μ∥K≤r0, so r~0K~+μ~=r0K+μ. We conclude that K~=(r0/r~0)K+(μ−μ~)/r~0 and ϕ~(r)=ϕ((r0/r~0)r) for all r∈[0,∞), as required.
∎
For n∈N, define Kn:=K∖(1−1/n)K and let Zn be a random vector, independent of R, distributed uniformly on Kn. We claim that Zn→dZ/∥Z∥K as n→∞. To see this, let Zn′ be a random vector distributed uniformly on (1−1/n)K and let Wn be a Bernoulli(qn,p) random variable, where qn,p:=1−(1−1/n)p, independent of (Zn,Zn′). Since Z is uniformly distributed on K, we have that Z=dWnZn+(1−Wn)Zn′ and thus $\frac{Z}{\|Z\|_K}\stackrel{d}{=}W_n\frac{Z_n}{\|Z_n\|_K}+(1-W_n)\frac{Z_n'}{\|Z_n'\|_K}$.
We observe that Zn′/∥Zn′∥K=dZ/∥Z∥K since Zn′=d(1−1/n)Z. Now, writing ψZ/∥Z∥K and ψZn/∥Zn∥K for the characteristic functions of Z/∥Z∥K and Zn/∥Zn∥K respectively, we have that ψZ/∥Z∥K(t)=qn,pψZn/∥Zn∥K(t)+(1−qn,p)ψZ/∥Z∥K(t) for all t∈Rp. We deduce that Zn/∥Zn∥K=dZ/∥Z∥K. Since
[TABLE]
it follows that Zn→dZ/∥Z∥K, as claimed.
Define Xn:=ZnR, so that Xn has density
[TABLE]
for any x∈Rp. We deduce that whenever x∈Rp is non-zero and such that ϕ is continuous at ∥x∥K,
[TABLE]
Thus, since ϕ is continuous Lebesgue almost everywhere, by Scheffé’s lemma, Xn converges in distribution to a random variable X with density f. We conclude that (Z/∥Z∥K)R=dX, as desired.
∎
Remark:
Alternatively, we may prove the first claim in Proposition 3 by defining, for any t∈(0,1), an operator At:K∖{0}→K∖tK of the form $A_t(x)=\bigl\{1-t^p+\bigl(t/\|x\|_K\bigr)^p\bigr\}^{1/p}x$ and then showing that At(Z) is uniformly distributed on K∖tK for any t∈(0,1). One can then show that if Zn∼Unif(Kn) then Zn→dZ/∥Z∥K since limt↗1At(Z)=Z/∥Z∥K.
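The key computation behind this alternative argument can be sketched as follows. It uses the facts that, when Z is uniform on K, $\mathbb{P}(\|Z\|_K\le s)=\lambda_p(sK)/\lambda_p(K)=s^p$ for $s\in[0,1]$, so that $\|Z\|_K^p$ is uniform on $[0,1]$, and that $A_t$ leaves the direction $Z/\|Z\|_K$ unchanged.

```latex
\|A_t(Z)\|_K^p
  = \bigl\{1 - t^p + (t/\|Z\|_K)^p\bigr\}\,\|Z\|_K^p
  = (1 - t^p)\,\|Z\|_K^p + t^p ,
\qquad
\mathbb{P}\bigl(\|A_t(Z)\|_K \le s\bigr) = \frac{s^p - t^p}{1 - t^p}
\quad \text{for } s \in [t, 1].
```

The right-hand side equals $\{\lambda_p(sK)-\lambda_p(tK)\}/\{\lambda_p(K)-\lambda_p(tK)\}$, which is the distribution function of $\|W\|_K$ when W is uniform on K∖tK.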
Remark:
In fact, from this proof, we see that Proposition 3 holds more generally whenever K is compact and star-shaped at the interior point 0, and ϕ is continuous Lebesgue almost everywhere.
(i) Fix ϕ∈Φa. Observe that if limr→∞ϕ(r)=c>−∞, then L(ϕ,Q)≤ϕ(a)−ec∫a∞rp−1dr=−∞. Otherwise limr→∞ϕ(r)=−∞, and then there exist α>0,β∈R such that ϕ(r)≤−αr+β. Hence,
[TABLE]
(ii) Now suppose that Q({a})=1 and let $e^{\phi_n(r)}:=n\mathbf{1}\{r\in[a,a+n^{-1}]\}$. Then,
[TABLE]
as n→∞.
(iii) Finally, suppose that Q∈Qa. For ϕ(r):=−r, we have
[TABLE]
so supϕ∈ΦaL(ϕ,Q)>−∞.
For δ,ϵ>0, let $\mathcal{Q}_a(\delta,\epsilon):=\bigl\{Q\in\mathcal{Q}_a:Q\bigl((a+\delta,\infty)\bigr)>\epsilon\bigr\}$. Then, since Q({a})<1, we have Q∈Qa(δ,ϵ) for some δ,ϵ>0. We also write M:=ϕ(a) and M′:=ϕ(a+δ). Then by the concavity of ϕ,
[TABLE]
If M>0 and (M−M′)ϵ≤2M, then
[TABLE]
On the other hand, if M>0 and (M−M′)ϵ>2M, then from (S1.2) we see that L(ϕ,Q)≤−M+1. We deduce that there exists M∗>0, depending only on δ, ϵ and p, such that
[TABLE]
The existence of ϕ∗ then follows from the proof of Theorem 2.2 in Dümbgen et al. (2011).
(iv) By the change of variable formula (e.g. Billingsley, 1995, Theorem 16.13), we have ∫Rpϕ(∥x∥)dP(x)=∫[0,∞)ϕ(r)dQ(r) for all ϕ∈Φ. The result then follows from (iii), specialised to the case a=0.
∎
(i) For any ϕ∈Φa, we may define ϕα∈Φαa by ϕα(r):=ϕ(r/α)−plogα. The map ϕ↦ϕα is a bijection from Φa to Φαa. Let ϕ∗=ϕa∗(Q). Then, for any ϕ∈Φa, we have
[TABLE]
This establishes that ϕα∗=ϕαa∗(Qα) and thus proves scale equivariance.
(ii) For any t>0, we have
[TABLE]
Choose t0>0 small enough that ϕ∗+t0Δ∈Φa. Since pλp(K)∫a∞rp−1eϕ∗(r)dr<∞, we must have ϕ∗(r)→−∞ as r→∞, and hence, by reducing t0>0 if necessary, we may assume that pλp(K)∫a∞Δ(r)rp−1eϕ∗(r)+t0Δ(r)dr<∞. Now, for t∈(0,t0],
[TABLE]
Hence, if ∫a∞Δ(r)h∗(r)dr>−∞, then we may apply the dominated convergence theorem to (S1.3) to take the limit as t↘0 and reach the desired conclusion. On the other hand, if ∫a∞Δ(r)h∗(r)dr=−∞, then for every t∈(0,t0],
[TABLE]
The result follows.
(iii) Letting Δ(r)=−r, this is a consequence of (ii).
(iv) Letting $\Delta(r)=\log\bigl(h_0(r)/h^*(r)\bigr)$, this also follows from an application of (ii).
∎
The proof is very similar to (in fact, somewhat more straightforward than) the proof of Dümbgen et al. (2011, Theorem 4.5), so we focus on the main differences. We first observe that if Xn∼Pn and X∼P are defined on the same probability space, then
[TABLE]
Taking ϵ>0 such that Bp(0,ϵ)⊆K, we have that supx≠0∥x∥K/∥x∥≤1/ϵ<∞. Hence, writing Qn and Q for the distributions of ∥Xn∥ and ∥X∥ respectively, we deduce that dW(Qn,Q)≤dW(Pn,P)/ϵ→0. It follows that ∫0∞rdQn(r)→∫0∞rdQ(r)<∞, and limsupn→∞Qn({0})≤Q({0})<1, so Qn∈Q for n≥n0, say. For such n, we write ϕn∗:=ϕ∗(Qn) and ϕ∗:=ϕ∗(Q). Let n0≤n1<n2<… be an arbitrary, strictly increasing sequence of positive integers. By extracting a further subsequence if necessary, we may assume that L(ϕnk∗,Qnk)→α∈[−∞,∞]. First note that, by considering the function ϕ(r)=−r,
[TABLE]
Our next claim is that limsupk→∞supr∈[0,∞)ϕnk∗(r)<∞. To see this, recall the definition of the classes Q(δ,ϵ)≡Q0(δ,ϵ) from the proof of Proposition 4, and let δ0,ϵ0>0 be such that Q∈Q(δ0,ϵ0). Since $\liminf_{k\to\infty}Q_{n_k}\bigl((\delta_0,\infty)\bigr)\ge Q\bigl((\delta_0,\infty)\bigr)>\epsilon_0$, we see from the proof of Proposition 4 that our claim follows. This means that α<∞.
Let $r_0:=\sup\bigl\{r\in[0,\infty):Q\bigl([0,r)\bigr)<1\bigr\}$. Our next claim is that liminfk→∞ϕnk∗(r)>−∞ for all r∈[0,r0). To see this, note by our first claim that we may assume without loss of generality that there exists M∗≥max(α,0) such that supk∈Nsupr∈[0,∞)ϕnk∗(r)≤M∗. Then, for any r∈[0,r0),
[TABLE]
Since $Q\bigl((r,\infty)\bigr)>0$, we deduce that
[TABLE]
as required.
These two claims allow us to extract a further subsequence (ϕnk(ℓ)∗) that converges in an appropriate sense to a limit ϕ∗∈Φ (in particular, this convergence occurs Lebesgue almost everywhere). It turns out that ϕ∗=ϕ∗(Q), that L(ϕnk(ℓ)∗,Q)→L(ϕ∗,Q), and, writing fℓ∗:=f∗(Pnk(ℓ)), we have ∫Rp∣fℓ∗−f∗∣→0. The desired total variation convergence (6) follows. See the proof of Theorem 4.5 of Dümbgen et al. (2011) for details.
For the final claim, note that our previous argument allows us to conclude that f∗(Pn) converges to f∗(P) Lebesgue almost everywhere. The conclusion therefore follows from Rockafellar (1997, Theorem 10.8).
∎
Let h0 denote the density of ∥X1∥K, so that h0(r)=pλp(K)rp−1eϕ(r) for some ϕ∈Φ by Proposition 3. Let Zi:=∥Xi∥K for i∈[n], and write Qn for the empirical distribution of Z1,…,Zn. Now let h^n:=h∗(Qn), so that h^n(r)=pλp(K)rp−1eϕ^n(r), where ϕ^n:=ϕ∗(Qn)=φ∗(Pn), by Proposition 4(iv). Then h^n(Zi)/h0(Zi)=f^n(Xi)/f0(Xi) for i∈[n], so dX2(f^n,f0)=dX2(h^n,h0). By scale equivariance of ϕ∗ (Proposition 5(i)), together with the scale invariance of the loss function, we may assume without loss of generality that σh0=1. By Lemma S5, there exist universal constants Cμ,Cσ,C>0 such that $\mathbb{P}\bigl(\hat{h}_n\notin\mathcal{H}_0(h_0,C_\mu,C_\sigma)\bigr)\le C/n$. Moreover, by Lemma S8, there exists a universal constant K>0 such that for every δ>0,
[TABLE]
Define Ψ(δ):=max(Kδ3/4,δ), so that δ↦Ψ(δ)/δ2 is decreasing. By choosing δ∗:=K0n−2/5 for a suitably large universal constant K0>0, we may apply Kim et al. (2018b, Theorem 10) (a minor restatement of van de Geer (2000, Corollary 7.5)), to deduce that there exists a universal constant K∗>0 such that for n≥8,
[TABLE]
where, to obtain the final inequality, we have applied Lemmas S9 and S10.
∎
Fix h∗∈H(k) where h∗(r)=pλp(K)rp−1eϕ∗(r) and ϕ∗∈Φ(k), let I1,…,Ik be the k intervals on which ϕ∗ is affine, with Ij=[aj−1,aj], and let $r_0:=\sup\bigl\{r\in[0,\infty):\phi_*(r)>-\infty\bigr\}$. We work throughout on the probability 1 event that {Z1,…,Zn}∩{a0,a1,…,ak}=∅. Define Zj:={i:Zi∈Ij} and nj:=∣Zj∣. Let J:={j∈[k]:nj≥8} be the set of indices of intervals with at least eight data points, and let Jc:=[k]∖J. Define Φ~ to be the set of upper semi-continuous functions ϕ:[0,∞)→[−∞,∞) such that $\phi|_{I_j}$ is decreasing and concave for each j∈[k], and such that ϕ(r)=−∞ for r>r0. Note that a function ϕ∈Φ~ need not be globally decreasing and, in fact, need not be continuous on [0,r0].
Given any ϕ∈Φ~ and j∈J, let ϕ(j)(r):=ϕ(r)+log(n/nj) for r∈Ij, and let ΦIj:={ϕ∣Ij:ϕ∈Φ}. Now, for j∈J, define
[TABLE]
and let ϕ~n(r):=ϕ~n(j)(r)−log(n/nj) whenever r∈Ij for some j∈J, and ϕ~n(r):=−∞ otherwise.
Then, for any ϕ∈Φ~,
[TABLE]
Arguing similarly to the second paragraph of Section 2.3, it follows that the function h~n defined by h~n(r):=pλp(K)rp−1eϕ~n(r) is a density. Moreover, for j∈J, the function h~n(j):=njnh~n∣Ij is a density. Writing pj:=∫Ijh0, and h0(j):=pj1h0∣Ij, we deduce from (S2) that
[TABLE]
Now, since nj∼Bin(n,pj) and logx≤x−1 for x>0, we have
[TABLE]
For the third term in (S2), we have by Lemmas S9 and S10, that for n≥8 (cf. (S2)),
[TABLE]
Finally, to bound the first term in (S2), for j∈[k], let us first define qj:=∫Ijh∗(r)dr, h∗(j):=h∗/qj, ν∗,j2:=dH2(h0(j),h∗(j)), ν∗2:=∑j=1kpjν∗,j2, J0:={j∈J:njν∗,j2≥1} and temporarily assume that k≤e−1/4n. Note that h∗(j)∈Haj−1(1), and h0(j)≪h∗(j) for j∈J. It follows by Proposition 9, applied conditionally on n1,…,nk, a simple extension of Jensen’s inequality using the fact that x↦log5/4x is concave on [e1/4,∞) (e.g. Han et al., 2018, Lemma 2) and the fact that k↦klog5/4(en/k) is increasing for k∈[1,e−1/4n],
[TABLE]
To bound the first term of (S2), observe that by two applications of Jensen’s inequality,
[TABLE]
But
[TABLE]
Moreover, by three further applications of Jensen’s inequality, and using the fact that x↦log5x is concave for x≥e5, we have
[TABLE]
where the last step follows from the fact that x↦x1/5logx2e5 is increasing for x≤2. Combining (S2), (S2),(S2.7), (S2), (S2), (S2) and (S2), the result follows in the case k≤e−1/4n.
By the scale equivariance described in Proposition 5(i), we may assume without loss of generality that σh0=1. Define $\nu:=\inf\bigl\{d_{\mathrm{H}}(h_0,h):h\in\mathcal{H}_a^{(1)},\,h_0\ll h\bigr\}\in[0,2^{1/2}]$. By Lemma S11, if δ∈(0,2−9−ν), then for every ϵ>0, it holds that
[TABLE]
On the other hand, if δ≥2−9−ν, then by Lemma S8, for every ϵ>0, we have that
[TABLE]
It follows that
[TABLE]
Define Ψ(δ):=Cδ3/4(δ+ν)1/4{log5/8(1/δ)∨1}, where the universal constant C>0 is chosen such that
[TABLE]
Set $\delta_n:=K\bigl\{\frac{\nu^{2/5}}{n^{4/5}}\log\frac{en}{\nu}+\frac{1}{n}\log^{5/4}(en)\bigr\}^{1/2}$ for a universal constant K>0 to be chosen later. Then, because Ψ(δ)/δ2 is non-increasing, we have
[TABLE]
By choosing the universal constant K>0 sufficiently large, we can ensure that this ratio is larger than the universal constant required to apply Theorem 10 in the online supplement of Kim et al. (2018b) (a minor restatement of van de Geer (2000, Corollary 7.5)). We deduce from this result that there exists a universal constant C>0 such that for δ≥δn,
First, we note that since K^,μ^ depend only on Xn+1,…,X2n, they are independent of X1,…,Xn. Moreover, since f0∈FpK,μ, we may, by Proposition 2, rescale K if necessary to assume without loss of generality that
[TABLE]
Once we prove the proposition with this assumption, the more general conclusion follows immediately from the fact that both ∥μ^−μ∥K/Ef01/2(∥X1−μ∥K2) and infα>0dscale(αK^,K) remain unchanged if we rescale K. Under assumption (S3.14), the event Ec1,c2 defined in (8) then takes the form
[TABLE]
and we choose universal constants c1,c2>0 small enough that the conclusions of Lemmas S20, S21, S25, S26, S27, S31, S32, and S33 hold.
Note that, by Proposition 5(i), h^n is scale equivariant. Thus, f^n is by construction also scale equivariant and consequently, we may rescale K^ if necessary to assume that, on event Ec1,c2, we have that dscale(K^,K)<c2. Now let $a_n:=n^{-4/5}+d_{\mathrm{KL}}^2(\tilde{h}_n,\tilde{h}_0)+d_{\mathrm{H}}^2(\check{f}_n,f_0)+d_{\mathrm{H}}^2\bigl(\tilde{h}_0,h_0\frac{\lambda_p(\hat{K})}{\lambda_p(K)}\bigr)$. By Lemmas S20, S21, S25, S26, and S27, together with the fact that dH2(f,g)≤dTV(f,g) for all densities f,g, there exist universal constants Cμ>0,Cσ>1,C>0 such that on the event Ec1,c2,
[TABLE]
where $\tau^*:=\sup_{x\in\mathbb{R}^p\setminus\{0\}}\|x\|_{\hat K}/\|x\|_K$ and $\tau_*:=\inf_{x\in\mathbb{R}^p\setminus\{0\}}\|x\|_{\hat K}/\|x\|_K$. We now claim that $\tau^*-\tau_*\le 2d_{\mathrm{scale}}(\hat K,K)$. To see this, fix any ϵ>dscale(K^,K) and x∈Rp∖{0}. Then,
[TABLE]
Since ϵ>dscale(K^,K) and x∈Rp∖{0} were chosen arbitrarily, we have that $\tau^*-1\le d_{\mathrm{scale}}(\hat K,K)$ and that $1-\tau_*\le d_{\mathrm{scale}}(\hat K,K)$ as desired. The first part of the proposition then follows.
The second claim follows from the same argument except that we apply Lemmas S31, S32, and S33 in the final inequality in (S3).
Finally, for the third claim of the proposition, we define
[TABLE]
We may then apply the second claim of Lemma S21 and Lemmas S20, S31, S32, and S33 in the final inequality in (S3) to obtain the desired conclusion.
∎
Since K^ is location invariant, μ^ is location equivariant, and ∥μ^−μ∥K is also location invariant, we assume without loss of generality that μ=0. Moreover, K^ is scale equivariant in the sense that if Σ~∈Sp×p and we define X~i:=Σ~1/2Xi for i∈[n], let Σ^′:=n−1∑i=1nX~iX~i⊤ and K^′:=Σ^′1/2K0, then K^′=Σ~1/2K^. Since infα>0dscale(αK^,K) is also scale invariant in the sense that infα>0dscale(αK^,K)=infα>0dscale(αΣ~1/2K^,Σ~1/2K) for any Σ~∈Sp×p, we assume without loss of generality that Σ=Ip. Thus, there exists α0>0 such that K=α0K0.
For each j∈[p], the random variable X1j has a univariate log-concave density with mean 0 and variance 1. By e.g. Feng et al. (2018, Proposition S2(iii)), we have E(∣X1j∣k)≤2ek! for all integers k≥2. Then, by Bernstein’s inequality, there exists a universal constant C1>0 such that
[TABLE]
Let E1:={∥μ^∥≤C1(p/n)1/2log1/2(en)}, so that P(E1c)≤p/n2≤n−1.
By Adamczak et al. (2010, Theorem 4.1), there exists a universal constant C2>0 such that, with probability at least 1−1/n,
[TABLE]
Let E2 denote the event that (S3.17) holds. We work on the event E1∩E2 for the rest of this proof, so that P(E1c∪E2c)≤2/n.
Now
[TABLE]
for some universal constant C>0.
Fix an arbitrary x∈Rp with ∥x∥α0K^=1. Then ∥Σ^−1/2x∥K=1 since α0K^=α0Σ^1/2K0=Σ^1/2K. Thus,
[TABLE]
Hence α0K^⊆(1+r0∥Σ^1/2−Ip∥op)K. Likewise,
[TABLE]
where the final inequality follows from (S3.18) and the assumption that Cr0nplog3(en)≤1/2. Thus, we also have that K⊆(1+2r0∥Σ^1/2−Ip∥op)α0K^. Therefore, by (S3.18) again,
[TABLE]
Moreover, since Σ=Ip and Bp(0,r1)⊆K⊆Bp(0,r2), we find that
Algorithm 1 is scale equivariant in the sense that for any α>0, if we let Xi′:=αXi for i∈[n+M] and let K^,K^′ be the resulting outputs of Algorithm 1 on inputs {Xi}i=1n+M and {Xi′}i=1n+M respectively, then K^′=αK^. Since the left-hand side of (10) is invariant to scaling of K^, we assume without loss of generality that Ef0(∥X1∥2)=1. We also assume that Ef0(∥X1∥K)=1, which can be done without loss of generality by Proposition 2 and the fact that the left-hand side of (10) is invariant to the scaling of K.
Define $\tilde{K}:=\mathrm{conv}\bigl\{\frac{\theta_1}{\|\theta_1\|_K},\ldots,\frac{\theta_M}{\|\theta_M\|_K}\bigr\}$, and define the events
[TABLE]
Under the assumption that $r_0^2 n^{-1/(p+1)}\log^2(en)\le 1/64$, we have that $M/\log M>r_0^{2(p-1)}64^{p-1}$. Thus, on the event E∗∩E~∗, we have that dscale(K~,K)<1 and that, by Lemma S46,
[TABLE]
By Lemmas S43 and S36, it holds that P(E∗∩E~∗)≥1−Cp,r0n−p/(p+1) for some Cp,r0>0, depending only on p and r0, as desired.
∎
Our first result, amongst other things, reveals the density of the random variable ∥X−μ∥K when X has a density belonging to FpK.
Lemma S1**.**
Let μ∈Rp, K∈K, and let g:[0,∞)→R be integrable. Let B∈B([0,∞)) and let A:={x∈Rp:∥x−μ∥K∈B}. Then
[TABLE]
Proof.
By transforming x−μ to x if necessary, we may assume without loss of generality that μ=0. Define measures μ1K,λ1K on $\bigl([0,\infty),\mathcal{B}([0,\infty))\bigr)$ by μ1K(B′):=λp({x∈Rp:∥x∥K∈B′}) and λ1K(B′):=pλp(K)∫B′rp−1dr for B′∈B([0,∞)). We claim that μ1K=λ1K. To prove the claim, first consider the case where B′=[0,c) for some c>0. Then
[TABLE]
For n∈N, define $\mathcal{B}_{n}^{\prime}:=\bigl\{B^{\prime\prime}\in\mathcal{B}([0,n))\,:\,\mu_{1}^{K}(B^{\prime\prime})=\lambda_{1}^{K}(B^{\prime\prime})\bigr\}$. Now μ1K,λ1K are finite measures on [0,n), and Bn′ is a σ-algebra that contains {[0,c):c∈(0,n]} by (S4.19). Hence Bn′=B([0,n)). For a general B′∈B([0,∞)), we have that
[TABLE]
and we deduce that μ1K=λ1K, as claimed.
Now suppose that g is a non-negative measurable function, fix B∈B([0,∞)) and A={x∈Rp:∥x∥K∈B}. Let s1≤s2≤… be a sequence of non-negative, simple functions on [0,∞) such that sk↗g, so that ∫Ask(∥x∥K)dx=pλp(K)∫Brp−1sk(r)dr by our claim. By two applications of the monotone convergence theorem, we conclude that ∫Ag(∥x∥K)dx=pλp(K)∫Brp−1g(r)dr. The case where g is integrable can be handled by applying this result to the positive and negative parts of g.
∎
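As a numerical sanity check of Lemma S1, the sketch below compares both sides of the identity ∫Ag(∥x∥K)dx=pλp(K)∫Brp−1g(r)dr in one hand-picked case. The choices p=2, K=[−1,1]2 (so that ∥⋅∥K is the sup-norm and λp(K)=4), g(r)=e−r, B=[0,1], and the midpoint-rule helpers `lhs_midpoint` and `rhs_midpoint` are illustrative assumptions, not part of the paper.

```python
import math

def gauge_linf(x1, x2):
    """Minkowski functional of K = [-1, 1]^2, i.e. the sup-norm (illustrative choice of K)."""
    return max(abs(x1), abs(x2))

def lhs_midpoint(g, c, m=400):
    """Midpoint rule for the integral of g(||x||_K) over A = {x : ||x||_K <= c}, p = 2."""
    h = 2.0 * c / m
    total = 0.0
    for i in range(m):
        x1 = -c + (i + 0.5) * h
        for j in range(m):
            x2 = -c + (j + 0.5) * h
            total += g(gauge_linf(x1, x2))
    return total * h * h

def rhs_midpoint(g, c, p=2, vol_K=4.0, m=4000):
    """Midpoint rule for p * lambda_p(K) * int_0^c r^(p-1) g(r) dr."""
    h = c / m
    return p * vol_K * h * sum(((k + 0.5) * h) ** (p - 1) * g((k + 0.5) * h) for k in range(m))

g = lambda r: math.exp(-r)
lhs = lhs_midpoint(g, 1.0)
rhs = rhs_midpoint(g, 1.0)
# Closed form of the right-hand side: 8 * int_0^1 r e^{-r} dr = 8 (1 - 2/e).
exact = 8.0 * (1.0 - 2.0 / math.e)
print(lhs, rhs, exact)
```

The two quadratures agree up to discretisation error, which is all this check is meant to show; it says nothing about general K or g.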
The aim of the next three results is to elucidate the way in which the first two moments of the empirical distribution Qn of a set of n data points in [0,∞) change under the projection ha∗. These results enable us to show that if the data are drawn independently from a common distribution on [0,∞), then with high probability, the first two moments of h^n:=ha∗(Qn) are close to their population analogues.
Our first lemma concerns bounds on μh^n, and is expressed in terms of the function ρ≡ρa,p:[0,∞)→(0,∞) defined by
[TABLE]
Basic properties of the function ρ are given in Lemma S49.
Lemma S2**.**
Fix a≥0, and suppose that Z1,…,Zn are real numbers in the interval [a,∞) that are not all equal to a. Let Qn be the empirical distribution corresponding to Z1,…,Zn. Let h^n:=ha∗(Qn), so that h^n(r)=pλp(K)rp−1eϕ^n(r) for r≥a, for some ϕ^n∈Φ. Then, writing Zˉ:=n−1∑i=1nZi, as well as r0:=sup{r∈[a,∞):ϕ^n(r)=ϕ^n(a)} and s0:=r0−a, we have
By (5), we have that r0=Zi for some i because Qn is an empirical distribution. Moreover, the right derivative of ϕ^n at r0 is strictly negative. Hence, by Proposition 5(ii), applied to the functions Δ(r):=±(r−r0)+, we have
[TABLE]
Now, since ϕ^n(r)=ϕ^n(a) for all r∈[a,r0], we have
[TABLE]
We deduce that
[TABLE]
where we used Lemma S50 to obtain the final bound. From (S4.2) and (S4.2), we find that
[TABLE]
In particular, μh^n≥Zˉ−ρ(s0)s0.
Now, for i=1,…,n, let Z~i:=min(Zi,s0+a). Then n−1∑i=1n(r0−Zi)+=s0−n−1∑i=1n(Z~i−a)≥0 and n−1∑i=1nZ~i≤Zˉ. Hence
[TABLE]
where the final inequality follows from Lemma S49(iii). The second lower bound for μh^n follows from (S4.23) and (S4.2). The upper bound on μh^n follows from Proposition 5(iii).
∎
We now study bounds for σh^n, and their consequences for supr≥alogh^n(r).
Lemma S3**.**
Let a≥0 and let Q∈Qa. Suppose that there exists A>0 such that $\sup\bigl\{\frac{Q(D)}{\lambda_{1}(D)}\,:\,D\subseteq\mathbb{R}\ \textrm{compact, convex},\,Q(D)\geq 1/2\bigr\}\leq A$. Let h∗:=ha∗(Q), and ℓh∗:=∫a∞logh∗dQ. Then there exists a universal constant Cσ>0 such that
Since h∗ is upper semi-continuous, there exists r0≥a such that logh∗(r0)≥ℓh∗. By Lemma S47(ii), we then have that σh∗≤C′e−ℓh∗ for some universal constant C′>0.
To provide the lower bound for Aσh∗, we first prove (S4.25). To that end, let M:=supr≥alogh∗(r) and let t:=eM/(24A). If t<1, then the claim immediately follows, so let us thus assume t≥1. Define DM−t:={r∈[a,∞):logh∗(r)≥M−t} and suppose first that Q(DM−t)<1/2. Then,
[TABLE]
where the final inequality follows because the univariate function g(x)=x−ex is concave and thus g(x)≤g(x0)+g′(x0)(x−x0) for any x,x0∈R; we take x and x0 to be M−log(25A) and log2 respectively. Thus,
[TABLE]
If on the other hand, Q(DM−t)≥1/2, then we may use Dümbgen et al. (2011, Lemma 4.1) and the assumption that t≥1 to obtain
[TABLE]
and thus (S4.26) also follows. Therefore (S4.25) holds in all cases. By Lemma S47(i), there exists a universal constant C′′>0 such that Aσh∗≥C′′min(1,eℓh∗−logA), as desired.
∎
Lemma S4**.**
Let h be a density on R and suppose ∥h∥esssup<∞. Let Z1,…,Zn∼iidh with empirical distribution Qn. We have that
[TABLE]
Proof.
We may assume that n≥8 since the bound is trivially true otherwise, and we also assume that Z1,…,Zn are distinct (an event of probability 1). Let H and Hn denote the distribution functions of h and Qn respectively. Define the event E:={∥Hn−H∥∞≤1/16} and observe that P(Ec)≤2e−n/128 by the Dvoretzky–Kiefer–Wolfowitz inequality. On the event E, for any a,b∈R with a<b and Qn([a,b])≥1/2, we have that
[TABLE]
Thus, we have that ∥h∥esssup∣b−a∣≥1/4 and hence
[TABLE]
as desired.
∎
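For reference, the probability bound on Ec in the proof above can be read off from the Dvoretzky–Kiefer–Wolfowitz inequality with its sharp constant, evaluated at ϵ=1/16. In the display below, H_n denotes the distribution function of Qn (a notational assumption made here for readability):

```latex
\mathbb{P}\bigl(\|H_n - H\|_\infty > \epsilon\bigr) \le 2e^{-2n\epsilon^2}
\quad\text{for all } \epsilon > 0,
\qquad\text{so}\qquad
\mathbb{P}(E^c) \le 2\exp\bigl\{-2n(1/16)^2\bigr\} = 2e^{-n/128}.
```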
We are now in a position to argue that, with high probability, h^n belongs to a subclass of Ha with restricted first two moments. These moment restrictions are important for enabling us to obtain the bracketing entropy bounds that drive the rates of convergence of the K-homothetic log-concave MLE. For Cμ,Cσ>0, a0≥0 and h0∈Ha0, let
[TABLE]
Lemma S5**.**
Let a0≥0, fix a density h0∈Ha0 with σh0=1, and suppose that Z1,…,Zn∼iidh0, with empirical distribution Qn. Writing h^n:=ha0∗(Qn), there exist universal constants Cμ,Cσ,C>0 such that
We may assume that n≥500. Let E:={∣Zˉ−μh0∣≤1}, so that P(Ec)≤1/n, by Chebychev’s inequality. On the event E, we have μh^n≤Zˉ≤μh0+1 by Lemma S2. Recall the definition of s0 from Lemma S2. If s0≤1, then by Lemma S2, on the event E,
[TABLE]
where the middle inequality uses the fact that ρ(s0)≥1 (Lemma S49(iii)). If s0>1, then by Lemma S2, on the event E,
[TABLE]
Hence, if ρ(s0)/s0<2−7, then by Lemma S6, we have μh^n−μh0≥−2−212≥−213. On the other hand, if ρ(s0)/s0≥2−7, then by Lemma S2,
[TABLE]
It follows that there exists a universal constant Cμ>0 such that
[TABLE]
To bound σh^n, define the event $E^{\prime}:=\bigl\{\int_{a_{0}}^{\infty}\log\hat{h}_{n}\,d\mathbb{Q}_{n}\geq-3\bigr\}$. By Lemma S48 and Bobkov and Madiman (2011, Theorem 1.1), for n≥500,
[TABLE]
Define the event $E^{\prime\prime}:=\bigl\{\sup\{\frac{\mathbb{Q}_{n}(C)}{\lambda_{1}(C)}\,:\,C\ \textrm{compact, convex},\,\mathbb{Q}_{n}(C)\geq 1/2\}\leq 2\|h_{0}\|_{\infty}\bigr\}$. By Lemma S4, we have that P(E′′)≥1−2e−n/128. On the event E′∩E′′, by Lemmas S3 and S47, there exists a universal constant Cσ≥1 such that Cσ−1≤σh^n≤Cσ. The desired result follows from a union bound.
∎
The mean μh of any h∈Ha is constrained because h(r)=pλp(K)rp−1eϕ(r) for r≥a and some decreasing function ϕ. The next lemma formalises this notion.
Lemma S6**.**
Let a≥0 and let h∈Ha. Then, writing s∗:=sup{s∈(0,∞):sρa/σh,p(s)≥2−7}, we have
[TABLE]
Remark:
Even though we cannot obtain an analytic expression for ρ(s∗), we can apply the bounds developed in Lemma S49 to control μh. For example, since ρ(s)≤p for any a≥0 by Lemma S49(iii), we have that μh−a≲σhp. When a=0, this bound is sharp up to the universal constant, because taking eϕ(r)=pb−p\mathbbm1{r∈[0,b]}, where b=(p+1)(p+2)1/2/p1/2, yields μh=p1/2(p+2)1/2 and σh=1.
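The sharpness example in the remark can be checked numerically. The sketch below assumes the example density is h(r)=pb−prp−1 on [0,b] (that is, eϕ constant on [0,b] and normalised so that h integrates to one, which is our reading of the remark), and verifies by midpoint quadrature that b=(p+1)(p+2)1/2/p1/2 gives μh=p1/2(p+2)1/2 and σh=1. The function name `radial_moments` is hypothetical.

```python
import math

def radial_moments(p, m=50000):
    """Mass, mean and standard deviation of h(r) = p b^{-p} r^{p-1} on [0, b] (midpoint rule).

    b is the value quoted in the remark; the form of h is an assumption stated in the lead-in.
    """
    b = (p + 1) * math.sqrt(p + 2) / math.sqrt(p)
    step = b / m
    mass = mean = second = 0.0
    for k in range(m):
        r = (k + 0.5) * step
        dens = p * (r / b) ** (p - 1) / b  # equals p * b**(-p) * r**(p-1)
        mass += dens * step
        mean += r * dens * step
        second += r * r * dens * step
    return mass, mean, math.sqrt(second - mean ** 2)

results = {p: radial_moments(p) for p in (1, 2, 5, 10)}
for p, (mass, mean, sd) in results.items():
    print(p, mass, mean, math.sqrt(p * (p + 2)), sd)
```

In each case the mean grows like p while the standard deviation stays at one, which is the behaviour the remark uses to argue that μh−a≲σhp cannot be improved beyond the constant when a=0.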
Proof.
We initially assume that σh=1 and write ρ(⋅)=ρa,p(⋅). If h∈Ha, then h(r)=0 for r∈(−∞,a) and thus, μh≥a. For the upper bound on μh, we first observe that h(μh)≥2−7 by Lemma S47. Since ρ(s)/s is decreasing by Lemma S49(ii) and 1≤ρ(s)≤p by Lemma S49(iii), we have that s∗∈(0,∞). Suppose for a contradiction that μh>a+212ρ(s∗). Then, since ρ is increasing by Lemma S49(i), we have ρ(s)/(μh−a)<2−12 for all s≤s∗. Moreover, by definition of s∗, we have ρ(s)/s<2−7 for all s>s∗. Hence $\sup_{s\in(0,\infty)}\min\bigl(32\frac{\rho(s)}{\mu_{h}-a},\frac{\rho(s)}{s}\bigr)<2^{-7}$. But then Lemma S7 establishes a contradiction, so μh≤a+212ρ(s∗), as desired. For general σh∈(0,∞), we apply the above argument to g(⋅):=σhh(σh⋅), which satisfies σg=1 and g∈Ha/σh.
∎
For any a≥0, h∈Ha and any s∈(0,∞),
For any a≥0, h∈Ha,
[TABLE]
Proof.
Let us fix h∈Ha and define h~(s):=h(a+s). Observe that h~ is a density of the form h~(s)=pλp(K)(a+s)p−1eϕ(s) for some ϕ∈Φ0. We will show that h~(s′)≤ρ(s′)/s′ for all s′∈(0,∞) and if s0∈[0,∞) is such that h~(s0)=sups∈[0,∞)h~(s), then h~(s0)≤32ρ(s0)/μh~ (such an s0 exists by the upper semi-continuity of ϕ). The lemma then follows since μh~=μh−a.
To this end, fix s′∈(0,∞) and define h~s′(s):=pλp(K)α(a+s)p−1\mathbbm1{s∈[0,s′]} where α−1:=pλp(K)∫0s′(a+s)p−1ds=λp(K){(a+s′)p−ap}. Then
[TABLE]
Hence h~(s′)≤h~s′(s′)=pλp(K)α(a+s′)p−1=ρ(s′)/s′, as desired.
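The identity ρ(s′)/s′=pλp(K)α(a+s′)p−1 with α−1=λp(K){(a+s′)p−ap} pins down the function ρ explicitly as ρ(s)=ps(a+s)p−1/{(a+s)p−ap}. The defining display for ρ is not reproduced in this section, so the formula below is inferred from the proof above and should be treated as an assumption. With that caveat, the sketch checks on a grid the properties attributed to Lemma S49: ρ is increasing, ρ(s)/s is decreasing, and 1≤ρ(s)≤p.

```python
def rho(s, a, p):
    """rho_{a,p}(s) = p s (a+s)^(p-1) / {(a+s)^p - a^p}, as inferred from the proof above."""
    return p * s * (a + s) ** (p - 1) / ((a + s) ** p - a ** p)

def check_rho(a, p, grid, tol=1e-12):
    """Check the three monotonicity/boundedness properties on an increasing grid of s values."""
    vals = [rho(s, a, p) for s in grid]
    ratios = [v / s for v, s in zip(vals, grid)]
    return {
        "bounded": all(1.0 - tol <= v <= p + tol for v in vals),
        "increasing": all(x <= y + tol for x, y in zip(vals, vals[1:])),
        "ratio_decreasing": all(x + tol >= y for x, y in zip(ratios, ratios[1:])),
    }

grid = [0.05 * k for k in range(1, 401)]  # s in (0, 20]
report = {(a, p): check_rho(a, p, grid) for a in (0.0, 0.5, 3.0) for p in (1, 2, 5, 20)}
for key, flags in report.items():
    print(key, flags)
```

Note the two extreme regimes: when a=0 the formula collapses to ρ≡p, and when p=1 it collapses to ρ≡1, which matches the range 1≤ρ(s)≤p.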
To prove the second claim, fix s0∈[0,∞) such that h~(s0)=sups∈[0,∞)h~(s). Observe that if s0≥μh~/32, then ρ(s0)/s0≤32ρ(s0)/μh~ and the lemma follows. We may therefore assume that s0∈[0,μh~/32), define M:=logh~(s0) and fix m∈(−∞,M−2]. For t∈[m,M], let Dt:={s∈[0,∞):logh~(s)≥t}. Since h~ is itself a log-concave density, we have that, for any t∈[m,M] and s∈Dm,
[TABLE]
Hence
[TABLE]
Using Fubini’s theorem as in Dümbgen et al. (2011, Lemma 4.1), we can now compute
[TABLE]
Since Dm is an interval containing s0, we conclude that logh~(s)≤m whenever ∣s−s0∣>2(M−m)e−M. Thus, for ∣s−s0∣>4e−M, we have
[TABLE]
Assume for the sake of contradiction that s0−4e−M>0. Then
[TABLE]
This is a contradiction since we assumed that μh~≥32s0. Thus s0−4e−M≤0. We deduce that
[TABLE]
Since s0≤μh~/32, we obtain eM≤13ρ(s0)/μh~ as desired.
∎
The next lemma is a very slight generalisation of (Kim and Samworth, 2016, Theorem 4) and can be proved in the same manner, with minor modifications to handle the general mean and variance perturbation. The proof is omitted for brevity.
Fix Cμ>0,Cσ≥1 and h0∈Ha. There exists C>0, depending only on Cμ,Cσ, such that for every ϵ>0,
[TABLE]
Lemma S9**.**
Let a≥0 and let Z1,…,Zn∼iidh0∈Ha with σh0=1 and empirical distribution Qn. Let h^n:=h∗(Qn). Then for n≥8 and t≥8,
[TABLE]
Proof.
Let Qn denote the empirical distribution of Z1,…,Zn and define the event
[TABLE]
Observe that E1 occurs only if, for every closed interval D of length at most 25n−t/4, it holds that Qn(D)<1/2. Since ∥h0∥∞≤1 by Feng et al. (2018, Proposition S2(iii)), we have that for n,t≥8,
[TABLE]
Now let $E_{2}:=\bigl\{\int_{a}^{\infty}\log h_{0}\,d\mathbb{Q}_{n}>-(t/2)\log n+4\log 2-2\bigr\}$, so by Lemma S48 and Bobkov and Madiman (2011, Theorem 1.1),
[TABLE]
Applying Lemma S3 with h∗=h^n, A=nt/4/24 and observing that ℓh∗≥∫a∞logh0dQn≥−(t/2)logn+4log2−2 on E1∩E2, we find that on this event, for n,t≥8,
[TABLE]
as desired.
∎
Lemma S10**.**
Let a≥0, let h0∈Ha with σh0=1, and suppose that Z1,…,Zn∼iidh0. Writing Z(1):=miniZi and Z(n):=maxiZi, there exists a universal constant C>0 such that for t≥4,
[TABLE]
Proof.
This result follows from the proof of Kim et al. (2018b, Lemma 2).
∎
S4.2.1 Local bracketing entropy results
For h0∈Ha and δ>0, let
[TABLE]
Recall from the introduction that Fp denotes the class of all upper semi-continuous, log-concave densities on Rp. For f0∈F1, we define
[TABLE]
where we adopt the convention that 0/0:=0.
Lemma S11**.**
Fix a≥0 and let h0∈Ha. Assume that ν:=inf{dH(h0,h):h∈Ha(1),h0≪h}<2−9. Then there exists a universal constant C>0 such that, for all δ∈(0,2−9−ν) and all ϵ>0,
[TABLE]
Proof.
Fix h1∈Ha(1) with h0≪h1 and let ν1:=dH(h0,h1). Then, by the triangle inequality,
[TABLE]
Since h1(r)=pλp(K)rp−1eϕ1(r) where ϕ1∈Φ(1), we have that log(h/h1) is concave for any h∈H(h1,δ+ν1). We therefore have
[TABLE]
where the right-hand side is defined in (S4.28). It then follows from this and Lemma S13 that
[TABLE]
Since the choice of h1∈Ha(1) with h0≪h1 was arbitrary, the bound continues to hold when ν1 is replaced with ν.
∎
Recall that F1 denotes the set of all upper semi-continuous log-concave densities on R.
Lemma S12**.**
Let f,g∈F1. Then there exist universal constants Cμ′>0,Cσ′≥1 such that if dH(f,g)≤2−7, then
[TABLE]
Proof.
Since the Hellinger distance is affine invariant, we may assume without loss of generality that μf=0 and σf=1. By Lovász and Vempala (2007, Theorem 5.14(a) and (d)), f(x)≥2−8 for x∈[−1/9,1/9]. We claim that g(x)≥2−12 for some x∈[−1/9,1/9]. To see this, suppose for a contradiction that g(x)<2−10 for all x∈[−1/9,1/9]. Then
[TABLE]
a contradiction. By Lemma S47, it follows that σg≤Cσ′ for some universal constant Cσ′>0. The lower bound on σg follows by symmetry.
Now assume without loss of generality that μg≥0. By the first part and Feng et al. (2018, Proposition S2(iii)), g(x)≤Cσ′e−μg/Cσ′+1 for all x≤0. It follows that if μg≥Cσ′(1+10log2+logCσ′), then
[TABLE]
a contradiction. The result follows.
∎
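The first step of the proof of Lemma S12 relies on the affine invariance of the Hellinger distance. The sketch below illustrates this numerically for two Gaussian densities, which are log-concave. It uses the unnormalised convention dH2(f,g)=∫(f1/2−g1/2)2 (an assumption, since the paper's normalisation is not restated in this section; invariance holds under either convention). The helper names and the specific parameters are hypothetical.

```python
import math

def hellinger_sq(f, g, lo, hi, m=20000):
    """Midpoint-rule approximation of int (sqrt f - sqrt g)^2 over [lo, hi]."""
    h = (hi - lo) / m
    total = 0.0
    for k in range(m):
        x = lo + (k + 0.5) * h
        total += (math.sqrt(f(x)) - math.sqrt(g(x))) ** 2
    return total * h

def normal_pdf(mu, sigma):
    return lambda x: math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def push_forward(f, a, b):
    """Density of a*X + b when X has density f, for a > 0."""
    return lambda y: f((y - b) / a) / a

f, g = normal_pdf(0.0, 1.0), normal_pdf(0.5, 1.5)
a, b = 3.0, -2.0
d_before = hellinger_sq(f, g, -20.0, 20.0)
d_after = hellinger_sq(push_forward(f, a, b), push_forward(g, a, b), a * -20.0 + b, a * 20.0 + b)

# Closed form for two Gaussians via the Bhattacharyya coefficient.
bc = math.sqrt(2.0 * 1.0 * 1.5 / (1.0 + 1.5 ** 2)) * math.exp(-0.5 ** 2 / (4.0 * (1.0 + 1.5 ** 2)))
closed_form = 2.0 * (1.0 - bc)
print(d_before, d_after, closed_form)
```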
We now prove a general result on the local bracketing entropy of log-concave densities.
Lemma S13**.**
Let δ∈(0,2−8] and f0∈F1. Then there exists a universal constant C>0 such that, for every ϵ>0,
[TABLE]
Proof.
In this proof, we let C>0 be a generic universal constant whose value may vary from instance to instance. For a set S⊆R, we define F~(f0,δ,S):={f∣S:f∈F~(f0,δ)} and abuse notation slightly to define
H[](ϵ,F~(f0,δ),dH,S):=H[](ϵ,F~(f0,δ,S),dH,S). We note that, for any ϵ1,ϵ2>0 and disjoint Borel measurable sets S1,S2⊆R,
[TABLE]
Since dH is location and scale equivariant, we assume without loss of generality that μf0=0 and σf0=1. Define aL:=inf{r∈R:f0(r)≥δ2} and aR:=sup{r∈R:f0(r)≥δ2}; it holds by the fact that δ≤2−9 and Lemma S16 that aL≤−1/9 and aR≥1/9. By Lemma S12 and Feng et al. (2018, Proposition S2(iii)), there exist α>0 and β∈R such that for any f∈F~(f0,δ) and x∈R, we have f(x)≤e−α∣x∣+β. Let bL:=−α−1log(eβ/δ4) and bR:=α−1log(eβ/δ4). Then, for any f∈F~(f0,δ), f(r)<δ4 for r∈(−∞,bL)∪(bR,∞), and [aL,aR]⊆[bL,bR].
First, we will bracket the region [bL,aL)∪(aR,bR]. To this end, fix ϵ>0, and let KL:=min{k∈N:e−α(k−aL)+β≤δ4} and KR:=min{k∈N:e−α(k+aR)+β≤δ4}, so that max(KL,KR)≲log(1/δ). By these definitions, aL−KL≤bL and aR+KR≥bR. We segment [bL,aL) into subintervals Sk for k=1,…,KL, where
[TABLE]
Define ϵˉ:=ϵ/(4KL1/2). For any r∈[bL,aL), we have that f0(r)≤δ2 because r<aL and, moreover, e−α∣r∣+β≥δ4 because r≥bL. Now, by Lemma S15, f(r)≤f0(r)eCδlogδ1≲f0(r) for any f∈F~(f0,δ) and r∈[bL,aL). Hence, by Lemma S18,
[TABLE]
By symmetry, we obtain the same bound for $H_{[]}\bigl(\epsilon/4,\tilde{\mathcal{F}}(f_{0},\delta),d_{\mathrm{H}},[a_{R},b_{R}]\bigr)$.
Now we bracket the region (−∞,bL)∪(bR,∞). For k∈N, define Sk:=[bL−k,bL−(k−1)) and set ϵk:=Cϵe−α(k−bL)/4 where C>0 is a constant chosen such that ∑k=1∞ϵk2≤ϵ2/16. Then, by Lemma S18 again,
[TABLE]
The same bound holds for $H_{[]}\bigl(\epsilon/4,\tilde{\mathcal{F}}(f_{0},\delta),d_{\mathrm{H}},[b_{R},\infty)\bigr)$.
Next, we bracket the region [aL,−1/16]. To this end, we write s0:=aL and partition [s0,−1/16] into segments [s0,s1), [s1,s2),…,[sJ−2,sJ−1), [sJ−1,sJ] (where sJ:=−1/16) as follows:
1. Choose s1>s0 such that ∫s0s1f0(t)dt=4δ2.
2. For each j≥2, if there exists t0<−1/16 such that ∫sj−1t0f0(t)dt≥2∫−∞sj−1f0(t)dt, then choose sj such that ∫sj−1sjf0(t)dt=2∫−∞sj−1f0(t)dt. Otherwise, set J:=j and choose sJ=−1/16.
Define ϕ0:=logf0 and write Rangej(ϕ0):=supt∈[sj−1,sj]ϕ0(t)−inft∈[sj−1,sj]ϕ0(t). We make the following six claims:
(1) s1<−1/16;
(2) (s1−s0)supt∈[s0,s1]f0(t)≲δ2log(1/δ);
(3) ∫sJ∞f0(t)dt>2−11;
(4) $(s_{j}-s_{j-1})\sup_{t\in[s_{j-1},s_{j}]}f_{0}(t)\lesssim\bigl\{\mathrm{Range}_{j}(\phi_{0})+\log 2\bigr\}\int_{-\infty}^{s_{j-1}}f_{0}(t)\,dt$ for j=2,…,J−1;
(5) ∫−∞sJf0(t)dt≥2−13;
(6) J≲log(1/δ).
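The construction of the breakpoints s0<s1<…<sJ can be sketched in code. The version below takes f0 to be the standard Gaussian density purely as a convenient stand-in for a mean-zero, unit-variance log-concave density (an assumption made for illustration), so that s0=aL is available in closed form and the mass conditions reduce to statements about the distribution function F: step 1 reads F(s1)=F(s0)+4δ2 and step 2 reads F(sj)=3F(sj−1). The function name `build_partition` is hypothetical.

```python
import math
from statistics import NormalDist

def build_partition(delta):
    """Breakpoints s_0 < s_1 < ... < s_J = -1/16 for a standard Gaussian f0 (illustrative)."""
    f0 = NormalDist()
    end = -1.0 / 16.0
    # s_0 = a_L = inf{r : f0(r) >= delta^2}; for the Gaussian this solves phi(r) = delta^2.
    s = [-math.sqrt(-2.0 * math.log(delta ** 2 * math.sqrt(2.0 * math.pi)))]
    # Step 1: the first segment carries mass exactly 4 delta^2.
    s.append(f0.inv_cdf(f0.cdf(s[0]) + 4.0 * delta ** 2))
    # Step 2: each further segment carries twice the mass lying to the left of its start,
    # as long as such a segment fits strictly to the left of -1/16.
    while 3.0 * f0.cdf(s[-1]) < f0.cdf(end):
        s.append(f0.inv_cdf(3.0 * f0.cdf(s[-1])))
    s.append(end)  # final breakpoint s_J := -1/16
    return s

partitions = {k: build_partition(2.0 ** -k) for k in (8, 10, 12)}
for k, s in partitions.items():
    print(k, len(s) - 1, [round(x, 3) for x in s])
```

Because the mass to the left of the current breakpoint triples at every application of step 2, the number of segments grows only logarithmically in 1/δ, which is the mechanism behind claim (6).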
To verify claim (1), observe by Lovász and Vempala (2007, Theorem 5.14(a) and (d)) that f0(t)≥2−8 for all t∈[−1/9,1/9]. Hence ∫s0−1/16f0(t)dt≥(1/9−1/16)2−8>4δ2, so s1<−1/16.
For claim (2), note that −2log(1/δ)≤ϕ0(t)≤0 for t∈[s0,s1] by Feng et al. (2018, Proposition S2(iii)). Thus by the second part of Lemma S17 and the definition of s1, we have
[TABLE]
For claim (3), we have ∫sJ∞f0(t)dt≥∫−1/161/9f0(t)dt≥2−8(1/9+1/16)>2−11.
For claim (4), observe that 2∫−∞sj−1f0(t)dt=∫sj−1sjf0(t)dt for j=2,…,J. Hence
Now set ϵ~:=ϵ/(2J1/2). Then, by Lemma S18, claim (2), and Lemma S15,
[TABLE]
Now let j∈{2,…,J}, and observe by claim (1) that s1,…,sJ are strictly increasing. Let $\check{\mathcal{F}}(f_{0},\delta):=\bigl\{e^{\phi-\phi_{0}}:e^{\phi}\in\tilde{\mathcal{F}}(f_{0},\delta)\bigr\}$. Let {(ψˇℓL,ψˇℓU):ℓ∈[N]} be an ϵ~-Hellinger bracketing set for Fˇ(f0,δ) with $\log N=H_{[]}\bigl(\tilde{\epsilon},\check{\mathcal{F}}(f_{0},\delta),d_{\mathrm{H}},[s_{j-1},s_{j}]\bigr)$; we define {(ψ~ℓL,ψ~ℓU):ℓ=1,…,N} by ψ~ℓL:=f0ψˇℓL and ψ~ℓU:=f0ψˇℓU. Then
[TABLE]
Moreover, if ψˇℓL≤eϕ−ϕ0≤ψˇℓU, then ψ~ℓL≤eϕ≤ψ~ℓU. We deduce that $\bigl\{(\tilde{\psi}_{\ell}^{L},\tilde{\psi}_{\ell}^{U}):\ell\in[N]\bigr\}$ form an ϵ~supt∈[sj−1,sj]f0(t)1/2-Hellinger bracketing set for F~(f0,δ,[sj−1,sj]).
Now, on [sj−1,sj], the conditions of Lemma S14 are fulfilled with r=sj because ∫−∞sj−1f0(t)dt≥∫s0s1f0(t)dt=4δ2 and ∫sj∞f0(t)dt>2−11≥4δ2 by claim (3). Thus, we may combine Lemma S19 with Lemmas S14 and S15 with claim (4) to obtain
[TABLE]
where the final bound follows because sj−12∫−∞sj−1f0(t)dt≤1 by Markov’s inequality.
By symmetry, we obtain the same bracketing entropy bound over [1/16,aR]. For the region [−1/16,1/16], since ∫−∞−1/16f0(t)dt≥2−13 and ∫1/16∞f0(t)dt≥2−13 by claim (5), we may argue as in (S4.2.1) to obtain
[TABLE]
Now, since ϕ0 is unimodal and 0≥ϕ0≥−2log(1/δ) on [aL,aR], it holds that
Let eϕ∈F~(eϕ0,δ) for some concave ϕ0:R→[−∞,∞) and δ∈(0,∞) and let r∈R. If ∫−∞reϕ0(t)dt∧∫r∞eϕ0(t)dt≥4δ2, then
[TABLE]
Proof.
As a shorthand, let us write ψ:=(ϕ−ϕ0)/2. Assume without loss of generality that ψ(r)<0 (because otherwise the result is immediate). Since ψ is concave, either ψ(t)≤ψ(r) for all t∈(−∞,r) or for all t∈(r,∞). In the former case,
[TABLE]
Hence
[TABLE]
On the other hand, if ψ(t)≤ψ(r) for all t∈(r,∞), we can apply an almost identical argument to see that
[TABLE]
as required.
∎
Lemma S15**.**
Let f0=eϕ0∈F1 with μf0=0 and σf0=1, and let eϕ∈F~(f0,δ) for some δ∈(0,2−8]. Then there exists a universal constant C>0 such that for any r∈R,
[TABLE]
Proof.
Again, we write ψ:=(ϕ−ϕ0)/2. Since we seek an upper bound for ψ, we may assume without loss of generality that ψ is upper semi-continuous, and by symmetry, it suffices to prove the bound at a fixed r≥0. Further, we assume without loss of generality that ψ(r)>0 (because otherwise the result is immediate).
Let r0:=r+1. Define $S:=\bigl\{t\in[-r_{0},r_{0}]\,:\,\psi(t)\geq\frac{\psi(r)}{2^{16}r_{0}}\bigr\}$, sL:=infS, and sR:=supS. We note that r∈S since 216r0≥2. Then, since ex−1≥x for any x≥0, we have
[TABLE]
Now, define $S^{\prime}:=\bigl\{t\in[-r_{0},r_{0}]\,:\,\psi(t)\geq-\frac{\psi(r)}{2^{16}r_{0}}\bigr\}$, sL′:=infS′, and sR′:=supS′. Then
Since T1+T2+T3=∫−r0r0eϕ0(t)dt, we have that T1+T2+T3≥2−12 by Lemma S16.
We claim that T2≤2−13. To see this, note that by concavity of ψ (cf. the proof of Theorem 1 of Cule et al. (2010)),
[TABLE]
By Feng et al. (2018, Proposition S2(iii)), supt∈Reϕ0(t)≤1, hence,
[TABLE]
It follows that either T1≥2−14 or T3≥2−14. If T1≥2−14, then the result follows from (S4.37). On the other hand, if T3≥2−14, then we obtain the desired result from the fact that δ≤2−8, that −log(1−x)≤2x for all x∈[0,1/2], and (S4.38).
∎
Lemma S16**.**
Let f∈F1 be such that μf=0 and σf=1. We have that aL:=inf{r:f(r)≥2−8}≤−1/9, aR:=sup{r:f(r)≥2−8}≥1/9, and
[TABLE]
Proof.
By Lovász and Vempala (2007, Theorem 5.14(a) and 5.14(d)), 2−7≤f(0)≤24 and f(r)≥2−8 for all r∈[−1/9,1/9]. We immediately obtain that aL≤−1/9 and aR≥1/9. Moreover,
[TABLE]
as required.
∎
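The two pointwise bounds quoted from Lovász and Vempala (2007) in the proof of Lemma S16 can be sanity-checked on a few familiar log-concave densities with mean 0 and variance 1. The four densities below are illustrative choices, not taken from the paper, and the check is evidence on examples only, not a proof.

```python
import math

SQRT2, SQRT3 = math.sqrt(2.0), math.sqrt(3.0)

# Log-concave densities standardised to mean 0 and variance 1 (illustrative choices).
densities = {
    "normal": lambda x: math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi),
    "laplace": lambda x: math.exp(-SQRT2 * abs(x)) / SQRT2,
    "uniform": lambda x: 1.0 / (2.0 * SQRT3) if abs(x) <= SQRT3 else 0.0,
    "shifted_exponential": lambda x: math.exp(-(x + 1.0)) if x >= -1.0 else 0.0,
}

grid = [-1.0 / 9.0 + (2.0 / 9.0) * k / 200 for k in range(201)]  # points in [-1/9, 1/9]
checks = {}
for name, f in densities.items():
    value_at_zero_ok = 2.0 ** -7 <= f(0.0) <= 2.0 ** 4
    lower_bound_ok = min(f(x) for x in grid) >= 2.0 ** -8
    checks[name] = (value_at_zero_ok, lower_bound_ok)
    print(name, f(0.0), min(f(x) for x in grid))
```

In all four cases the densities sit well inside the stated range, which reflects that the constants 2−8, 2−7 and 24 are chosen for universality rather than tightness.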
Lemma S17**.**
Let ϕ0:R→[−∞,∞) be a concave function, let a,b∈R where a<b, and let
[TABLE]
Then for any t∗∈[a,b], it holds that
[TABLE]
Moreover, if ϕ0(t∗)≤max{ϕ0(a),ϕ0(b)}+τ for some τ≥log2, then
[TABLE]
Proof.
Let us first suppose that t∗>a. We have ϕ0(t)≥sϕ0(a)+(1−s)ϕ0(t∗) for t∈[a,t∗], where s:=(t∗−t)/(t∗−a). Hence
[TABLE]
We can bound ∫t∗beϕ0(t)dt when t∗<b by a similar argument to yield the first conclusion.
For the second part, observe that q is strictly decreasing, so from (S4.40),
[TABLE]
for τ≥log2. We may bound ∫t∗beϕ0(t)dt by a similar argument to obtain the desired conclusion.
∎
The following two lemmas are from Kim et al. (2018a), though the first is only a minor restatement of Doss and Wellner (2016, Theorem 4.1). For a<b and −∞≤B1<B2<∞, we define F~([a,b],B1,B2) to be the set of log-concave functions f:[a,b]→[eB1,eB2].
We first describe the common setting for all the lemmas in this subsection and define the notation used throughout. We fix K,K^∈K and μ,μ^∈Rp. Let f0∈FpK,μ be of the form f0(x)=eϕ0(∥x−μ∥K) for some ϕ0∈Φ and let X1,…,Xn∼iidf0. We assume in this subsection that
[TABLE]
This assumption can be made without loss of generality as shown in the proof of Proposition 10. Let h0 be the density on [0,∞) of the form h0(r)=pλp(K)rp−1eϕ0(r). Define fˇn:Rp→[0,∞) by fˇn(x):=eϕ0(∥x−μ^∥K^). We note that fˇn is not necessarily a density. Let
[TABLE]
We also define a deformation ϕ~0∈Φ of ϕ0 by
[TABLE]
and write h~0(r):=γ−1pλp(K^)rp−1eϕ~0(r) where γ:=pλp(K^)∫0∞rp−1eϕ~0(r)dr so that h~0 is a density.
For i∈[n], let Zi:=∥Xi−μ∥K and Z~i:=∥Xi−μ^∥K^. Then Zi has density h0 and, by Lemma S24 below, Z~1 has a density which we denote by h~n. We let Q0 and Q~n denote the probability distributions induced by h0 and h~n respectively and let Q~n denote the empirical distribution corresponding to Z~1,…,Z~n, so that h^n:=h∗(Q~n). We write h^n(r)=pλp(K^)rp−1eϕ^n(r) for some ϕ^n∈Φ, and set f^n:=eϕ^n. Similarly to (S4.27) we write $\mathcal{H}(h_{0},C_{\mu},C_{\sigma}):=\bigl\{h\in\mathcal{H}\,:\,|\mu_{h}-\mu_{h_{0}}|\leq C_{\mu}\sigma_{h_{0}},\,C_{\sigma}^{-1}\leq\sigma_{h}/\sigma_{h_{0}}\leq C_{\sigma}\bigr\}$.
Lemma S20**.**
There exist universal constants c1,c2,C,Cμ>0 and Cσ>1 such that if ∥μ^−μ∥Kplog(ep)≤c1 and p(τ∗−τ∗)≤c2, then
As a preliminary step, we first claim that there exist universal constants c1,c2,Cσ′>0 such that if ∥μ^−μ∥Kplog(ep)≤c1 and p(τ∗−τ∗)≤c2, then the following statements hold simultaneously:
[TABLE]
Provided we choose universal constants c1,c2>0 sufficiently small, claim (a) follows from Lemma S28, while claim (d) follows from the second claim of Lemma S26 and Lemma S12. For claim (b), observe that since μh0≲pσh0 (which holds by Lemma S6), we have that σh02≳(σh02+μh02)(1+p2)−1≳p−1. Thus, by Lemma S22, we may reduce the values of c1,c2 if necessary to obtain
[TABLE]
For claim (c), observe that $\textrm{Var}(\tilde{Z}_{1}-Z_{1})\leq\mathbb{E}\bigl\{(\tilde{Z}_{1}-Z_{1})^{2}\bigr\}\leq\sigma_{h_{0}}^{2}/16$. We therefore have by Cauchy–Schwarz and claim (b) that,
[TABLE]
and
[TABLE]
This establishes claim (c).
Define $E_{1}:=\bigl\{|n^{-1}\sum_{i=1}^{n}\tilde{Z}_{i}-\mathbb{E}\tilde{Z}_{1}|\leq\sqrt{2}\sigma_{h_{0}}\bigr\}$; from claim (c) of (S4.47) and Chebychev’s inequality, it holds that P(E1c)≤1/n. By Lemma S2 (with a=0) and claim (b) of (S4.47), there exists a universal constant Cμ>0 such that on E1,
We now consider σh^n2. By Lemmas S23 and S47, and claims (a) and (d) of (S4.47), there exists a universal constant C′>0 such that ∥h~n∥esssup≤γ∥h~0∥∞≤C′σh0−1. Define the event
We now obtain a lower bound for ∫0∞logh^ndQ~n. Note that by Cover and Thomas (2006, Theorem 8.6.5) and by claim (c) of (S4.47), we have that
[TABLE]
Now write g(r):=σh0h~n(σh0r) for r≥0 as a shorthand and observe that
[TABLE]
By claim (d) of (S4.47) and Feng et al. (2018, Proposition S2(iii)), there exist universal constants Cσ′>1,μ′∈[0,∞) such that σh0h~0(σh0s)≤Cσ′e−Cσ′−1∣s−μ′∣+1 for all s∈[0,∞). Thus, by claim (a) of (S4.47), for any t∈R,
[TABLE]
Combining (S4.54) and (S4.55), we deduce that there exists a universal constant C′′>0 such that
we have by (S4.57) and Chebychev’s inequality that P(E3c)≤1/n. Let R0:={r∈[0,∞):h~n(r)≤γh~0(r)} and let E4 denote the event that Q~n({0})<1 and Q~n(R0)=1. It holds then by Lemma S23 that P(E4c)=0.
Moreover, on E3∩E4,
[TABLE]
for some universal constant C3>0. We conclude that on the event E2∩E3∩E4, by Lemma S3, there exists a universal constant Cσ≥1 such that
[TABLE]
A union bound yields the desired result.
∎
For a Borel measurable function g:[0,∞)→R, define
[TABLE]
Even though ρ1 is not a norm, we can define the ϵ-generalised bracketing entropy of a class G of Borel measurable, real-valued functions by treating it as a norm, and continue to denote this by H[](ϵ,G,ρ1).
Lemma S21**.**
Set
[TABLE]
There exist universal constants c1,c2,C>0 such that if ∥μ^−μ∥Kplog(ep)≤c1 and p(τ∗−τ∗)≤c2, then, for any t≥an1/2,
[TABLE]
where Cμ>0 and Cσ>1 are taken from Lemma S20. Moreover, if additionally f0(x)=e−a∥x−μ∥K+b for some a>0 and b∈R, and writing
Assume ∥μ^−μ∥Kplog(ep)≤c1 and p(τ∗−τ∗)≤c2 for universal constants c1,c2>0 chosen such that (a) h~0∈H(h0,Cμ,Cσ) where Cμ>0 and Cσ>1 are taken from Lemma S20 and (b) γ≤2. The existence of such a choice of c1,c2 is guaranteed by Lemma S12, Lemma S26, and Lemma S28.
We make two observations before proceeding with the main proof. First, for δ>0, define $\mathcal{H}(\tilde{h}_{0},\delta):=\bigl\{h\in\mathcal{H}(h_{0},C_{\mu},C_{\sigma}):d_{\mathrm{H}}(h,\tilde{h}_{0})\leq\delta\bigr\}$, and observe that h~0∈H(h~0,δ) for every δ>0. Second, we may assume that Q~n({0})=0 (since this is a probability 1 event), and thus, Q~n∈Q0 and h^n:=h∗(Q~n)=argmaxh∈H∫0∞loghdQ~n (recall the definition of H:=H0 in (4)). Therefore, since h~0∈H, we have by Proposition 4 and Lemma S23 that with probability 1,
By van de Geer (2000, Lemma 4.2), the fact that KL divergence is no smaller than the squared Hellinger distance, and (S4.60),
[TABLE]
Since H(h~0,δ) is nonempty for any δ>0, the bracketing entropy $H_{[]}\bigl(\epsilon,\mathcal{H}(\tilde{h}_{0},\delta),d_{\mathrm{H}}\bigr)$ is well-defined and non-negative for any ϵ>0. We may therefore define Ψ:[0,∞)→[0,∞) by $\Psi(\delta):=\delta\vee\int_{0}^{\delta}H_{[]}^{1/2}\bigl(\epsilon/2^{1/2},\mathcal{H}(\tilde{h}_{0},\delta),d_{\mathrm{H}}\bigr)\,d\epsilon$ and let $\delta_{n}:=\inf\bigl\{\delta\in[0,\infty):\frac{\sqrt{n}\delta^{2}}{\Psi(\delta)}\geq 2^{9}C_{0}\bigr\}$ for a universal constant C0>0 specified in van de Geer (2000, Theorem 5.11). By Lemma S8, it holds that δn2≲n−4/5. Moreover, if f0(⋅)=e−a∥⋅−μ∥K+b for some a>0 and b∈R, then, by Lemma S11, we have that δn2≲n−1log5/4(en)+dH2(h~0,h0).
For δ>0, define $\mathcal{G}(\delta):=\bigl\{\frac{1}{2}\log\frac{h+\tilde{h}_{0}}{2\tilde{h}_{0}}:h\in\mathcal{H}(\tilde{h}_{0},\delta)\bigr\}$. Let (hU,hL) be an element from the ϵ/2-Hellinger bracketing set of H(h~0,δ). Define $g^{U}:=\frac{1}{2}\log\frac{h^{U}+\tilde{h}_{0}}{2\tilde{h}_{0}}$ and $g^{L}:=\frac{1}{2}\log\frac{h^{L}+\tilde{h}_{0}}{2\tilde{h}_{0}}$. We have by van de Geer (2000, Lemmas 7.1 and 4.2), Lemma S23 and the fact that γ≤2 that
[TABLE]
Moreover, if h∈H(h~0,δ) and $g=\frac{1}{2}\log\frac{h+\tilde{h}_{0}}{2\tilde{h}_{0}}$, then a virtually identical calculation to (S4.3) shows that supg∈G(δ)ρ12(g)≤δ2 for every δ>0. We therefore conclude that
[TABLE]
Fix any $t>\bigl\{\delta^{2}_{n}+2^{7}d^{2}_{\mathrm{KL}}(\tilde{h}_{n},\tilde{h}_{0})\bigr\}^{1/2}$, where we note that this lower bound is finite by Lemma S23. For each s∈N∪{0}, define the events $\mathcal{A}_{s}:=\bigl\{2^{s}t<d_{\mathrm{H}}\bigl(\hat{h}_{n},\tilde{h}_{0}\bigr)\leq 2^{s+1}t\bigr\}$. Writing $\hat{g}_{n}:=\frac{1}{2}\log\frac{\hat{h}_{n}+\tilde{h}_{0}}{2\tilde{h}_{0}}$, we note that on As∩{h^n∈H(h0,Cμ,Cσ)}, by (S4.3),
Therefore, by (S4.66), (S4.67), and van de Geer (2000, Theorem 5.11), there exists a universal constant C>0 such that
[TABLE]
The lemma follows from (S4.61), (S4.3), (S4.68), and the fact that δn2≲n−4/5 in general and δn2≲n−1log5/4(en)+dH2(h~0,h0) when f0 has the form f0(⋅)=e−a∥⋅−μ∥K+b for some a>0 and b∈R.
∎
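The proof of Lemma S21 uses the fact that the KL divergence is no smaller than the squared Hellinger distance. The sketch below checks this inequality on randomly generated discrete distributions, used here as a stand-in for densities (an assumption made for simplicity). The squared Hellinger distance is taken without a factor 1/2, which is the more demanding of the two common conventions, so the inequality also holds under the other one. All function names are hypothetical.

```python
import math
import random

def kl(p, q):
    """Kullback-Leibler divergence sum_i p_i log(p_i / q_i) for strictly positive pmfs."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def hellinger_sq(p, q):
    """Squared Hellinger distance sum_i (sqrt p_i - sqrt q_i)^2 (no factor 1/2)."""
    return sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))

def random_pmf(rng, k):
    """A random strictly positive probability vector of length k."""
    w = [rng.random() + 1e-3 for _ in range(k)]
    total = sum(w)
    return [x / total for x in w]

rng = random.Random(0)  # fixed seed so the run is reproducible
pairs = [(random_pmf(rng, 6), random_pmf(rng, 6)) for _ in range(1000)]
gaps = [kl(p, q) - hellinger_sq(p, q) for p, q in pairs]
print(min(gaps))
```

The inequality follows from log x≤x−1 applied to x=(q/p)1/2, so the gap is non-negative for every pair and vanishes only when the two distributions coincide.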
Lemma S22**.**
Let x0,μ,μ^∈Rp and let K,K^∈K. Writing ξ:=μ^−μ, we have
[TABLE]
Proof.
By the subadditivity of ∥⋅∥K (Proposition 1(iv)), we have
[TABLE]
and
[TABLE]
This yields the bound with the first term in the minimum, and the bound with the second term follows analogously.
∎
Lemma S23**.**
For every x∈Rp, we have that ϕ~0(∥x−μ^∥K^)≥ϕ0(∥x−μ∥K). Moreover, for λ1-almost every r∈[0,∞), we have that
h~n(r)≤γh~0(r).
Proof.
Let x∈Rp. If ∥x−μ^∥K^≤∥μ^−μ∥K^, then, since ϕ0 is decreasing, ϕ~0(∥x−μ^∥K^)=ϕ0(0)≥ϕ0(∥x−μ∥K). On the other hand, if ∥x−μ^∥K^>∥μ^−μ∥K^, then
[TABLE]
Hence, $\tilde{\phi}_{0}(\|x-\hat{\mu}\|_{\hat{K}})=\phi_{0}\Bigl(\frac{\|x-\hat{\mu}\|_{\hat{K}}-\|\hat{\mu}-\mu\|_{\hat{K}}}{\tau^{*}}\Bigr)\geq\phi_{0}(\|x-\mu\|_{K})$. This proves the first claim of the lemma.
For 0<r1≤r2, we write BK^(μ^;r1,r2):={x∈Rp:∥x−μ^∥K^∈(r1,r2]}. Then, by the first claim of the lemma and Lemma S1, for any r∈[0,∞) and ϵ>0,
[TABLE]
We now take the limit as ϵ↘0 on both sides. On the left-hand side, we may apply Lemma S24 to conclude that the limit is h~n(r) for λ1-almost all r∈[0,∞). On the right-hand side, the limit is γh~0(r) whenever r is a continuity point of ϕ~0 (i.e. λ1-almost everywhere, since ϕ~0 is decreasing).
∎
Definition:
Let νp be a signed measure on (Rp,B(Rp)) and let K∈K. We refer to the signed measure ν1K on R defined by ν1K(E):=νp({x∈Rp:∥x∥K∈E}) for E∈B(R) as the K-contour measure of νp.
The following lemma, among other things, implies that if a random vector X has a density on Rp, then ∥X∥K has a density on [0,∞) as well.
Lemma S24**.**
Let νp be a signed measure on Rp with νp≪λp. Let K∈K and let ν1K be the K-contour measure of νp. Then ν1K≪λ1.
Proof.
Define g:Rp→[0,∞) by g(x):=∥x∥K. We first claim that for any Borel measurable E⊆[0,∞) such that λ1(E)=0, we have $\lambda_{p}\bigl(g^{-1}(E)\bigr)=0$.
Let E⊆[0,∞) be a Borel measurable set such that λ1(E)=0, let n∈N, and let En:=E∩[0,n]. Now let ϵ>0 be fixed. Since λ1(En)=0, there exist disjoint intervals {(bm−ϵm,bm]:m∈N} such that En∖{0}⊆∪m=1∞(bm−ϵm,bm] and $\lambda_{1}\bigl(\cup_{m=1}^{\infty}(b_{m}-\epsilon_{m},b_{m}]\bigr)\leq\epsilon$. Then by the mean value theorem, for any M∈N,
[TABLE]
We therefore deduce that
[TABLE]
Since ϵ>0 was arbitrary, $\lambda_{p}\bigl(g^{-1}(E_{n})\bigr)=0$, so $\lambda_{p}\bigl(g^{-1}(E)\bigr)=\lim_{n\rightarrow\infty}\lambda_{p}\bigl(g^{-1}(E_{n})\bigr)=0$, which establishes the claim.
Hence, if E is a Borel measurable subset of [0,∞) with λ1(E)=0, then since νp≪λp, we have $\nu_{1}^{K}(E)=\nu_{p}\bigl(g^{-1}(E)\bigr)=0$, as required.
∎
S4.3.1 Auxiliary lemmas for the worst case risk bounds of Section 4
We continue to use the setting and notation defined in Subsection S4.3.
Lemma S25**.**
If ∥μ^−μ∥Kplog(ep)≤1 and p(τ∗−τ∗)≤1/2, then
[TABLE]
Proof.
By Fubini’s theorem,
[TABLE]
For any μ′∈Rp and r∈[0,∞), define BK(μ′,r):={x:∥x−μ′∥K≤r} and define BK^(μ′,r) analogously. Define r0:(−∞,ϕ0(0)]→[0,∞) by r0(t):=sup{r≥0:ϕ0(r)≥t} for any t∈(−∞,ϕ0(0)]. Since r0(t)≥r if and only if ϕ0(r)≥t for any r∈[0,∞) and t∈(−∞,ϕ0(0)], we have
[TABLE]
Let us denote ξ:=μ^−μ and K+[0,ξ]:={x∈Rp:x=x′+αξ for x′∈K,α∈[0,1]} so that, for any r≥0, we have rK∪(rK+ξ)⊆rK+[0,ξ]. To upper bound the first term of (S4.71), we obtain from the translation invariance of Lebesgue measure, its corresponding scaling property and Proposition 1(iv) that
[TABLE]
To upper bound the second term of (S4.71), note that by the translation invariance of Lebesgue measure again,
Recall that μh0=pλp(K)∫0∞rpeϕ0(r)dr and σh02=pλp(K)∫0∞(r−μh0)2rp−1eϕ0(r)dr. We now make a few observations that are used repeatedly in the remainder of this proof:
(i)
By Feng et al. (2018), it holds that pλp(K)rp−1eϕ0(r)≤σh01e−σh01∣r−μh0∣+1 for all r∈[0,∞).
2. (ii)
By (S4.41) and Lemma S1, it holds that pλp(K)∫0∞rp+1eϕ0(r)dr=p. Thus, σh0≤1, by Bobkov (2003, Lemma 1).
3. (iii)
Since μh02+σh02=p, we have μh0∈[(p−1)1/2,p1/2].
4. (iv)
From (ii) and (iii) as well as Lemma S6, there exists a universal constant c∗∈(0,1/2] such that σh0≥c∗/p1/2.
Suppose first that p≥2. Define r′:=log(ep)c∗; we observe that r′≤1/2≤μh0/2 since c∗≤1/2. We also define ϕ∗:[0,∞)→R by
[TABLE]
We then have that
[TABLE]
Since c∗/(2σh0)≤p1/2/2≤(p−1)/μh0, we see that ϕ∗ is a decreasing function and, since ϕ0(0)≤ϕ0(r′)+r′μh0−r′ϕ0(r′)−ϕ0(μh0), we also have that ϕ∗(r)≥ϕ0(r) for all r∈[0,∞). We now claim that
[TABLE]
To see this, note that if ϕ∗(0)=ϕ∗(r′), then pλp(K)r′p−1eϕ∗(0)≤σh01e−2σh0c∗(μh0−r′)+1≲1 as required. On the other hand, suppose that ϕ∗(0)=ϕ0(r′)+r′μh0−r′ϕ0(r′)−ϕ0(μh0). Then by the proof of Lemma S47,
[TABLE]
and moreover,
[TABLE]
From the fact that r′log(1/r′)→0 as p→∞, we deduce that there exist universal constants C∗,C∗∗>0 such that
[TABLE]
Therefore, we also obtain in this case that
[TABLE]
Define r∗:(−∞,ϕ∗(0)]→[0,∞) by r∗(t):=sup{r≥0:ϕ∗(r)≥t}. It follows from the fact that ϕ∗(r)≥ϕ0(r) for all r∈[0,∞) that r0(t)≤r∗(t) for all t∈(−∞,ϕ0(0)]. Hence, by a change of variable and our assumption on ∥ξ∥Kplog(ep),
[TABLE]
where the final inequality follows from (S4.79). Now define r′′:=μh0/2 and note that
[TABLE]
Hence
[TABLE]
Returning to (S4.74), by a very similar argument, we also have that
and observe, as in the case when $p\geq 2$, that $\phi^{*}$ is decreasing, that $\phi^{*}(r)\geq\phi_{0}(r)$ for all $r\in[0,\infty)$, and that $(\phi^{*})'(r)=0$ for $r\in[0,\mu_{h_0})$ and $(\phi^{*})'(r)=-1/\sigma_{h_0}$ for $r\in(\mu_{h_0},\infty)$. By applying the same argument as for the case where $p\geq 2$, we obtain the conclusion of the lemma.
∎
We then follow the proof of Lemma S28 and use (S4.92) and Lemma S29 to obtain the first statement of the lemma. For the second statement of the lemma, observe that we may use Lemma S29 and our assumption on p(τ∗−τ∗) to obtain
[TABLE]
as desired.
∎
Lemma S27.
If $\|\hat{\mu}-\mu\|_{K}\,p\log(ep)\leq 1$ and $p(\tau^{*}-\tau_{*})\leq 1/2$, then
[TABLE]
Proof.
By Lemma S23, the fact that log(1+x)≤x for all x∈[0,∞), and the fact that γ≥1,
[TABLE]
By Lemma S25, Lemma S28, and (S4.92) in the proof of Lemma S28, we have that
[TABLE]
We observe that h~n is the density of the K^-contour measure (see definition above Lemma S24) of the probability measure induced by f0(⋅+μ^). By (S4.86) and Lemma S30, it holds that
We define r0,r~0:(−∞,ϕ0(0)]→[0,∞) by r0(t):=sup{r≥0:ϕ0(r)≥t} and r~0(t):=sup{r≥0:ϕ~0(r)≥t}; we also write ξ:=μ^−μ. Observe that $\tilde{\phi}_{0}\bigl(\tau^{*}(r_{0}(t)+\|\xi\|_{\hat{K}}+\epsilon)\bigr)<t$ for any t∈(−∞,ϕ0(0)] and ϵ>0, so r~0(t)≤τ∗(r0(t)+∥ξ∥K^). By Fubini’s theorem,
[TABLE]
Let us first assume p≥2 and define r′, ϕ∗, and r∗ as in the proof of Lemma S25. Recall from the proof of Lemma S25 that r∗(t)≥r0(t) for all t∈(−∞,ϕ0(0)]. Since τ∗≥1, it also holds that τ∗p(r0(t)+∥ξ∥K^)p−r0(t)p≤τ∗p(r∗(t)+∥ξ∥K^)p−r∗(t)p for all t∈(−∞,ϕ0(0)]. We may now follow the proof of Lemma S25 and apply the assumption on ∥ξ∥Kplog(ep) and a change of variable to obtain
[TABLE]
Now, in order to bound the first term of (S4.89), we use (S4.79), the assumptions on τ∗ and ∥ξ∥K, and the fact that ∥ξ∥K^≤∥ξ∥Kτ∗≤2∥ξ∥K to obtain
[TABLE]
To bound the second term of (S4.89), we again follow the proof of Lemma S25 and use (S4.81) and (S4.82):
The desired result therefore follows from (S4.87).
If p=1, then we define ϕ∗ as in (S4.85) and obtain the same bound. Thus, in all cases,
[TABLE]
as required.
∎
Lemma S29.
If p(1−τ∗)≤1/2, then
[TABLE]
Proof.
We first prove the upper bound on λp(K^)/λp(K). Since −plog(1−x/p)≤−log(1−x) for x∈(0,1), we have
[TABLE]
when p(1−τ∗)≤1/2, as required.
For the lower bound on λp(K^)/λp(K), by the fact that −plog(1+x/p)≥log(1−x) for x∈(0,1), we have
[TABLE]
as required.
∎
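The two elementary inequalities invoked in this proof, $-p\log(1-x/p)\leq-\log(1-x)$ and $-p\log(1+x/p)\geq\log(1-x)$ for $x\in(0,1)$, can be sanity-checked numerically. The following is a minimal sketch; the helper names, the choice of dimensions and the grid are illustrative assumptions and are not part of the paper.

```python
import math

def upper_gap(p, x):
    """-log(1 - x) minus -p*log(1 - x/p); nonnegative if the first inequality holds."""
    return -math.log(1.0 - x) + p * math.log(1.0 - x / p)

def lower_gap(p, x):
    """-p*log(1 + x/p) minus log(1 - x); nonnegative if the second inequality holds."""
    return -p * math.log(1.0 + x / p) - math.log(1.0 - x)

def min_gaps(dims=(1, 2, 5, 50, 1000), grid=200):
    """Smallest gap of each inequality over a grid of x in (0, 1) and several p."""
    xs = [(i + 0.5) / grid for i in range(grid)]  # interior points of (0, 1)
    up = min(upper_gap(p, x) for p in dims for x in xs)
    lo = min(lower_gap(p, x) for p in dims for x in xs)
    return up, lo

if __name__ == "__main__":
    up, lo = min_gaps()
    print("first inequality holds on grid:", up >= -1e-12)
    print("second inequality holds on grid:", lo >= -1e-12)
```

The first inequality is Bernoulli's inequality $(1-x/p)^{p}\geq 1-x$ in logarithmic form, and the second follows from $(1+x/p)^{p}\leq e^{x}\leq 1/(1-x)$; the grid check only illustrates these facts and does not replace them.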
Lemma S30.
Let K∈K and let f be a density on Rp with corresponding distribution ν. Let h:[0,∞)→[0,∞) be the density of the K-contour measure (see the definition above Lemma S24) of ν. Let g:[0,∞)→[0,∞) be a continuous function such that ∫Rpg(∥x∥K)dx=1 and suppose that
[TABLE]
Then
[TABLE]
Proof.
By condition (S4.93), we may define a signed measure π on (Rp,B(Rp)) by $\pi(A):=\int_{A}f(x)\log\frac{f(x)}{g(\|x\|_{K})}\,dx$ for any Borel measurable A. Let π1K denote the K-contour measure of π. Since π≪λp, we have by Lemma S24 that π1K≪λ1. Let L~:[0,∞)→[0,∞) denote the Radon–Nikodym derivative of π1K with respect to λ1. For any r∈(0,∞) and ϵ∈(0,r), define Ar,ϵ:=(r+ϵ)K∖rK.
We observe that, by (S4.93) and the fact that the integral in that assumption is also non-negative by the Gibbs inequality, π is locally bounded on Rp in the sense that ∣π(B)∣<∞ for every bounded B∈B(Rp) and therefore, π1K is locally bounded on [0,∞). Then, by the Lebesgue differentiation theorem (Folland, 2013, Theorem 3.21), for λ1-almost every r∈(0,∞),
[TABLE]
Moreover, we claim that π≪ν. To see this, let A∈B(Rp) be such that ν(A)=0. Then, by definition of ν, we have that f is λp-almost everywhere 0 on A and thus we conclude that π(A)=0 as well, which establishes the claim. By (S4.93) again, it must be that ν1K(g−1(0))=ν({x:g(∥x∥K)=0})=0 and thus π1K(g−1(0))=0. Hence, we have that L~(r)=0 for λ1-almost every r∈[0,∞) such that g(r)=0.
For r>0, let us define
[TABLE]
and define L(0)=0. For any r such that g(r)>0, we have by continuity of g that
[TABLE]
so L(r)=L~(r) for λ1-almost every r.
We also observe that ν is locally bounded since f is a density. Thus, ν1K is locally bounded and consequently, there exists $E\in\mathcal{B}\bigl((0,\infty)\bigr)$ such that limϵ↘0ϵ−1∫Ar,ϵf(x)dx=h(r) for every r∈E and $\lambda_{1}\bigl([0,\infty)\setminus E\bigr)=0$. Fix ϵ>0 and r∈E with g(r)>0. Since y↦ylogy is convex on [0,∞) (with the convention that 0log0=0), we have by Jensen’s inequality that
[TABLE]
Hence, from (S4.94) and the equality case of Minkowski’s first inequality for convex bodies (Gardner, 2002, Section 5),
[TABLE]
We therefore conclude that
[TABLE]
as required.
∎
S4.3.2 Auxiliary lemmas for the adaptive risk bounds of Section 4
Lemma S31.
Suppose that p(τ∗−τ∗)≤1/2 and that ∥μ^−μ∥K≤p1/2. If ϕ0′ is absolutely continuous and differentiable, and there exists D0>0 such that infr∈[0,∞)ϕ0′′(r)≥−D0, then
[TABLE]
Proof.
We have by absolute continuity of ϕ0′ that
[TABLE]
For any a∈[1/2,2], the function r↦apλp(K)(ar)p−1eϕ0(ar) is a density on [0,∞). Moreover, since the function r↦supa∈[1/2,2]apλp(K)(ar)p−1eϕ0(ar) is finite when integrated over [0,∞), we may, by Ash and Doleans-Dade (1999, Exercise 1.6.3), differentiate under the integral to obtain
[TABLE]
where the final inequality follows from Bobkov (2003, Lemma 1). From (S4.95) and (S4.3.2), we deduce that
[TABLE]
Using the inequality ∣ez−1∣≤∣z∣(1+ez) for z∈R, and writing z(x):={ϕ0(∥x−μ^∥K^)−ϕ0(∥x−μ∥K)}/2, we have
[TABLE]
As a shorthand, write $\Delta:=\tau^{*}-\tau_{*}$, $\Delta':=\tau_{*}^{-1}-(\tau^{*})^{-1}$ and $\xi:=\hat{\mu}-\mu$. By Lemma S22, Taylor’s theorem, and the facts that $\tau^{*}\in[1,2]$ and $\tau_{*}\in[1/2,1]$ by our assumption on $p(\tau^{*}-\tau_{*})$, we have that
where the final inequality follows because $\|\xi\|_{\hat{K}}\leq\tau^{*}\|\xi\|_{K}$ and $\Delta'=\Delta/(\tau^{*}\tau_{*})$. Combining (S4.98), (S4.100) and (S4.101) yields the desired conclusion.
∎
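The proof above relies on the elementary bound $|e^{z}-1|\leq|z|(1+e^{z})$ for $z\in\mathbb{R}$. As a quick numerical illustration, the sketch below evaluates the slack of this bound on a grid; the function names and the grid range are illustrative assumptions, not taken from the paper.

```python
import math

def exp_slack(z):
    """|z|(1 + e^z) - |e^z - 1|; nonnegative whenever the inequality holds at z."""
    return abs(z) * (1.0 + math.exp(z)) - abs(math.expm1(z))

def min_exp_slack(lo=-30.0, hi=30.0, steps=6001):
    """Smallest slack over an evenly spaced grid on [lo, hi]."""
    zs = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return min(exp_slack(z) for z in zs)

if __name__ == "__main__":
    print("inequality holds on grid:", min_exp_slack() >= -1e-12)
```

For $z\geq 0$ the bound follows from $e^{z}-1=\int_{0}^{z}e^{t}\,dt\leq ze^{z}$, and for $z<0$ from $1-e^{z}\leq|z|$, so equality occurs only at $z=0$.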
Lemma S32.
There exist universal constants c1,c2>0 such that if p(τ∗−τ∗)≤c2 and ∥μ^−μ∥Kplog(ep)≤c1 and if ϕ0′ is absolutely continuous and differentiable and there exists D0>0 such that infr∈[0,∞)ϕ0′′(r)≥−D0, then
[TABLE]
and
[TABLE]
Proof.
We first note that (S4.97) holds since we have the same assumptions on ϕ0 as Lemma S31. Define ψ:Rp→[0,∞) by
[TABLE]
so that ϕ~0(∥x∥K^)=ϕ0(ψ(x)). For x∈Rp, write z(x):={ϕ0(ψ(x))−logγ−ϕ0(∥x∥K^)}/2. By Lemma S1 and the inequality ∣ez−1∣≤∣z∣(1+ez) for z∈R,
[TABLE]
Let us write ξ:=μ^−μ. Since 0≤∥x∥K^−ψ(x)≤(1−τ∗−1)∥x∥K^+∥ξ∥K^, we have by Taylor’s theorem that
[TABLE]
Now, write Δ:=τ∗−τ∗, and note that Δ≥1−1/τ∗. By Lemma S28 and the fact that γ≥1 (a consequence of Lemma S23), we have that
[TABLE]
Therefore, we have by Lemma S1, (S4.97), Lemma S34, (S4.106), and Lemma S29 that
[TABLE]
Similarly, but using Lemma S35 instead of Lemma S34, we have that
[TABLE]
The first statement of the lemma then follows from (S4.105), (S4.3.2) and (S4.3.2). For the second statement, note that, by our assumption on p(τ∗−τ∗),
[TABLE]
as desired.
∎
Lemma S33.
There exist universal constants c1,c2>0 such that if p(τ∗−τ∗)≤c2 and ∥μ^−μ∥Kplog(ep)≤c1 and if ϕ0′ is absolutely continuous and differentiable and there exists D0>0 such that infr∈[0,∞)ϕ0′′(r)≥−D0, then
We also observe that h~n is the density of the K^-contour measure of the probability distribution induced by f0(⋅+μ^).
As a shorthand, for x∈Rp, let
[TABLE]
Then, by Lemma S30 (which is applicable by (S4.109)) and the fact that (1+z)log(1+z)≤z+z2 for any z∈(−1,∞), we have
[TABLE]
Define z(x):=ϕ0(∥x−μ∥K)+logγ−ϕ~0(∥x−μ^∥K^) for x∈Rp. By the fact that (ez−1)≤∣z∣(1+ez) for all z∈R, by Lemma S23, and by Lemma S28,
[TABLE]
where we used (S4.101) and (S4.3.2) to obtain the final inequality. The lemma therefore follows from our assumption on p(τ∗−τ∗).
∎
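The bound $(1+z)\log(1+z)\leq z+z^{2}$ for $z\in(-1,\infty)$ used in this proof follows from $\log(1+z)\leq z$ after multiplying by $1+z>0$. The following sketch checks it on a grid; the helper names and the grid endpoints are illustrative assumptions.

```python
import math

def xlogx_slack(z):
    """z + z^2 - (1 + z) log(1 + z); nonnegative whenever the inequality holds at z."""
    return z + z * z - (1.0 + z) * math.log1p(z)

def min_xlogx_slack(lo=-0.999, hi=50.0, steps=5000):
    """Smallest slack over an evenly spaced grid on [lo, hi], with lo > -1."""
    zs = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return min(xlogx_slack(z) for z in zs)

if __name__ == "__main__":
    print("inequality holds on grid:", min_xlogx_slack() >= -1e-12)
```

Near $z=0$ the slack behaves like $z^{2}/2$, which is why a quadratic correction term suffices in the display above.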
Lemma S34.
For any s∈N,
[TABLE]
Proof.
By Feng et al. (2018, Proposition S2(iii)) we have $h_{0}(r)\leq\frac{1}{\sigma_{h_0}}e^{-\frac{1}{\sigma_{h_0}}|r-\mu_{h_0}|+1}$. Moreover, by Bobkov (2003, Lemma 1), $\mu_{h_0}\in[(p-1)^{1/2},p^{1/2}]$ and $\sigma_{h_0}\leq 1$. Hence
[TABLE]
as desired.
∎
Lemma S35.
There exist universal constants c1,c2>0 such that if ∥μ^−μ∥Kplog(ep)≤c1 and p(τ∗−τ∗)≤c2, then
[TABLE]
Proof.
Assume that ∥μ^−μ∥Kplog(ep)≤c1 and p(τ∗−τ∗)≤c2 where c1,c2>0 are universal constants chosen such that dH2(h~0,h0)≤2−16; the existence of such a choice of c1,c2 is guaranteed by Lemma S26. Consequently, by Lemma S12, there exist universal constants Cμ′>0 and Cσ′>1 such that Cσ′−1≤σh~0/σh0≤Cσ′ and ∣μh~0−μh0∣≤Cμ′. Moreover, by Bobkov (2003, Lemma 1), μh0∈[(p−1)1/2,p1/2] and σh0≤1.
Therefore, by Feng et al. (2018, Proposition S2(iii)), we have that
We first describe the common setting for all the lemmas in this subsection as well as define some notation and quantities used throughout. We fix n≥8, p≥2 and K∈K; we assume, for some fixed 0<r1≤r2<∞, that Bp(0,r1)⊆K⊆Bp(0,r2) and write r0:=r2/r1. We suppose that X1,…,Xn+M∼iidf0∈FpK. Recall from Algorithm 1 then that θm=Xn+m/∥Xn+m∥ for m∈[M]. Define
[TABLE]
For m∈[M], we also define
[TABLE]
Note that with this definition, for any m∈[M], the random quantity Imk as defined in Algorithm 1 satisfies Imk=∣{i∈[n]:Xi∈Xmk}∣. For ϵ∈(0,1], define the spherical cone with centre θ∈Sp−1 as $S(\theta,\epsilon):=\bigl\{x\in\mathbb{R}^{p}:x^{\top}\theta\geq\bigl(1-\tfrac{1}{2}\epsilon^{2}\bigr)\|x\|_{2}\bigr\}$. We then define the events
[TABLE]
For ϵ>0, we say that a finite set Nϵ⊆Sp−1 is an ϵ-net of Sp−1 if for every x∈Sp−1, there exists y∈Nϵ such that x∈Sp−1∩S(y,ϵ).
The key results of this subsection are Lemmas S36 and S43, both of which are used in the proof of Proposition 13.
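To make the spherical cone $S(\theta,\epsilon)$ concrete: for a unit vector $x$ we have $\|x-\theta\|_{2}^{2}=2-2x^{\top}\theta$, so $x\in S(\theta,\epsilon)$ exactly when $\|x-\theta\|_{2}\leq\epsilon$. An $\epsilon$-net in the sense above is therefore a set of unit vectors such that every point of the sphere lies within Euclidean distance $\epsilon$ of some net point. The sketch below illustrates this equivalence; the function names and the test vectors are illustrative assumptions.

```python
import math

def in_cone(x, theta, eps):
    """Membership in S(theta, eps) = {x : x.theta >= (1 - eps^2 / 2) * ||x||_2}."""
    dot = sum(a * b for a, b in zip(x, theta))
    norm = math.sqrt(sum(a * a for a in x))
    return dot >= (1.0 - 0.5 * eps * eps) * norm

def chord_distance(x, theta):
    """Euclidean distance between x and theta."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, theta)))

if __name__ == "__main__":
    theta = (1.0, 0.0, 0.0)
    eps = 0.5
    for angle in (0.4, 0.6):
        x = (math.cos(angle), math.sin(angle), 0.0)  # unit vector at this angle from theta
        print(angle, in_cone(x, theta, eps), chord_distance(x, theta) <= eps)
```

Because the defining inequality is homogeneous in $x$, membership is unchanged when $x$ is rescaled by a positive constant, which is what makes $S(\theta,\epsilon)$ a cone.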
Lemma S36.
Suppose that Ef0(∥X1∥K)=Ef0(∥X1∥)=1 and that r0M−1/(p−1)log3n≤1/2. For m∈[M], let tm be defined as in Algorithm 1. Then there exists C1,p,r0>0, depending only on p and r0, such that, with probability at least 1−C1,p,r0/n,
[TABLE]
Proof.
For m∈[M], let sm:=∣Imk∣−1∑i∈Imk∥Xi∥K. We have that r0ϵ2≤1/2 under our assumptions on r0M−1/(p−1)log3n. Thus, on the event E, we have by Lemma S39 that
[TABLE]
and that
[TABLE]
and therefore
[TABLE]
Define the event $\mathcal{E}_{2}:=\bigl\{\max_{m\in[M]}|s_{m}-\mathbb{E}s_{m}|\leq 4\bigl(\frac{M\log^{5}n}{n}\bigr)^{1/2}\bigr\}$. To bound P(E2c), choose N0⊆[n] with n0:=∣N0∣≥n~. Let Em(N0) be the event that Imk=N0. Then, by Proposition 3 and Karlin et al. (1961, Theorem 5), for any s≥2,
[TABLE]
By Proposition 3 again, Bernstein’s inequality (Boucheron et al., 2013, Corollary 2.11), and the fact that n0−1/2log1/2n≤1/2 under our assumption on n and r0M−1/(p−1)log3n, we therefore have that for each m∈[M],
[TABLE]
Hence, for each m∈[M],
[TABLE]
Thus, using a union bound over m∈[M], Lemma S38 (which we may apply since n~>1 under our assumption on r0M−1/(p−1)log3n), and the fact that M≤n,
[TABLE]
Now, under the assumption that Ef0(∥X1∥K)=1, we have Esm=1. Moreover,
[TABLE]
so r2≤r0. Thus, on the event E∩E2, for each m∈[M],
[TABLE]
Finally, by Lemma S37 and (S4.114), there exists C1,p,r0>0, depending only on p and r0, such that P(E∩E2)≥1−C1,p,r0/n as required.
∎
Lemma S37.
If ϵ2≤1, then there exists C2,p,r0>0, depending only on p and r0, such that
[TABLE]
Proof.
Let ϵ~2:=ϵ2/(4logn). By Lemma S41, there exists an ϵ~2-net Nϵ~2 of Sp−1 such that ∣Nϵ~2∣≤Cp′ϵ~2−(p−1) for some Cp′>0 depending only on p. Let Enet,1 be the event that for every y∈Nϵ~2, there exists some m∈[M] such that θm∈Sp−1∩S(y,ϵ~2). By Lemma S40 and a union bound, we have that
[TABLE]
say. Now for any y∈Sp−1, there exist y1,…,yk∈Sp−1 such that S(yj,ϵ~2)⊆S(y,ϵ2/2) for j∈[k], that y∉S(yj,ϵ~2) for any j and that S(yj,ϵ~2)∩S(yj′,ϵ~2)=∅ for j≠j′. Thus, on the event Enet,1, for any θm, there exist m1,…,mk∈[M] not equal to m such that θm1,…,θmk∈Sp−1∩S(θm,ϵ2/2). Thus, on the event Enet,1, we have Xmk⊆S(θm,ϵ2) for every m.
Now let ϵ~1:=2ϵ1 and note that 2ϵ1≤ϵ2≤1. For any m∈[M], let Am:={m′∈[M]:θm′∈S(θm,ϵ~1)}. Let Enet,2 be the event that maxm∈[M]∣Am∣≤k. Then by Lemma S40 again,
[TABLE]
say. On the event Enet,2, we have S(θm,ϵ1)⊆Xmk for every m∈[M]. Setting C2,p,r0:=C2,p,r0′+C2,p,r0′′ and applying a union bound, we obtain the conclusion of the lemma as desired.
∎
Lemma S38.
If ϵ2≤1 and n~>1, then there exists C3,p,r0>0, depending only on p and r0, such that
[TABLE]
Proof.
Since n~>1, we have that Mn≥log4n. Thus, we have by Lemma S40 that for n≥8 (so that n~<n/2),
Fix ϵ∈(0,1] and let Sb(θ,ϵ):={x∈Bp(0,1):x⊤θ=(1−ϵ2/2)} so that Sb(θ,ϵ) is the base of the spherical cap Bp(0,1)∩S(θ,ϵ). Define also St(θ,ϵ):=S(θ,ϵ)∩{x∈Rp:x⊤θ=1}. We observe then that
[TABLE]
We therefore have that
[TABLE]
For the lower bound, we have
[TABLE]
as desired.
∎
The following lemma is well-known and follows from the fact that the surface area of a spherical cap Sp−1∩S(θ,ϵ) scales as ϵp−1 up to a multiplicative constant depending only on p (Li, 2011). We omit the proof for brevity.
Lemma S41.
There exists Cp′>0, depending only on p, such that, for every ϵ∈(0,1], there exists an ϵ-net Nϵ of Sp−1 of cardinality ∣Nϵ∣≤Cp′ϵ−(p−1).
For K∈K, let hK be the support function of K, i.e., hK(u):=sup{x⊤u:x∈K} for u∈Sp−1. For any u∈Sp−1 and ϵ∈(0,1), define CK(u,ϵ):={x∈K:u⊤x≥hK(u)−ϵ}. For A,B⊆Rp, we define the Hausdorff distance dHaus(A,B):=inf{ϵ>0:A⊆B+ϵBp(0,1),B⊆A+ϵBp(0,1)}.
Theorem S42.
Let K∈K with K⊆Bp(0,1) and let ν be a probability distribution supported on K. Suppose there exist α≥1, L>0 and ϵ0>0 such that $\nu\bigl(C_{K}(u,\epsilon)\bigr)\geq L\epsilon^{\alpha}$ for all ϵ∈(0,ϵ0]. Let Y1,…,YM be independent random vectors with distribution ν and let $\tilde{K}:=\mathrm{conv}\{Y_{1},\ldots,Y_{M}\}$. Let τ1:=max(1,p/(αL)) and aM:=(τ1M−1logM)1/α. If 4aM≤ϵ0, then
[TABLE]
Lemma S43.
Let Z1,…,ZM be independent random vectors distributed uniformly on K. Let Ym:=Zm/∥Zm∥K for m∈[M], and let K~:=conv{Y1,…,YM}. If $M/\log M\geq r_{0}^{2(p-1)}\,64^{p-1}$, then
[TABLE]
Proof.
Since dscale is scale invariant, we assume without loss of generality that r2=1 so that r0=1/r1. Let ν denote the distribution of Y1. We claim that ν satisfies the hypothesis of Theorem S42 with L=r1p−1/(2p), ϵ0=r1, and α=p−1.
To see this, let ϵ∈(0,r1] and u∈Sp−1 be arbitrary and let x∗∈∂K be such that hK(u)=u⊤x∗. We write $C^{0}_{K}(u,\epsilon):=\mathrm{conv}\bigl\{\{0\}\cup C_{K}(u,\epsilon)\bigr\}$. Let D:={x∈K:x⊤u=hK(u)−ϵ} be the base of CK(u,ϵ) and define $D^{0}:=\mathrm{conv}\bigl\{\{0\}\cup D\bigr\}$ and $D^{x^{*}}:=\mathrm{conv}\bigl\{\{x^{*}\}\cup D\bigr\}$. Let C∗ be the conical pyramid connecting Bp(0,r1)∩{x∈Rp:x⊤u=0} and x∗ and let D∗=C∗∩D. Then, by an application of Corollary S45, the fact that hK(u)≤sup{∥x∥2:x∈K}≤1, the fact that ϵ≤r1≤hK(u), and the fact that κp−1/κp≥1/2,
[TABLE]
Let τ1 and aM be defined as in Theorem S42. Since $p/(\alpha L)=\frac{2p^{2}}{p-1}r_{1}^{-(p-1)}\geq 1$, we have that τ1=max(1,p/(αL))=p/(αL). Moreover, $p/(\alpha L)\leq 8^{p-1}/r_{1}^{p-1}$ and so $a_{M}\leq\frac{8}{r_{1}}\bigl(\frac{\log M}{M}\bigr)^{1/(p-1)}$. By the assumption on M/logM, we have that 4aM≤ϵ0. Thus, writing
We now work on the event EHaus, which by our assumption on M/logM implies that dHaus(K~,K)<r1/2. We can therefore fix ϵ∈(r1−1dHaus(K~,K),1/2]. Since r2=1, we have K⊆K~+ϵr1Bp(0,1)⊆K~+ϵK. Applying this recursively, we obtain $K\subseteq\bigl(\sum_{r=0}^{R}\epsilon^{r}\bigr)\tilde{K}+\epsilon^{R+1}K$ for any R∈N. Because $\sum_{r=0}^{R}\epsilon^{r}\leq 1/(1-\epsilon)\leq 1+2\epsilon$ when ϵ≤1/2, it holds that for any x∈K, there exist sequences (yR)R∈N and (zR)R∈N such that yR∈(1+2ϵ)K~, zR∈ϵR+1K and x=yR+zR. Thus, ∥x−yR∥≤ϵR+1 for all R∈N and so x is a limit point of (1+2ϵ)K~. Since K~ is closed, we conclude that x∈(1+2ϵ)K~ and hence that K⊆(1+2ϵ)K~. On the other hand, we have that K~⊆K by definition and so dscale(K~,K)≤2ϵ. Since ϵ∈(r1−1dHaus(K~,K),1/2] was chosen arbitrarily, we have that on the event EHaus,
[TABLE]
as desired.
∎
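The recursion in the proof above hinges on the partial geometric sums satisfying $\sum_{r=0}^{R}\epsilon^{r}\leq 1/(1-\epsilon)\leq 1+2\epsilon$ for $\epsilon\in(0,1/2]$, uniformly in $R$. A minimal numerical sketch of this step follows; the function names and the chosen values of $\epsilon$ and $R$ are illustrative assumptions.

```python
def partial_geometric_sum(eps, R):
    """Sum of eps**r for r = 0, ..., R."""
    return sum(eps ** r for r in range(R + 1))

def series_bound_holds(eps, max_R=60, tol=1e-12):
    """Check sum_{r<=R} eps^r <= 1 + 2*eps for every R up to max_R."""
    return all(partial_geometric_sum(eps, R) <= 1.0 + 2.0 * eps + tol
               for R in range(max_R + 1))

if __name__ == "__main__":
    for eps in (0.01, 0.1, 0.3, 0.49, 0.5):
        print(eps, series_bound_holds(eps))
```

The bound is tight at $\epsilon=1/2$, where the full series sums to $2=1+2\epsilon$, and it fails for larger values such as $\epsilon=0.6$, which is consistent with the restriction $\epsilon\leq 1/2$ in the proof.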
Recall that for m∈(0,p], the m-dimensional Hausdorff outer measure of E⊆Rp is defined as
[TABLE]
Note that by this definition, if E⊆Rp is Lebesgue measurable, then λp,p(E)=λp(E) (e.g., Mattila, 1999, Section 4.3). We extend the definition of λp−1 to Borel subsets of Rp by writing λp−1(A):=λp−1,p(A) for a Borel measurable A⊆Rp. The next lemma shows that the Hausdorff measure contracts under a 1-Lipschitz map.
Lemma S44.
Let ϕ:Rp→Rp be a 1-Lipschitz mapping in the sense that for any x,y∈Rp, ∥ϕ(x)−ϕ(y)∥2≤∥x−y∥2. For m≤p, let λm,p be the m-dimensional Hausdorff outer measure as defined in (S4.120). Then, for any A⊆Rp,
[TABLE]
Proof.
For any E⊆Rp, we have that diam(E)≥diam(ϕ(E)). The claim follows immediately.
∎
Corollary S45.
Let A,B⊆Rp be compact convex sets with non-empty interiors. If B⊆A, then λp−1(∂A)≥λp−1(∂B).
Proof.
For any x∈Rp, let ϕ(x) denote the Euclidean projection onto B. To see that ϕ is 1-Lipschitz, let x,y∈Rp and let x′:=ϕ(x) and y′:=ϕ(y). If x′=y′ then certainly ∥x′−y′∥2≤∥x−y∥2. On the other hand, if x′≠y′, then, by the optimality conditions of x′,y′,
[TABLE]
We may then apply the Cauchy–Schwarz inequality to obtain that ∥x′−y′∥2≤∥x−y∥2.
If x∈∂A, then ϕ(x)∈∂B because otherwise we can find a point in B on the line segment between x and ϕ(x) that is closer to x than is ϕ(x). Thus, ϕ(∂A)⊆∂B. For any x′∈∂B, by the separating hyperplane theorem (e.g. Rockafellar, 1997, Theorem 11.6), there exists α∈Rp and b∈R such that α⊤x′+b=0 and α⊤z+b≤0 for all z∈B. Since B⊆A, there exists x∈∂A such that x=x′+cα for some c∈[0,∞). Moreover, for any z∈B,
[TABLE]
so ϕ(x)=x′. Thus, ϕ(∂A)=∂B and the result follows from Lemma S44.
∎
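The argument above can be illustrated in the simplest case where $B$ is a box, for which the Euclidean projection is coordinatewise clamping. The sketch below checks empirically that this projection is 1-Lipschitz, that it sends a boundary point of a larger rectangle $A$ to the boundary of $B$, and that the perimeters (the case $p=2$ of $\lambda_{p-1}(\partial\,\cdot)$) are ordered as the corollary states. The specific sets, sample sizes and function names are illustrative assumptions.

```python
import math
import random

def project_box(x, half_width=1.0):
    """Euclidean projection onto the box [-half_width, half_width]^p (coordinatewise clamp)."""
    return [max(-half_width, min(half_width, a)) for a in x]

def dist(x, y):
    """Euclidean distance between two points given as sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def max_lipschitz_ratio(p=5, trials=2000, seed=0):
    """Largest observed ratio ||phi(x) - phi(y)|| / ||x - y|| over random pairs."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        x = [rng.uniform(-3.0, 3.0) for _ in range(p)]
        y = [rng.uniform(-3.0, 3.0) for _ in range(p)]
        d = dist(x, y)
        if d > 0.0:
            worst = max(worst, dist(project_box(x), project_box(y)) / d)
    return worst

def rectangle_perimeter(a, b):
    """Perimeter of the rectangle [-a, a] x [-b, b]."""
    return 4.0 * (a + b)

if __name__ == "__main__":
    print("largest Lipschitz ratio observed:", max_lipschitz_ratio())
    # B = [-1, 1]^2 is contained in A = [-2, 2] x [-1.5, 1.5].
    print("perimeters (A, B):", rectangle_perimeter(2.0, 1.5), rectangle_perimeter(1.0, 1.0))
    print("image of a boundary point of A:", project_box([2.0, 0.3]))
```

Every boundary point of $A$ has a coordinate outside the open interval $(-1,1)$, so its clamped image has a coordinate equal to $\pm 1$ and hence lies on $\partial B$, matching the inclusion $\phi(\partial A)\subseteq\partial B$ used in the proof.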
1. dscale(K′,K)=dscale(K,K′) for any K,K′∈K.
2. For any K,K′,K′′∈K such that dscale(K,K′)<1 or dscale(K′,K′′)<1, it holds that dscale(K,K′′)≤2dscale(K,K′)+2dscale(K′,K′′).
Proof.
The first claim follows from the fact that (1+ϵ)−1K⊆K′ if and only if K⊆(1+ϵ)K′.
For the second claim, let K,K′,K′′∈K, and suppose without loss of generality that dscale(K,K′)<1. Then there exist ϵ∈(0,1] and ϵ′>0 such that (1+ϵ)−1K⊆K′⊆(1+ϵ)K and (1+ϵ′)−1K′′⊆K′⊆(1+ϵ′)K′′. Since
[TABLE]
we have that dscale(K,K′′)≤ϵ+ϵ′+ϵϵ′≤2ϵ+2ϵ′. Since this holds for any ϵ>dscale(K,K′) and ϵ′>dscale(K′,K′′), the claim follows.
∎
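As a one-dimensional illustration of both claims, assume (as the proof above suggests) that $d_{\mathrm{scale}}(K,K')$ is the infimum of $\epsilon>0$ such that $(1+\epsilon)^{-1}K\subseteq K'\subseteq(1+\epsilon)K$. For symmetric intervals $K=[-a,a]$ and $K'=[-b,b]$ this gives $d_{\mathrm{scale}}(K,K')=\max(a/b,b/a)-1$. The sketch below uses this assumed closed form; the function name and the numerical values are hypothetical.

```python
def d_scale_interval(a, b):
    """Scale distance between [-a, a] and [-b, b] under the assumed definition."""
    if a <= 0 or b <= 0:
        raise ValueError("half-widths must be positive")
    return max(a / b, b / a) - 1.0

if __name__ == "__main__":
    a, b, c = 1.0, 1.5, 2.4
    e1 = d_scale_interval(a, b)   # distance between K and K'
    e2 = d_scale_interval(b, c)   # distance between K' and K''
    e13 = d_scale_interval(a, c)  # distance between K and K''
    print("symmetric:", d_scale_interval(a, b) == d_scale_interval(b, a))
    print("e13 <= e1 + e2 + e1*e2:", e13 <= e1 + e2 + e1 * e2 + 1e-12)
    print("e13 <= 2*e1 + 2*e2:", e13 <= 2 * e1 + 2 * e2)
```

In this example the composition bound $\epsilon+\epsilon'+\epsilon\epsilon'$ is attained, since $(1+\epsilon)(1+\epsilon')=1.5\times 1.6=2.4$, whereas the relaxed bound $2\epsilon+2\epsilon'$ is strict.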
S5 Technical lemmas
Lemma S47.
Let h∈F1.
(i) There exists a universal constant c>0 such that σh≥c/h(μh);
(ii) There exists a universal constant C>0 such that σh≤C/supr∈Rh(r).
Proof.
(i) Let f(r):=σhh(σhr+μh), so f is a log-concave density with μf=0 and σf=1. Then
[TABLE]
where the final inequality follows from Lovász and Vempala (2007, Theorem 5.14(d)).
(ii) Since h is upper semi-continuous and h(r)→0 as r→±∞, there exists r0∈R such that h(r0)=supr∈Rh(r). Thus, with f as above,
[TABLE]
by Lovász and Vempala (2007, Theorem 5.14(b) and (d)).
∎
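Both parts of the lemma concern scale-free quantities: $\sigma_{h}h(\mu_{h})$ is bounded below and $\sigma_{h}\sup_{r}h(r)$ is bounded above over univariate log-concave densities. The sketch below tabulates these two products for four standard log-concave densities using their closed-form standard deviations and density values; the selection of examples and the function names are illustrative assumptions, and the printed values are not the universal constants $c$ and $C$ of the lemma.

```python
import math

# Closed-form summaries of four standard log-concave densities on the real line:
# (name, standard deviation, density at the mean, supremum of the density).
EXAMPLES = [
    ("standard normal", 1.0, 1.0 / math.sqrt(2.0 * math.pi), 1.0 / math.sqrt(2.0 * math.pi)),
    ("Laplace, scale 1", math.sqrt(2.0), 0.5, 0.5),
    ("uniform on [0, 1]", 1.0 / math.sqrt(12.0), 1.0, 1.0),
    ("exponential, rate 1", 1.0, math.exp(-1.0), 1.0),
]

def scale_free_products(scale=1.0):
    """Return {name: (sigma * h(mu), sigma * sup h)} after rescaling r -> scale * r."""
    out = {}
    for name, sd, at_mean, sup in EXAMPLES:
        # Rescaling multiplies the standard deviation and divides density values by `scale`.
        sd_s, at_mean_s, sup_s = scale * sd, at_mean / scale, sup / scale
        out[name] = (sd_s * at_mean_s, sd_s * sup_s)
    return out

if __name__ == "__main__":
    for name, (low, high) in scale_free_products().items():
        print(f"{name}: sigma*h(mu) = {low:.4f}, sigma*sup h = {high:.4f}")
```

The products are unchanged under rescaling, which is why the proof can reduce to the standardised density $f(r)=\sigma_{h}h(\sigma_{h}r+\mu_{h})$ with $\mu_{f}=0$ and $\sigma_{f}=1$.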
Lemma S48.
For any f∈F1 with σf=1, we have that
[TABLE]
Proof.
By the location invariance of the entropy functional, we may assume without loss of generality that μf=0. Since ∥f∥∞≤29 (Lovász and Vempala, 2007, Theorem 5.14(b) and (d)), we have
[TABLE]
Now let g(x):=(2π)−1/2e−x2/2, so by the non-negativity of Kullback–Leibler divergence,
[TABLE]
as required.
∎
The next lemma provides basic properties of the function ρ defined in (S4.20).
Lemma S49.
For any p∈N and a≥0, we have
(i) ρ(s) is increasing;
(ii) ρ(s)/s is decreasing;
(iii) ρ(s)∈[1,p] for all s∈(0,∞).
Proof.
(i) If a>0, define α:=(a+s)/a∈(1,∞), an increasing function of s. Then
[TABLE]
which is increasing in α, as required. If a=0, then ρ(s)=p, so the claim is also true.
(ii) If a>0, then
[TABLE]
which is decreasing in α. If a=0, then ρ(s)/s=p/s and the claim also follows.
(iii) When a=0, the result follows because ρ(s)=p for all s∈(0,∞). When a>0, the lower bound follows from (i) and the fact that lims↘0ρ(s)=1, while the upper bound follows from (i) and the fact that lims→∞ρ(s)=p.
∎
Lemma S50.
For any p≥1 and r0>a≥0, we have that
[TABLE]
Proof.
Writing x:=1−(a/r0)p, we are required to prove that for x∈(0,1],
[TABLE]
The inequality holds for sufficiently small x>0 and at x=1, and also when p=1. To finish the proof, it suffices to show that $t_{p}(x):=\frac{1-(1-x)^{(p+1)/p}}{x}$ is concave on (0,1) when p≥2. But for x∈(0,1),
[TABLE]
as required.
∎
Lemma S51.
Let p≥1 and x∈[−1/p,1/p]. Then
[TABLE]
Proof.
By Taylor’s theorem, there exists xˉ on the line segment joining 0 and x such that
[TABLE]
The lemma follows by noting that (1+xˉ)p−2≤e.
∎
Bibliography
Adamczak, R., Litvak, A., Pajor, A. and Tomczak-Jaegermann, N. (2010). Quantitative estimates of the convergence of the empirical covariance matrix in log-concave ensembles, J. Amer. Math. Soc. 23: 535–561.
Ash, R. B. and Doleans-Dade, C. A. (1999). Probability and Measure Theory, 2nd Edition, Academic Press, San Diego, CA, USA.
Ball, K. (1997). An elementary introduction to modern convex geometry, in S. Levy (ed.), Flavors of Geometry, Vol. 31, Cambridge University Press, Cambridge, UK, pp. 1–58.
Bhattacharya, S. and Bickel, P. J. (2012). Adaptive estimation in elliptical distributions with extensions to high-dimensions, Preprint. Available at http://www.science.oregonstate.edu/~bhattash/Site/Research.html
Biau, G. and Devroye, L. (2003). On the risk of estimates for block decreasing densities, J. Mult. Anal. 86: 143–165.
Billingsley, P. (1995). Probability and Measure, Wiley, New York.
Bobkov, S. G. (2003). Spectral gap and concentration for some spherically symmetric probability measures, Geometric Aspects of Functional Analysis, Springer, pp. 37–43.