Optimal Kullback-Leibler Aggregation in Mixture Density Estimation by Maximum Likelihood
Arnak S. Dalalyan, Mehdi Sebbar

TL;DR
This paper analyzes the maximum likelihood estimator for mixture density estimation, establishing risk bounds and optimal rates under Kullback-Leibler loss, especially in high-dimensional and sparse settings.
Contribution
It provides sharp oracle inequalities and optimal convergence rates for the MLE in mixture models, including sparse and high-dimensional cases.
Findings
MLE attains the optimal rate ((log K)/n)^{1/2} in convex aggregation.
Under compatibility conditions, the estimator achieves the optimal sparse rate (D log K)/n.
Introduces nearly-D-sparse aggregation and matching lower bounds.
Abstract
We study the maximum likelihood estimator of the density of $n$ independent observations, under the assumption that it is well approximated by a mixture with a large number of components. The main focus is on statistical properties with respect to the Kullback-Leibler loss. We establish risk bounds taking the form of sharp oracle inequalities both in deviation and in expectation. A simple consequence of these bounds is that the maximum likelihood estimator attains the optimal rate $((\log K)/n)^{1/2}$, up to a possible logarithmic correction, in the problem of convex aggregation when the number $K$ of components is larger than $n^{1/2}$. More importantly, under the additional assumption that the Gram matrix of the components satisfies the compatibility condition, the obtained oracle inequalities yield the optimal rate in the sparsity scenario. That is, if the weight vector is (nearly) $D$-sparse, we get the rate $(D\log K)/n$. As a natural complement to our oracle inequalities, we introduce the notion of nearly-$D$-sparse aggregation and establish matching lower bounds for this type of aggregation.
1 Introduction
Assume that we observe independent random vectors drawn from a probability distribution that admits a density function with respect to some reference measure . The goal is to estimate the unknown density by a mixture density. More precisely, we assume that for a given family of mixture components , the unknown density of the observations is well approximated by a convex combination of these components, where
[TABLE]
The assumption that the component densities are known essentially means that they are chosen from a dictionary obtained on the basis of previous experiments or expert knowledge.
We focus on the problem of estimation of the density function and the weight vector from the simplex under the sparsity scenario: the ambient dimension can be large, possibly larger than the sample size , but most entries of are either equal to zero or very small.
Our goal is to investigate the statistical properties of the Maximum Likelihood Estimator (MLE), defined by
[TABLE]
where the minimum is computed over a suitably chosen subset of . In the present work, we will consider sets , depending on a parameter and the sample , defined by
[TABLE]
Note that the objective function in (3) is convex, and the same is true for the set (4). Therefore, the MLE can be computed efficiently even for large $K$ by solving a convex programming problem. To ease notation, we will very often omit the dependence of on and write instead of .
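Since the components are known, the likelihood depends on the data only through the matrix of component values at the observations. The following is a minimal sketch, assuming the search space is the whole simplex and using the classical EM fixed-point update for mixture weights, which is one standard way to solve this convex problem and not necessarily the solver the authors have in mind. The names `mle_weights`, `neg_log_lik` and `F`, and the toy data, are illustrative assumptions.

```python
import numpy as np

def neg_log_lik(pi, F):
    """Average negative log-likelihood of the mixture; F[i, j] = f_j(X_i)."""
    return -np.mean(np.log(F @ pi))

def mle_weights(F, n_iter=500):
    """EM-type multiplicative update; every iterate stays on the simplex."""
    n, K = F.shape
    pi = np.full(K, 1.0 / K)              # start from the uniform weights
    for _ in range(n_iter):
        mix = F @ pi                      # mixture density at each observation
        pi = pi * (F / mix[:, None]).mean(axis=0)
    return pi

# Toy data (assumption): positive component values for n = 200 points, K = 50 components.
rng = np.random.default_rng(0)
F = rng.uniform(0.5, 1.5, size=(200, 50))
pi_hat = mle_weights(F)
```

Each update multiplies a weight by the average ratio of its component to the current mixture, so the weights remain nonnegative and keep summing to one without any projection step.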
The quality of an estimator can be measured in various ways. For instance, one can consider the Kullback-Leibler divergence
[TABLE]
which has the advantage of bypassing identifiability issues. One can also consider the (well-specified) setting where for some and measure the quality of estimation through a distance between the vectors and (such as the -norm or the Euclidean norm ).
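As a numerical illustration of the Kullback-Leibler loss, the sketch below approximates the divergence between two strictly positive densities on $[0,1]$ by a midpoint rule. The function name `kl_divergence`, the interval $[0,1]$ and the two example densities are assumptions made for illustration only.

```python
import numpy as np

def kl_divergence(f_star, f, m=100_000):
    """Midpoint-rule approximation of KL(f_star || f) for positive densities on [0, 1]."""
    x = (np.arange(m) + 0.5) / m          # midpoints of m equal cells
    p, q = f_star(x), f(x)
    return float(np.mean(p * np.log(p / q)))

# Hypothetical example: a perturbed uniform density against the uniform density.
f_star = lambda x: 1.0 + 0.5 * np.sin(2 * np.pi * x)
uniform = lambda x: np.ones_like(x)
kl_value = kl_divergence(f_star, uniform)
```

The divergence vanishes when both arguments coincide and is positive otherwise, which is what makes it usable as a loss without identifying the weight vector itself.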
The main contributions of the present work are the following:
- (a) We demonstrate that in the mixture model there is no need to introduce a sparsity-favoring penalty in order to get optimal rates of estimation under the Kullback-Leibler loss in the sparsity scenario. In fact, the constraint that the weight vector belongs to the simplex acts as a sparsity-inducing penalty. As a consequence, there is no need to tune a parameter accounting for the magnitude of the penalty.
- (b) We show that the maximum likelihood estimator of the mixture density simultaneously attains the optimal rate of aggregation for the Kullback-Leibler loss for at least three types of aggregation: model-selection, convex and $D$-sparse aggregation.
- (c) We introduce a new type of aggregation, termed nearly $D$-sparse aggregation, that extends and unifies the notions of convex and $D$-sparse aggregation. We establish strong lower bounds for the nearly $D$-sparse aggregation and demonstrate that the maximum likelihood estimator attains this lower bound up to logarithmic factors.
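One elementary way to see why the simplex constraint can play the role of an $\ell_1$ penalty is the identity below, written with the simplex notation $\mathbb{B}^K_+$; it is a side remark under standard notation, not a statement taken from the paper.

```latex
\[
\bm{\pi}\in\mathbb{B}^K_+
\;\Longrightarrow\;
\|\bm{\pi}\|_1=\sum_{j=1}^K \pi_j = 1,
\qquad\text{hence}\qquad
\mathbb{B}^K_+\subset\big\{\bm{\pi}\in\mathbb{R}^K:\|\bm{\pi}\|_1\le 1\big\}.
\]
```

Minimizing over the simplex is therefore a particular case of minimization under an $\ell_1$-ball constraint whose radius is fixed to one, so no penalty level has to be chosen.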
1.1 Related work
The results developed in the present work aim to gain a better understanding (a) of the statistical properties of the maximum likelihood estimator over a high-dimensional simplex and (b) of the problem of aggregation of density estimators under the Kullback-Leibler loss. Various procedures of aggregation for density estimation have been studied in the literature with respect to different loss functions; we refer the interested reader to (Tsybakov, 2014) for an up-to-date introduction to the aggregation of statistical procedures. (Catoni, 1997; Yang, 2000; Juditsky et al., 2008) investigated different variants of the progressive mixture rules, also known as mirror averaging (Yuditskiĭ et al., 2005; Dalalyan and Tsybakov, 2012), with respect to the Kullback-Leibler loss and established model-selection type oracle inequalities in expectation, meaning that the expected loss of the aggregate is almost as small as the loss of the best element of the dictionary. The same type of guarantees, but holding with high probability, were recently obtained in (Bellec, 2014; Butucea et al., 2016) for the procedure termed $Q$-aggregation, introduced in other contexts by (Dai et al., 2012; Rigollet, 2012).
Aggregation of estimators of a probability density function under the $L_2$-loss was considered in (Rigollet and Tsybakov, 2007), where it was shown that a suitably chosen unbiased risk estimate minimizer is optimal both for convex and linear aggregation. The goal in the present work is to go beyond the settings of the aforementioned papers in that we want simultaneously to do as well as the best element of the dictionary, the best convex combination of the dictionary elements and the best sparse convex combination. Note that the latter task was coined $D$-aggregation in (Lounici, 2007) (see also (Bunea et al., 2007)). In the present work, we rename it $D$-sparse aggregation, in order to make explicit its relation to sparsity.
Key differences between the latter work and ours are that we do not assume the sparsity index to be known and we analyze an aggregation strategy that is computationally tractable even for large $K$. This is also the case of (Bunea et al., 2010; Bertin et al., 2011), which are perhaps the most relevant references to the present work. These papers deal with the $L_2$-loss and investigate the lasso and the Dantzig estimators, respectively, suitably adapted to the problem of density estimation. Their methods handle dictionary elements which are not necessarily probability density functions, but have the drawback of requiring the choice of a tuning parameter. This choice is a nontrivial problem in practice. Instead, we show here that the optimal rates of sparse aggregation with respect to the Kullback-Leibler loss can be attained by a procedure that is free of tuning parameters.
Risk bounds for the maximum likelihood and other related estimators in the mixture model have a long history (Li and Barron, 1999; Li, 1999; Rakhlin et al., 2005). For the sake of comparison we recall here two elegant results providing non-asymptotic guarantees for the Kullback-Leibler loss.
Theorem 1.1 (Theorem 5.1 in (Li, 1999)).
Let be a finite dictionary of cardinality of density functions such that . Then, the maximum likelihood estimator over , , satisfies the inequality
[TABLE]
Inequality (6) is an inexact oracle inequality in expectation that quantifies the ability of to solve the problem of model-selection aggregation. The adjective inexact refers to the fact that the “bias term” is multiplied by a factor strictly larger than one. It is noteworthy that the remainder term corresponds to the optimal rate of model-selection aggregation (Juditsky and Nemirovski, 2000; Tsybakov, 2003). In relation to Theorem 1.1, it is worth mentioning a result of (Yang, 2000) and (Catoni, 1997), see also Theorem 5 in (Lecué, 2006) and Corollary 5.4 in (Juditsky et al., 2008), establishing a risk bound similar to (6) without the extra factor for the so-called mirror averaging aggregate.
Theorem 1.2 (page 226 in (Rakhlin et al., 2005)).
Let be a finite dictionary of cardinality of density functions and let $\mathcal{C}_k=\big\{f_{\bm{\pi}}:\|\bm{\pi}\|_0\le k\big\}$ be the set of all the mixtures of at most $k$ elements of (). Assume that and the densities from are bounded from below and above by some positive constants and , respectively. Then, there is a constant depending only on and such that, for any tolerance level , the maximum likelihood estimator over , , satisfies the inequality
[TABLE]
with probability at least .
This result is remarkably elegant and can be seen as an exact oracle inequality in deviation for -sparse aggregation (for ). Furthermore, if we choose in Theorem 1.2, then we get an exact oracle inequality for convex aggregation with a rate-optimal remainder term (Tsybakov, 2003). However, it fails to provide the optimal rate for -sparse aggregation.
Closing this section, we would like to mention the recent work (Xia and Koltchinskii, 2016), where oracle inequalities for estimators of low rank density matrices are obtained. They share a common feature with those obtained in this work: the adaptation to the unknown sparsity or rank is achieved without any additional penalty term. The constraint that the unknown parameter belongs to the simplex acts as a sparsity inducing penalty.
1.2 Additional notation
In what follows, for any , we denote by the vector and by the matrix . We also define , , so that the MLE is the minimizer of the function
[TABLE]
For any set of indices and any , we define as the -dimensional vector whose -th coordinate equals if and $0$ otherwise. We denote the cardinality of any by . For any set and any constant , we introduce the compatibility constants (van de Geer and Bühlmann, 2009) of a positive semidefinite matrix ,
[TABLE]
The risk bounds established in the present work involve the factors and . One can easily check that . We also recall that the compatibility constants of a matrix are bounded from below by the smallest eigenvalue of .
Let us fix a function and denote and for . In the results of this work, the compatibility factors are used for the empirical and population Gram matrices of vectors , that is when and with
[TABLE]
The general entries of these matrices are respectively and .
We assume that there exist positive constants and such that for all densities with , we have
[TABLE]
We use the notation . It is worth mentioning that the set of dictionaries satisfying simultaneously this boundedness assumption and the aforementioned compatibility condition is not empty. For instance, one can consider the functions for . These functions are probability densities w.r.t. the Lebesgue measure on . They are bounded from below and from above by and , respectively. Taking , the corresponding Gram matrix is , which has all eigenvalues equal to .
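The formulas of the example above did not survive extraction, so the following sketch checks the same kind of claim numerically on a hypothetical dictionary: the functions $f_k(x)=1+\tfrac12\sin(2\pi kx)$ on $[0,1]$, centred by the constant function equal to one, with the Gram matrix taken with respect to the Lebesgue measure. All of these choices, as well as the variable names, are assumptions and may differ from the authors' example.

```python
import numpy as np

K, m = 5, 200_000
x = (np.arange(m) + 0.5) / m                   # midpoint grid on [0, 1]
k = np.arange(1, K + 1)[:, None]
f = 1.0 + 0.5 * np.sin(2 * np.pi * k * x)      # dictionary values, shape (K, m)
f_bar = f - 1.0                                # components centred by the constant 1
gram = f_bar @ f_bar.T / m                     # midpoint-rule integral of f_bar_j * f_bar_k
eigvals = np.linalg.eigvalsh(gram)             # eigenvalues in ascending order
```

For this hypothetical dictionary the components take values between $1/2$ and $3/2$, each integrates to one, and the Gram matrix is a multiple of the identity, so its smallest eigenvalue (a lower bound on the compatibility constants) is strictly positive.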
1.3 Agenda
The rest of the paper is organized as follows. In Section 2, we state our main theoretical contributions and discuss their consequences. Possible relaxations of the conditions, as well as lower bounds showing the tightness of the established risk bounds, are considered in Section 3. A brief summary of the paper and some future directions of research are presented in Section 4. The proofs of all theoretical results are postponed to Section 5 and Section 6.
2 Oracle inequalities in deviation and in expectation
In this work, we prove several non-asymptotic risk bounds that imply, in particular, that the maximum likelihood estimator is optimal in model-selection aggregation, convex aggregation and $D$-sparse aggregation (up to $\log$-factors). In all the results of this section we assume the parameter $\mu$ in (4) to be equal to $0$.
Theorem 2.1.
Let be a set of densities satisfying the boundedness condition (12). Denote by the mixture density corresponding to the maximum likelihood estimator over defined in (8). There are constants , and such that, for any , the following inequalities hold
[TABLE]
with probability at least .
The proofs of this and the subsequent results stated in this section are postponed to Section 5. Comparing the two inequalities of the above theorem, one can notice two differences. First, the term proportional to is absent in the second risk bound, which means that the risk of the MLE is compared to that of the best mixture with a weight sequence supported by . Hence, this risk bound is weaker than the first one provided by (13). Second, the compatibility factor in (14) is larger than its counterpart in (13). This entails that in the cases where the oracle is expected to be sparse, the remainder term of the bound in (13) is slightly looser than that of (14).
A first and simple consequence of Theorem 2.1 is obtained by taking in the right-hand side of the first inequality. Then, and we get
[TABLE]
This implies that for every dictionary , without any assumption on the smallness of the coherence between its elements, the maximum likelihood estimator achieves the optimal rate of convex aggregation, up to a possible logarithmic correction, in the high-dimensional regime . (In fact, the optimal rate of convex aggregation when is of order $\big(\log(K/n^{1/2})/n\big)^{1/2}$. Therefore, even the term is optimal whenever for some .) In the case of regression with random design, an analogous result has been proved by Lecué and Mendelson (2013) and Lecué (2013). One can also remark that the upper bound in (15) is of the same form as the one of Theorem 1.2 stated in Section 1.1 above.
The main compelling feature of our results is that they show that the MLE adaptively achieves the optimal rate of aggregation not only in the case of convex aggregation, but also for model-selection aggregation and $D$-(convex) aggregation. To handle these two cases, it is more convenient to get rid of the compatibility factor of the empirical Gram matrix . The latter can be replaced by the compatibility factor of the population Gram matrix, as stated in the next result.
Theorem 2.2.
Let be a set of densities satisfying the boundedness condition (12). Denote by the mixture density corresponding to the maximum likelihood estimator over defined in (8). There are constants , and such that, for any , the following inequalities hold
[TABLE]
with probability at least .
The main advantage of the upper bounds provided by Theorem 2.2 as compared with those of Theorem 2.1 is that the former are deterministic, whereas the latter involve the compatibility factor of the empirical Gram matrix, which is random. The price to pay for getting rid of randomness in the risk bound is an increase in the values of the constants , and . Note, however, that this price is not too high, since obviously and, therefore, , and . In addition, the absence of randomness in the risk bound allows us to integrate it and to convert the bound in deviation into a bound in expectation.
Theorem 2.3 (Bound in Expectation).
Let be a set of densities satisfying the boundedness condition (12). Denote by the mixture density corresponding to the maximum likelihood estimator over defined in (8). There are constants , and such that
[TABLE]
In inequality (19), upper bounding the infimum over all sets by the infimum over the singletons, we get
[TABLE]
This implies that the maximum likelihood estimator achieves the rate in model-selection type aggregation. This rate is known to be optimal in the model of regression (Rigollet, 2012). If we compare this result with Theorem 1.1 stated in Section 1.1, we see that the remainder terms of these two oracle inequalities are of the same order (provided that the compatibility factor is bounded away from zero), but inequality (20) has the advantage of being exact.
We can also apply (19) to the problem of convex aggregation with a small dictionary, that is, for smaller than . Upper bounding by , we get
[TABLE]
Assuming, for instance, the smallest eigenvalue of bounded away from zero (which is a quite reasonable assumption in the context of low dimensionality), the above upper bound provides a rate of convex aggregation of the order of . Up to a logarithmic term, this rate is known to be optimal for convex aggregation in the model of regression.
Finally, considering all the sets of cardinality smaller than (with ) and setting , we deduce from (19) that
[TABLE]
According to (Rigollet and Tsybakov, 2011, Theorem 5.3), in the regression model, the optimal rate of $D$-sparse aggregation is of order , whenever . Inequality (22) shows that the maximum likelihood estimator over the simplex achieves this rate up to a logarithmic factor. Furthermore, this logarithmic inflation disappears when the sparsity $D$ is such that, asymptotically, the ratio is bounded from above by a constant . Indeed, in such a situation the optimal rate is of the same order as the remainder term in (22), that is .
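To get a feel for how the convex and sparse regimes compare, the sketch below evaluates the two headline rates, $((\log K)/n)^{1/2}$ and $(D\log K)/n$, and the sparsity level at which they coincide. Constants and compatibility factors are deliberately ignored, and the function names are illustrative assumptions.

```python
import math

def convex_rate(n, K):
    """Rate ((log K) / n)^(1/2) of convex aggregation, constants ignored."""
    return math.sqrt(math.log(K) / n)

def sparse_rate(n, K, D):
    """Rate (D log K) / n of D-sparse aggregation, constants ignored."""
    return D * math.log(K) / n

def crossover_sparsity(n, K):
    """Sparsity level at which the two rates above are equal."""
    return math.sqrt(n / math.log(K))

n, K = 10_000, 100
D_star = crossover_sparsity(n, K)
```

Below the crossover level the sparse rate is the smaller of the two; above it, sparsity brings no improvement over the convex rate, up to constants.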
3 Discussion of the conditions and possible extensions
In this section, we start by announcing lower bounds for the Kullback-Leibler aggregation in the problem of density estimation. Then we discuss the implication of the risk bounds of the previous section to the case where the target is the weight vector rather than the mixture density . Finally, we present some extensions to the case where the boundedness assumption is violated.
3.1 Lower bounds for nearly-$D$-sparse aggregation
As mentioned in the previous section, the literature is replete with lower bounds on the minimax risk for various types of aggregation. However, most of them concern the regression setting, either with random or with deterministic design. Lower bounds of aggregation for density estimation were first established by Rigollet (2006) for the -loss. In the case of Kullback-Leibler aggregation in density estimation, the only lower bounds we are aware of are those established by Lecué (2006) for model-selection type aggregation. It is worth emphasizing here that the results of the aforementioned two papers provide weak lower bounds. Indeed, they establish the existence of a dictionary for which the minimax excess risk is lower bounded by the suitable quantity. In contrast, we establish here strong lower bounds that hold for every dictionary satisfying the boundedness and the compatibility conditions.
Let be a dictionary of density functions on . We say that the dictionary satisfies the boundedness and the compatibility assumptions if for some positive constants and , we have for all , . In addition, we assume in this subsection that all the eigenvalues of the Gram matrix belong to the interval , with and .
For every and any , we define the set of nearly-$D$-sparse convex combinations of the dictionary elements by
[TABLE]
In simple words, belongs to if it admits a -approximately -sparse representation in the dictionary . We are interested in bounding from below the minimax excess risk
[TABLE]
where the is over all possible estimators of and the is over all density functions over . Note that the estimator is not necessarily a convex combination of the dictionary elements. Furthermore, it is allowed to depend on the parameters and characterizing the class . It follows from (18), that if the dictionary satisfies the boundedness and the compatibility condition, then
[TABLE]
for some constant depending only on and . Note that the last term accounts for the following phenomenon: If the sparsity index is larger than a multiple of , then the sparsity bears no advantage as compared to the constraint. The next result implies that this upper bound is optimal, at least up to logarithmic factors.
Theorem 3.1.
Assume that . Let and be fixed. There exists a constant depending only on , , and such that
[TABLE]
This is the first result providing lower bounds on the minimax risk of aggregation over nearly-$D$-sparse aggregates. To the best of our knowledge, even in the Gaussian sequence model, such a result has not been established to date. It has the advantage of unifying the results on convex and $D$-sparse aggregation, as well as extending them to a more general class. Let us also stress that the condition is natural and unavoidable, since it ensures that the right-hand side of (25) is smaller than the trivial bound .
3.2 Weight vector estimation
The risk bounds carried out in the previous section for the problem of density estimation in the Kullback-Leibler loss imply risk bounds for the problem of weight vector estimation. Indeed, under the boundedness assumption (12), the Kullback-Leibler divergence between two mixture densities can be shown to be equivalent to the squared Mahalanobis distance between the weight vectors of these mixtures with respect to the Gram matrix. In order to go from the Mahalanobis distance to the Euclidean one, we make use of the restricted eigenvalue
[TABLE]
This strategy leads to the next result.
Proposition 1.
Let be a set of densities satisfying condition (12). Denote by the mixture density corresponding to the maximum likelihood estimator over defined in (8). Let the weight-vector of the best mixture density: , and let be the support of . There are constants and such that, for any , the following inequalities hold
[TABLE]
with probability at least .
In simple words, this result tells us that the weight estimator attains the minimax rate of estimation over the intersection of the and balls, when the error is measured by the -norm, provided that the compatibility factor of the dictionary is bounded away from zero. The optimality of this rate, up to logarithmic factors, follows from the fact that the error of estimation of each nonzero coefficient of is at least (for some ), leading to a sum of the absolute values of the errors at least of the order . The logarithmic inflation of the rate is the price to pay for not knowing the support . It is clear that this reasoning is valid only when the sparsity is of smaller order than . Indeed, in the case , the trivial bound is tighter than the one in (28).
Concerning the risk measured by the Euclidean norm, we underline that there are two regimes characterized by the order between the upper bounds in (29) and (30). Roughly speaking, when the signal is highly sparse in the sense that is smaller than , then the smallest bound is given by (29) and is of the order . This rate can be compared to the rate , known to be optimal in the Gaussian sequence model. In the second regime corresponding to mild sparsity, , the smallest bound is the one in (30). The latter is of order , which is known to be optimal in the Gaussian sequence model. For various results providing lower bounds in the regression framework we refer the interested reader to (Raskutti et al., 2011; Rigollet and Tsybakov, 2011; Wang et al., 2014).
3.3 Extensions to the case of vanishing components
In the previous sections we have deliberately avoided any discussion of the role of the parameter , present in the search space of the problem (3)-(4). In fact, when all the dictionary elements are separated from zero by a constant , a condition assumed throughout previous sections, choosing any value of is equivalent to choosing . Therefore, the choice of this parameter does not impact the quality of estimation. However, this parameter might have strong influence in practice both on statistical and computational complexity of the maximum likelihood estimator. A first step in understanding the influence of on the statistical complexity is made in the next paragraphs.
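The role of the parameter can be made concrete with a small membership check. The displayed definition (4) is not recoverable here, so the sketch assumes the search space consists of the weight vectors of the simplex whose mixture density is at least $\mu$ at every observation; this assumed form, the function name `in_search_space` and the tolerances are illustrative, not taken from the paper.

```python
import numpy as np

def in_search_space(pi, F, mu, tol=1e-12):
    """Check that pi lies on the simplex and that min_i f_pi(X_i) >= mu.

    F[i, j] = f_j(X_i) holds the component values at the observations.
    The form of the constraint is an assumption (see the lead-in).
    """
    pi = np.asarray(pi, dtype=float)
    on_simplex = bool((pi >= -tol).all()) and abs(pi.sum() - 1.0) <= 1e-9
    return bool(on_simplex and (F @ pi).min() >= mu - tol)

# Toy example (assumption): every component equals 0.7 at every observation.
F = np.full((4, 3), 0.7)
pi = np.array([0.2, 0.3, 0.5])
```

When every component is bounded from below by a positive constant, every point of the simplex passes the check for any smaller value of the parameter, which is consistent with the remark that the choice does not matter in that case.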
Let us consider the case where the condition fails, but the upper-boundedness condition holds true. In such a situation, we replace the definition by . We also define the set $\Pi^{*}(\mu)=\big\{\bm{\pi}\in\mathbb{B}^{K}_{+}:P^{*}\big(f_{\bm{\pi}}(\bm{X})\ge\mu\big)=1\big\}$. In order to keep mathematical formulae simple, we will only state the equivalent of (14) in the case of . All the other results of the previous section can be extended in a similar way.
Proposition 2.
Let be a set of densities satisfying the boundedness condition . Denote by the mixture density corresponding to the maximum likelihood estimator over defined in (8). There is a constant such that, for any ,
[TABLE]
on an event of probability at least . Furthermore, if , then, on the same event, we have
[TABLE]
The last term present in the first upper bound is the price we pay for considering densities that are not lower bounded by a given constant. A simple, non-random upper bound on this term is . Providing a tight upper bound on this kind of remainder term is an important problem which lies beyond the scope of the present work.
4 Conclusion
In this paper, we have established exact oracle inequalities for the maximum likelihood estimator of a mixture density. These oracle inequalities clearly highlight the interplay of three sources of error: misspecification of the mixture model, departure from $D$-sparsity and stochastic error of estimating nonzero coefficients. We have also proved a lower bound that shows that the remainder terms of our upper bounds are optimal, up to logarithmic terms. This lower bound is valid not only for the maximum likelihood estimator, but for any estimator of the density function. As a consequence, the maximum likelihood estimator has a nearly optimal excess risk in the minimax sense.
In all the results of the present paper, we have assumed that the components of the mixture model are deterministic. From a practical point of view, it might be reasonable to choose these components in a data-driven way, using, for instance, a hold-out sample. This question, as well as the problem of tuning the parameter , constitute interesting and challenging avenues for future research.
5 Proofs of results stated in previous sections
This section collects the proofs of the theorems and claims stated in previous sections.
5.1 Proof of Theorem 2.1
The main technical ingredients of the proof are a strong convexity argument and a control of the maximum of an empirical process. The corresponding results are stated in Lemma 5.2 and 5.1, respectively, deferred to Section 5.6. We denote by the matrix .
Since is a minimizer of , see (3) and (8), we know that for every . However, this inequality can be made sharper using the (local) strong convexity of the function . Indeed, Lemma 5.2 below shows that
[TABLE]
On the other hand, if we set , we have and
[TABLE]
Combining inequalities (33) and (34), we get
[TABLE]
The next step of the proof consists in establishing a suitable upper bound on the noise term where
[TABLE]
According to the mean value theorem, setting $\zeta_{n}:=\sup_{\bar{\bm{\pi}}\in\bm{\Pi}_{n}}\big\|\nabla\Phi_{n}(\bar{\bm{\pi}})\big\|_{\infty}$, for every vector , it holds that
[TABLE]
This inequality, combined with (35), yields
[TABLE]
Using the Gram matrix , the quantity can be rewritten as
[TABLE]
We proceed with applying the following result (Bellec et al., 2016, Lemma 2).
Lemma 5.1 (Bellec et al. (2016), Lemma 2).
For any pair of vectors , for any pair of scalars and , for any symmetric matrix and for any set , the following inequality is true
[TABLE]
where .
Choosing , and (thus ) we get the inequality
[TABLE]
One can check that . Combining the last inequality with (38), we arrive at
[TABLE]
Since the last inequality holds for every , we can insert an in the right hand side. Furthermore, in view of 5.1 below, with probability larger than , is bounded from above by . This completes the proof of (13).
To prove (14), we follow the same steps as above up to inequality (38). Then, we remark that for every in the simplex satisfying , it holds
[TABLE]
Therefore, we have with probability at least
[TABLE]
Replacing the right hand term in (38) and taking the infimum, we get the claim of the corollary. Since, in view of 5.1 below, with probability larger than , is bounded from above by , we get the claim of (14).
5.2 Proof of Theorem 2.2
Let us denote . According to (38) and (39), we have
[TABLE]
As is the difference of two vectors lying on the simplex, we have . Let stand for the largest (in absolute value) element of the matrix . We have
[TABLE]
Setting , we get
[TABLE]
Following the same steps as those used for obtaining (42), we arrive at
[TABLE]
The last step consists in evaluating the quantiles of the random variable . To this end, one checks that the Hoeffding inequality combined with the union bound yields
[TABLE]
In other terms, for every , we have
[TABLE]
Note that for , we have . Combining with 5.1, this implies that $\bar{\zeta}_{n}\le(8V^{3}+1)\big(\log(K/\delta)/n\big)^{1/2}$ with probability larger than . This completes the proof of (16). The proof of (17) is omitted since it repeats the same arguments as those used for proving (14).
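The combination of the Hoeffding inequality with the union bound used in this proof follows a standard pattern, sketched here in generic form. The variables $Z_i^{(j,k)}$ (independent across $i$ and taking values in an interval $[a,b]$), the threshold $t$ and the resulting constants are placeholders; they are not the quantities or constants of the displayed inequalities.

```latex
\begin{align*}
\mathbb{P}\bigg(\Big|\frac1n\sum_{i=1}^n Z_i^{(j,k)}-\mathbb{E}\,Z_1^{(j,k)}\Big|\ge t\bigg)
  &\le 2\exp\Big(-\frac{2nt^2}{(b-a)^2}\Big)
  && \text{(Hoeffding, fixed } j,k\text{)},\\
\mathbb{P}\bigg(\max_{1\le j,k\le K}\Big|\frac1n\sum_{i=1}^n Z_i^{(j,k)}-\mathbb{E}\,Z_1^{(j,k)}\Big|\ge t\bigg)
  &\le 2K^2\exp\Big(-\frac{2nt^2}{(b-a)^2}\Big)
  && \text{(union bound over } K^2 \text{ pairs)}.
\end{align*}
```

Setting the right-hand side equal to $\delta$ and solving for $t$ gives $t=(b-a)\big(\log(2K^2/\delta)/(2n)\big)^{1/2}$, which is of order $\big(\log(K/\delta)/n\big)^{1/2}$, the same order as the deviation terms appearing in this proof.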
5.3 Proof of Theorem 2.3
According to (51), for any and any , we have
[TABLE]
Recall now that and, according to 5.1, we have
[TABLE]
Using Theorem 6.2, one easily checks that
[TABLE]
This implies that
[TABLE]
Similarly, in view of the Efron-Stein inequality, we have . This implies that
[TABLE]
Combining (57), (60) and (54), we get the desired result.
5.4 Proof of Proposition 1
Using the strong convexity of the function over the interval and the fact that minimizes the convex function , we get
[TABLE]
Combining with (50), in which we replace by , we get
[TABLE]
Let us set . If , then the claims are trivial. In the rest of this proof, we assume . In view of (43), we have . Therefore, using the definition of the compatibility factor, we get
[TABLE]
We have already checked that $\bar{\zeta}_{n}\le(8V^{3}+1)\big(\log(K/\delta)/n\big)^{1/2}$ with probability larger than . Dividing both sides of inequality (63) by and using the aforementioned upper bound on , we get the desired bound on .
In order to bound the error in the Euclidean norm, we denote by the set of indices corresponding to largest entries of the vector . Since , we clearly have . Therefore,
[TABLE]
Combining this inequality with the definition of the restricted eigenvalue and inequality (62) above, we arrive at
[TABLE]
Dividing both sides by , taking the square and using (67), we get
[TABLE]
This inequality, in conjunction with the upper bound on used above, completes the proof of the second claim.
5.5 Proof of Proposition 2
We repeat the proof of Theorem 2.1 with some small modifications. First of all, we replace the function by the function
[TABLE]
One easily checks that this function is twice continuously differentiable with a second derivative satisfying for every . Furthermore, since for every , we have , where we have used the notation . Therefore, similarly to (33), we get
[TABLE]
for every . Let us define and . We have
[TABLE]
Notice that implies that and that . Therefore, along the lines of the proof of (14) (see, namely, (46)), we get
[TABLE]
We can repeat now the arguments of 5.1 with some minor modifications. We first rewrite as with . One checks that the bounded difference inequality and the Efron-Stein inequality can be applied with an additional factor 2, since for , we have
[TABLE]
Therefore, for every , with probability larger than , we have and . By the union bound, we obtain that with probability larger than , . Thus, to upper bound , we use the symmetrization argument:
[TABLE]
Note that the function , the derivative of defined in (70), is by construction Lipschitz with constant . Therefore, in view of the contraction principle,
[TABLE]
As a consequence, we have proved that with probability larger than , we have . This completes the proof of the first inequality. In order to prove the second one, we simply change the way we have evaluated the term in the left-hand side of (72). Since is strongly convex with a second-order derivative bounded from below by , we have . Since is always larger than , the derivative equals . Integrating over , we get the second inequality of the proposition.
5.6 Auxiliary results
We start with a general convexity result, based on the strong convexity of the function , which yields a bound on the estimated log-likelihood.
Lemma 5.2.
Let us assume that . Then, for any , it holds that
[TABLE]
Proof.
Recall that minimizes the function defined in (8) over . Furthermore, the function is clearly strongly convex with a second order derivative bounded from below by over the set . Therefore, for every , the function given by:
[TABLE]
is convex. This implies that the mapping
[TABLE]
is convex over the set . This yields444We denote by the sub-differential of a convex function .
[TABLE]
Using the Karush-Kuhn-Tucker conditions and the fact that minimizes , we get . This readily gives , for any . The last step is to remark that , since both and have entries summing to one. ∎
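To make the first-order (Karush-Kuhn-Tucker) viewpoint concrete, here is a minimal sketch of weight-only maximum likelihood over the whole simplex with a fixed dictionary. Everything specific in it is an assumption for illustration: the Gaussian dictionary, the sparse true weights, and the use of plain EM iterations as the optimizer (the paper does not prescribe an algorithm, and its constraint set may differ from the full simplex). The quantity g_k = (1/n) Σ_i f_k(X_i)/f_π(X_i) is the score of the k-th weight; for any π on the simplex one has Σ_k π_k g_k = 1, and at a maximizer over the simplex the standard KKT conditions give g_k ≤ 1 with equality on the support.

```python
import math
import random


def gaussian_pdf(x, mean, sd=1.0):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def mixture_weight_scores(F, weights):
    """g_k = (1/n) * sum_i f_k(X_i) / f_pi(X_i), where F[i][k] = f_k(X_i)."""
    n, K = len(F), len(weights)
    scores = [0.0] * K
    for row in F:
        mix = sum(w * f for w, f in zip(weights, row))
        for k in range(K):
            scores[k] += row[k] / (n * mix)
    return scores


def neg_log_likelihood(F, weights):
    return -sum(math.log(sum(w * f for w, f in zip(weights, row))) for row in F) / len(F)


def em_weights(F, iterations=200):
    """EM for the weights only (dictionary fixed): pi_k <- pi_k * g_k."""
    K = len(F[0])
    weights = [1.0 / K] * K
    history = [neg_log_likelihood(F, weights)]
    for _ in range(iterations):
        scores = mixture_weight_scores(F, weights)
        weights = [w * g for w, g in zip(weights, scores)]
        history.append(neg_log_likelihood(F, weights))
    return weights, history


def simulate(n=300, seed=0):
    """Matrix F[i][k] = f_k(X_i) for a hypothetical Gaussian dictionary."""
    rng = random.Random(seed)
    means = [-3.0, -1.0, 0.0, 1.0, 3.0]       # hypothetical component means
    true_weights = [0.7, 0.0, 0.0, 0.0, 0.3]  # hypothetical sparse truth
    sample = [rng.gauss(rng.choices(means, true_weights)[0], 1.0) for _ in range(n)]
    return [[gaussian_pdf(x, m) for m in means] for x in sample]
```

The update keeps the weights on the simplex because Σ_k π_k g_k = 1, and the negative log-likelihood recorded in `history` does not increase along the iterations.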
The core of our results lies in the following proposition, which bounds the deviations of the empirical process part.
Proposition 5.1 (Supremum of Empirical Process).
For any and , define and consider . If , then for any , with probability at least , we have
[TABLE]
Furthermore, we have $\mathbf{E}[\zeta_{n}]\leq 4V^{3}\big(\frac{2\log(2K^{2})}{n}\big)^{1/2}$ and .
Proof.
To ease notation, let us denote $g_{\bm{\pi},l}(x)=\frac{f_{l}(x)}{f_{\bm{\pi}}(x)}-\mathbf{E}\big[\frac{f_{l}(\bm{X})}{f_{\bm{\pi}}(\bm{X})}\big]$ and
[TABLE]
where . To derive a bound on , we will use the McDiarmid concentration inequality that requires the bounded difference condition to hold for . For some , let be a new sample obtained from by modifying the -th element and by leaving all the others unchanged. Then, we have
[TABLE]
where the last inequality is a direct consequence of assumption (12). Therefore, using the McDiarmid concentration inequality recalled in Theorem 6.3 below, we check that the inequality
[TABLE]
holds with probability at least . Furthermore, in view of the Efron-Stein inequality, we have
[TABLE]
Let us denote and the Rademacher complexity of given by
[TABLE]
with independent and identically distributed Rademacher random variables independent of . Using the symmetrization inequality (see, for instance, Theorem 2.1 in Koltchinskii (2011)) we have
[TABLE]
Lemma 5.3.
The Rademacher complexity defined in (93) satisfies
[TABLE]
Proof.
The proof relies on the contraction principle of Ledoux and Talagrand (1991), which we recall in Appendix A (Concentration inequalities) for convenience. We apply this principle to the random variables and to the function . Clearly is Lipschitz on with the Lipschitz constant equal to and . Therefore
[TABLE]
Expanding we obtain
[TABLE]
We now apply Theorem 6.2 with , , , and $Y_{i,s}=\epsilon_{i}\big(\frac{f_{k}(\bm{X}_{i})}{f_{l}(\bm{X}_{i})}-1\big)$. This yields
[TABLE]
This completes the proof of the lemma. ∎
Combining inequalities (91) and (94) with Lemma 5.3, we get that the inequality
[TABLE]
holds with probability at least . Noticing that and, for , we have , we get the first claim of the proposition. The second claim is a direct consequence of Lemma 5.3 and (94). ∎
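The symmetrization argument reduces the expected supremum to a Rademacher complexity, which for a finite index set is controlled by a maximal inequality of order (log(number of elements)/n)^{1/2}. The sketch below compares a Monte Carlo estimate of the Rademacher complexity of a finite family of vectors with the classical Massart-type bound. The family of vectors is hypothetical, and the bound used is the textbook finite-class inequality rather than the exact constants of Lemma 5.3.

```python
import math
import random


def rademacher_complexity_mc(vectors, draws=2000, seed=0):
    """Monte Carlo estimate of E max_s |(1/n) sum_i eps_i * a_{s,i}|."""
    rng = random.Random(seed)
    n = len(vectors[0])
    total = 0.0
    for _ in range(draws):
        eps = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        total += max(abs(sum(e * a for e, a in zip(eps, vec))) for vec in vectors) / n
    return total / draws


def finite_class_bound(vectors):
    """Massart-type bound: max_s ||a_s||_2 * sqrt(2 log(2S)) / n.

    The factor 2S accounts for the absolute value (the family is
    symmetrized by adding the negatives of all vectors).
    """
    n, S = len(vectors[0]), len(vectors)
    radius = max(math.sqrt(sum(a * a for a in vec)) for vec in vectors)
    return radius * math.sqrt(2.0 * math.log(2.0 * S)) / n


def random_family(S=30, n=50, seed=1):
    """Hypothetical family of S vectors in [-1, 1]^n, used only for illustration."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(S)]
```

For bounded entries the radius is at most n^{1/2}, so the bound is of order (log(2S)/n)^{1/2}, which mirrors the logarithmic dependence on the dictionary size in the proposition.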
6 Proof of the lower bound for nearly-D-sparse aggregation
We prove the minimax lower bound for estimation in Kullback-Leibler risk using the following slightly adapted version of Theorem 2.5 from Tsybakov (2009). Throughout this section, we denote by and , respectively, the smallest and the largest eigenvalue of all principal minors of the matrix .
Theorem 6.1.
For some integer assume that contains elements satisfying the following two conditions.
- (i) , for all pairs such that .
- (ii) For product densities defined on by it holds
[TABLE]
Then
[TABLE]
To establish the bound claimed in Theorem 3.1, we will split the problem into two parts, corresponding to the following two subsets of
[TABLE]
We will show that over , we have a lower bound of order while over , a lower bound of order $\big[\frac{\gamma^{2}}{n}\log\big(1+K/(\gamma\sqrt{n})\big)\big]^{1/2}$ holds true. Therefore, the lower bound over is larger than the average of these bounds.
For any and , let be the subset of defined by
[TABLE]
Before starting, we recall a version of the Varshamov-Gilbert lemma (see, for instance, Rigollet and Tsybakov (2011, Lemma 8.3)), which will be helpful for deriving our lower bounds.
Lemma 6.1.
Let and be two integers. Then there exist a subset and an absolute constant such that
[TABLE]
and satisfies and
[TABLE]
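The Varshamov-Gilbert lemma is an existence statement about well-separated sparse binary vectors. For small dimensions such a family can be built explicitly by a greedy search, as in the following sketch. The separation level `min_distance` is a free, hypothetical parameter here; the specific thresholds and the constant of the lemma are not reproduced.

```python
from itertools import combinations


def hamming(u, v):
    """Number of coordinates in which u and v differ."""
    return sum(a != b for a, b in zip(u, v))


def greedy_sparse_packing(K, d, min_distance):
    """Greedily collect binary vectors of length K with exactly d ones whose
    pairwise Hamming distances are all at least min_distance."""
    packing = []
    for support in combinations(range(K), d):
        candidate = tuple(1 if j in support else 0 for j in range(K))
        if all(hamming(candidate, kept) >= min_distance for kept in packing):
            packing.append(candidate)
    return packing
```

For instance, `greedy_sparse_packing(10, 3, 4)` returns 3-sparse vectors in {0,1}^10 whose supports pairwise share at most one index. The greedy family is not guaranteed to be of maximal size; the lemma only needs the existence of a family whose log-cardinality grows like d log(1+eK/d).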
We will also use the following lemma that allows us to relate the KL-divergence to the Euclidean distance between the weight vectors and .
Lemma 6.2.
If the dictionary satisfies the boundedness assumption (12), then for any we have
[TABLE]
Proof.
Using the Taylor expansion, one can check that for any , we have . Therefore,
[TABLE]
Since satisfies the boundedness assumption, we get
[TABLE]
The claim of the lemma follows from these inequalities and the fact that $\int_{\mathcal{X}}\big(f_{\bm{\pi}'}-f_{\bm{\pi}}\big)^{2}\,d\nu=\|\bm{\Sigma}^{1/2}(\bm{\pi}'-\bm{\pi})\|_{2}^{2}$. ∎
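The mechanism of the lemma can be checked on a toy example: the squared L2 distance between two mixtures is a quadratic form of the Gram matrix in the weight difference, and the KL divergence is dominated by the chi-square divergence, which is in turn at most the squared L2 distance divided by a lower bound on the denominator density. The sketch below uses a hypothetical dictionary of three densities on a four-point space (counting measure) and the generic inequality KL ≤ χ²; it does not reproduce the constants of Lemma 6.2.

```python
import math

# Hypothetical dictionary: three densities on a four-point sample space.
DICTIONARY = [
    [0.4, 0.3, 0.2, 0.1],
    [0.1, 0.2, 0.3, 0.4],
    [0.25, 0.25, 0.25, 0.25],
]
PI = [0.5, 0.3, 0.2]        # hypothetical weight vector pi
PI_PRIME = [0.1, 0.3, 0.6]  # hypothetical weight vector pi'


def mixture(weights, dictionary):
    """f_pi(x) = sum_k pi_k f_k(x) for every point x of the sample space."""
    size = len(dictionary[0])
    return [sum(w * f[x] for w, f in zip(weights, dictionary)) for x in range(size)]


def gram_matrix(dictionary):
    """Sigma_{kl} = sum_x f_k(x) f_l(x), the Gram matrix for the counting measure."""
    return [[sum(a * b for a, b in zip(fk, fl)) for fl in dictionary] for fk in dictionary]


def quadratic_form(matrix, u):
    """u^T * matrix * u."""
    return sum(u[k] * matrix[k][l] * u[l] for k in range(len(u)) for l in range(len(u)))


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) between discrete densities."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def chi_square(p, q):
    """Chi-square divergence sum_x (p - q)^2 / q."""
    return sum((a - b) ** 2 / b for a, b in zip(p, q))
```

With `p = mixture(PI, DICTIONARY)` and `q = mixture(PI_PRIME, DICTIONARY)`, the chain `kl(p, q) <= chi_square(p, q) <= quadratic_form(gram_matrix(DICTIONARY), diff) / min(q)` holds, where `diff` is the coordinate-wise difference of the two weight vectors.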
6.1 Lower bound on
We show here that the lower bound $\frac{D}{n}\log\big(1+\frac{eK}{D}\big)\wedge\big(\frac{1}{n}\log\big(1+\frac{K}{\sqrt{n}}\big)\big)^{1/2}$ holds when we consider the worst case error for belonging to the set .
Proposition 3.
If then, for the constant
[TABLE]
we have
[TABLE]
Proof.
We assume that . The case can be reduced to the case by using the inclusion . Let us set and denote by the largest integer such that
[TABLE]
According to Lemma 6.1, there exists a subset of of cardinality satisfying such that for any pair of distinct elements , we have . Using these binary vectors , we define the set as follows:
[TABLE]
Clearly, for every , the vectors belong to . Furthermore, for any pair of distinct values , we have . In view of Lemma 6.2, this yields
[TABLE]
Let us choose
[TABLE]
It follows from (114) that . Inserting this value of in (116), we get
[TABLE]
This shows that condition (i) of Theorem 6.1 is satisfied with . For the second condition of the same theorem, we have
[TABLE]
since one can check that . Therefore, using the definition of , we get
[TABLE]
Theorem 6.1 implies that
[TABLE]
We use the fact that is the largest integer satisfying (114). Therefore, either or
[TABLE]
If , then the claim of the proposition follows from (124), since . On the other hand, if (125) is true, then
[TABLE]
In addition, implies that . Combining the last two inequalities, we get the inequality $d\log(1+eK/d)\geq\frac{1}{2}\big(A_{1}n\log(1+eK/\sqrt{A_{1}n})\big)^{1/2}\geq\big(n\log(1+eK/\sqrt{n})\big)^{1/2}$. Therefore, in view of (124), we get the claim of the proposition. ∎
6.2 Lower bound on
The next result shows that the lower bound $\big[\frac{\gamma^{2}}{n}\log\big(1+\frac{K}{\gamma\sqrt{n}}\big)\big]^{1/2}$ holds for the worst case error when belongs to the set .
Proposition 4.
Assume that
[TABLE]
Then, for the constant , it holds that
[TABLE]
Proof.
Let be a constant the precise value of which will be specified later. Denote by the largest integer satisfying
[TABLE]
Note that in view of the condition of the proposition. This readily implies that and, therefore,
[TABLE]
Let us first consider the case . According to Lemma 6.1, there exists a subset of cardinality satisfying $\log L\geq C_{1}d\log\big(1+\frac{e(K-1)}{d}\big)$ and for any pair of distinct elements taken from . With these binary vectors in hand, we define the set of cardinality as follows:
[TABLE]
It is clear that all the vectors of belong to . Let us fix now an element of and denote it by , the corresponding element of being denoted by . We have
[TABLE]
The definition of yields , which implies that
[TABLE]
Combined with (134), this implies that
[TABLE]
Choosing
[TABLE]
we get that $\max_{\bm{\pi}\in\mathcal{D}}{\rm KL}(f^{n}_{\bm{\pi}}||f^{n}_{\bm{\pi}^{1}})\leq\frac{1}{16}C_{1}d\log\big(1+e(K-1)/d\big)\leq\frac{\log L}{16}$.
Furthermore, for any , in view of Lemma 6.2 and (130), we have
[TABLE]
Since , this implies that Theorem 6.1 can be applied, which leads to the inequality
[TABLE]
To complete the proof of the proposition, we have to consider the case . In this case, we can repeat all the previous arguments for and get the desired inequality. ∎
6.3 Lower bound holding for all densities
Now that we have lower bounds in probability for and , we can derive a lower bound in expectation for . In particular, to prove Theorem 3.1, we will use the inequality
[TABLE]
Proof of Theorem 3.1.
To ease notation, let us define
[TABLE]
We first consider the case where the dominating term is the first one, that is
[TABLE]
On the one hand, since , we have
[TABLE]
On the other hand, using the inequality , we get
[TABLE]
Combining (144), (145) and (148), we get
[TABLE]
This implies that we can apply Proposition 4, which yields
[TABLE]
In view of (144), this implies that
[TABLE]
We now consider the second case, where the dominating term in the rate is the second one, that is
[TABLE]
In view of Proposition 3, we have
[TABLE]
In view of (152), we get
[TABLE]
Thus, we have proved that implies that $\inf_{\widehat{f}}\sup_{f\in\mathcal{H}_{\mathcal{F}}(\gamma,D)}\mathbf{P}_{f}\big({\rm KL}(f||\widehat{f})\geq C_{4}\,r(n,K,\gamma,D)\big)\geq 0.17$ for some constant , whatever the relation between and . The desired lower bound now follows from the Tchebychev (Markov) inequality $\mathbf{E}\big[{\rm KL}(f||\widehat{f})\big]\geq C_{4}\,r(n,K,\gamma,D)\,\mathbf{P}_{f}\big({\rm KL}(f||\widehat{f})\geq C_{4}\,r(n,K,\gamma,D)\big)$. ∎
Appendix A: Concentration inequalities
This section contains some well-known results, which are recalled here to keep the paper self-contained.
Theorem 6.2.
For each , let be independent and zero mean random variables such that for some real numbers we have for all and . Then, we have
[TABLE]
Proof.
We denote for and for . For every , the logarithmic moment generating function satisfies
[TABLE]
where the last inequality is a consequence of the Hoeffding lemma (see, for instance, Lemma 2.2 in (Boucheron et al., 2013)). This means that is sub-Gaussian with variance-factor . Therefore, Theorem 2.5 from (Boucheron et al., 2013) yields , which completes the proof. ∎
We group and state together the bounded differences and the Efron-Stein inequalities (Boucheron et al. (2013), Theorems 6.2 and 3.1, respectively).
Theorem 6.3.
Assume that a function f satisfies the bounded difference condition: there exist constants , such that for all , all and where only the vector is changed
[TABLE]
Denote
[TABLE]
Let where are independent. Then, for every ,
[TABLE]
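As an illustration of how the bounded differences and Efron-Stein inequalities are used, the sketch below simulates a supremum-type statistic with bounded differences c_i = 1/n and compares its empirical variance and upper tail with the classical bounds (1/4) Σ c_i² and exp(-2t²/Σ c_i²). The statistic (a grid version of the uniform Kolmogorov-Smirnov deviation) is a hypothetical stand-in for the empirical process of the paper, and the empirical mean replaces the unknown expectation.

```python
import math
import random


def sup_deviation(sample, grid):
    """Z = max_s |F_n(s) - s| for a sample from [0, 1]; changing a single
    observation moves Z by at most c_i = 1/n."""
    n = len(sample)
    return max(abs(sum(x <= s for x in sample) / n - s) for s in grid)


def simulate_sup_deviation(n=50, trials=500, seed=0):
    """Independent replications of Z for uniform samples of size n."""
    rng = random.Random(seed)
    grid = [j / 20 for j in range(1, 20)]
    return [sup_deviation([rng.random() for _ in range(n)], grid) for _ in range(trials)]


def check_bounds(values, n, t):
    """Empirical variance and tail of Z next to the theoretical bounds."""
    mean = sum(values) / len(values)  # stand-in for E[Z] (an approximation)
    variance = sum((z - mean) ** 2 for z in values) / len(values)
    tail = sum(z >= mean + t for z in values) / len(values)
    return {
        "variance": variance,
        "efron_stein_bound": 1.0 / (4.0 * n),           # (1/4) * sum_i c_i^2, c_i = 1/n
        "tail": tail,
        "mcdiarmid_bound": math.exp(-2.0 * n * t * t),  # exp(-2 t^2 / sum_i c_i^2)
    }
```

Calling `check_bounds(simulate_sup_deviation(), n=50, t=0.15)` returns the four quantities; both bounds are typically far from tight for this statistic, which is consistent with their role as worst-case guarantees.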
Next we state the contraction principle of (Ledoux and Talagrand, 1991); a proof can be found in (Boucheron et al. (2013), Theorem 11.6).
Theorem 6.4.
Let be vectors whose real-valued components are indexed by , that is, . For each let be a -Lipschitz function such that . Let be independent Rademacher random variables, and let be a non-decreasing convex function. Then
[TABLE]
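For a small number of Rademacher variables the contraction principle can be verified exactly by enumerating all sign patterns. The following sketch does so for a hypothetical finite set of vectors, with the contraction tanh (1-Lipschitz, vanishing at 0) and with the identity or the exponential as the non-decreasing convex function. It checks the one-sided form of the inequality, without absolute values.

```python
import math
from itertools import product

# Hypothetical finite index set: four vectors t = (t_1, ..., t_6).
T_VECTORS = [
    (0.5, -1.2, 2.0, 0.3, -0.7, 1.1),
    (-0.9, 0.4, 0.0, 1.5, 0.8, -2.0),
    (1.3, 1.3, -0.6, -0.2, 0.1, 0.9),
    (0.0, -0.5, 0.7, -1.8, 1.0, 0.2),
]


def expected_sup(vectors, phi=lambda u: u, psi=lambda u: u):
    """Exact E[psi(max_t sum_i eps_i * phi(t_i))] over all 2^n sign patterns."""
    n = len(vectors[0])
    total = 0.0
    for eps in product((-1.0, 1.0), repeat=n):
        total += psi(max(sum(e * phi(t_i) for e, t_i in zip(eps, t)) for t in vectors))
    return total / 2 ** n
```

Here `expected_sup(T_VECTORS, phi=math.tanh)` should not exceed `expected_sup(T_VECTORS)`, and the same ordering holds after passing `psi=math.exp` to both calls.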
Acknowledgments
The work of M.S. was partially supported by the French “Agence Nationale de la Recherche”, CIFRE no 2014/0517, and by ARTEFACT (www.artefact.is). The work of A.D. was partially supported by the grant Investissements d’Avenir (ANR-11-IDEX-0003/Labex Ecodec/ANR-11-LABX-0047) and the chair “LCL/GENES/Fondation du risque, Nouveaux enjeux pour nouvelles données”.
References
- Bellec [2014] P. C. Bellec. Optimal exponential bounds for aggregation of density estimators. Technical report, arXiv:1405.3907, May 2014.
- Bellec et al. [2016] P. C. Bellec, A. S. Dalalyan, E. Grappin, and Q. Paris. On the prediction loss of the lasso in the partially labeled setting. Technical report, arXiv:1606.06179, June 2016.
- Bertin et al. [2011] K. Bertin, E. Le Pennec, and V. Rivoirard. Adaptive Dantzig density estimation. Ann. Inst. Henri Poincaré Probab. Stat., 47(1):43–74, 2011.
- Boucheron et al. [2013] S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. OUP Oxford, 2013. ISBN 9780199535255.
- Bunea et al. [2007] F. Bunea, A. B. Tsybakov, and M. H. Wegkamp. Aggregation for Gaussian regression. Ann. Statist., 35(4):1674–1697, 2007.
- Bunea et al. [2010] F. Bunea, A. B. Tsybakov, M. H. Wegkamp, and A. Barbu. SPADES and mixture models. Ann. Statist., 38(4):2525–2558, 2010.
- Butucea et al. [2016] C. Butucea, J.-F. Delmas, A. Dutfoy, and R. Fischer. Optimal exponential bounds for aggregation of estimators for the Kullback-Leibler loss. Technical report, arXiv:1601.05686, January 2016.
- Catoni [1997] O. Catoni. The mixture approach to universal model selection. Technical report, 1997. URL http://cds.cern.ch/record/461892.
