Optimal Kullback-Leibler Aggregation in Mixture Density Estimation by Maximum Likelihood

Arnak S. Dalalyan; Mehdi Sebbar

arXiv:1701.05009·math.ST·January 19, 2017

TL;DR

This paper analyzes the maximum likelihood estimator for mixture density estimation, establishing risk bounds and optimal rates under Kullback-Leibler loss, especially in high-dimensional and sparse settings.

Contribution

It provides sharp oracle inequalities and optimal convergence rates for the MLE in mixture models, including sparse and high-dimensional cases.

Findings

01

MLE attains the optimal rate ((log K)/n)^{1/2} in convex aggregation.

02

Under compatibility conditions, the estimator achieves the optimal sparse rate (D log K)/n.

03

Introduces nearly-D-sparse aggregation and matching lower bounds.

Equations

$$f_{\bm{\pi}}(\bm{x})=\sum_{j=1}^{K}\pi_{j}f_{j}(\bm{x}),\quad\bm{\pi}\in\mathbb{B}_{+}^{K}=\Big\{\bm{\pi}\in[0,1]^{K}:\sum_{j=1}^{K}\pi_{j}=1\Big\}.$$

$$\widehat{\bm{\pi}}\in\operatorname*{arg\,min}_{\bm{\pi}\in\Pi}\Big\{-\frac{1}{n}\sum_{i=1}^{n}\log f_{\bm{\pi}}(\bm{X}_{i})\Big\},$$

$$\Pi_{n}(\mu)=\bigg\{\bm{\pi}\in\mathbb{B}_{+}^{K}:\min_{i\in[n]}\sum_{j=1}^{K}\pi_{j}f_{j}(\bm{X}_{i})\geq\mu\bigg\}.$$

$${\rm KL}(f^{*}||f_{\widehat{\bm{\pi}}})=\begin{cases}\int_{\mathcal{X}}f^{*}(\bm{x})\,\log\frac{f^{*}(\bm{x})}{f_{\widehat{\bm{\pi}}}(\bm{x})}\,\nu(d\bm{x}),&\text{if }P^{*}\big(f^{*}(\bm{X})=0\text{ and }f_{\widehat{\bm{\pi}}}(\bm{X})>0\big)=0,\\ +\infty,&\text{otherwise},\end{cases}$$

$$\mathbf{E}_{f^{*}}\big[{\rm KL}\big(f^{*}||\widehat{f}^{\rm ML}_{\mathcal{F}}\big)\big]\leq\big(2+\log V\big)\bigg(\min_{f\in\mathcal{F}}{\rm KL}(f^{*}||f)+\frac{2\log K}{n}\bigg).$$

$${\rm KL}\big(f^{*}||\widehat{f}^{\rm ML}_{\mathcal{C}_{k}}\big)\leq\min_{f\in\mathcal{C}_{k}}{\rm KL}(f^{*}||f)+C\Big(\frac{\log(K/\delta)}{n}\Big)^{\nicefrac{1}{2}}$$

$$L_{n}(\bm{\pi})=\frac{1}{n}\sum_{i=1}^{n}\ell\big(\bm{Z}_{i}^{\top}\bm{\pi}\big).$$

$$\kappa_{\mathbf{A}}(J,c)$$

$$\bar{\kappa}_{\mathbf{A}}(J,c)$$

$$\widehat{\bm{\Sigma}}_{n}=\frac{1}{n}\sum_{i=1}^{n}\bar{\bm{Z}}_{i}\bar{\bm{Z}}_{i}^{\top},\qquad\bm{\Sigma}=\mathbf{E}\big[\bar{\bm{Z}}_{1}\bar{\bm{Z}}_{1}^{\top}\big].$$

$$\forall\,\bm{x}\in\mathcal{X},\qquad m\leq f_{k}(\bm{x})\leq M.$$

$${\rm KL}(f^{*}||f_{\bm{\pi}})$$

$${\rm KL}(f^{*}||f_{\widehat{\bm{\pi}}})\leq\inf_{\bm{\pi}\in\mathbb{B}_{+}^{K}}{\rm KL}(f^{*}||f_{\bm{\pi}})+c_{1}\Big(\frac{\log(K/\delta)}{n}\Big)^{\nicefrac{1}{2}}.$$

$$\mathbf{E}[{\rm KL}(f^{*}||f_{\bm{\pi}})]$$

$$\mathbf{E}[{\rm KL}(f^{*}||f_{\widehat{\bm{\pi}}})]\leq\inf_{j\in[K]}\bigg\{{\rm KL}(f^{*}||f_{j})+\frac{c_{9}\log K}{n\bar{\kappa}_{\bm{\Sigma}}(J,1)}\bigg\}.$$

$$\mathbf{E}[{\rm KL}(f^{*}||f_{\widehat{\bm{\pi}}})]\leq\inf_{\bm{\pi}\in\mathbb{B}_{+}^{K}}{\rm KL}(f^{*}||f_{\bm{\pi}})+\frac{c_{9}K\log K}{n\bar{\kappa}_{\bm{\Sigma}}([K],1)}.$$

$$\mathbf{E}[{\rm KL}(f^{*}||f_{\widehat{\bm{\pi}}})]\leq\inf_{\bm{\pi}\in\mathbb{B}_{+}^{K}:\|\bm{\pi}\|_{0}\leq D}{\rm KL}(f^{*}||f_{\bm{\pi}})+\frac{c_{9}D\log K}{n\bar{\kappa}_{\bm{\Sigma}}(D,1)}.$$

$$\mathcal{H}_{\mathcal{F}}(\gamma,D)=\Big\{f_{\bm{\pi}}:\bm{\pi}\in\mathbb{B}^{K}_{+}\text{ such that }\exists\,J\subset[K]\text{ with }\|\bm{\pi}_{J^{c}}\|_{1}\leq\gamma\text{ and }|J|\leq D\Big\}.$$

$$\mathcal{R}\big(\mathcal{H}_{\mathcal{F}}(\gamma,D)\big)=\inf_{\widehat{f}}\sup_{f^{*}}\Big\{\mathbf{E}[{\rm KL}(f^{*}||\,\widehat{f}\,)]-\inf_{f_{\bm{\pi}}\in\mathcal{H}_{\mathcal{F}}(\gamma,D)}{\rm KL}(f^{*}||f_{\bm{\pi}})\Big\},$$

$$\mathcal{R}\big(\mathcal{H}_{\mathcal{F}}(\gamma,D)\big)\leq C\Big\{\Big(\frac{\gamma^{2}\log K}{n}\Big)^{\nicefrac{1}{2}}+\frac{D\log K}{n}\Big\}\bigwedge\Big(\frac{\log K}{n}\Big)^{\nicefrac{1}{2}},$$

$$\mathcal{R}(\mathcal{H}_{\mathcal{F}}(\gamma,D))\geq A\bigg\{\bigg[\frac{\gamma^{2}}{n}\log\bigg(1+\frac{K}{\gamma\sqrt{n}}\bigg)\bigg]^{\nicefrac{1}{2}}+\frac{D\log(1+K/D)}{n}\bigg\}\bigwedge\bigg[\frac{1}{n}\log\bigg(1+\frac{K}{\sqrt{n}}\bigg)\bigg]^{\nicefrac{1}{2}}.$$

$$\kappa^{\rm RE}_{\bm{\Sigma}}(s,c)=\inf\big\{\|\bm{\Sigma}^{\nicefrac{1}{2}}\bm{v}\|_{2}^{2}:\,\exists\,J\subset[K]\text{ s.t. }|J|\leq s,\ \|\bm{v}_{J^{c}}\|_{1}\leq c\|\bm{v}_{J}\|_{1}\ \text{and}\ \|\bm{v}_{J}\|_{2}=1\big\}.$$

$$\|\widehat{\bm{\pi}}-\bm{\pi}^{*}\|_{1}$$

$$\|\widehat{\bm{\pi}}-\bm{\pi}^{*}\|_{2}$$

$$\|\widehat{\bm{\pi}}-\bm{\pi}^{*}\|_{2}^{2}$$

$$\|f^{*}-f_{\bm{\pi}}\|_{L^{2}(P^{*})}^{2}$$

$$\frac{1}{n}\sum_{i=1}^{n}\ell\big(f_{\widehat{\bm{\pi}}}(\bm{X}_{i})\big)\leq\frac{1}{n}\sum_{i=1}^{n}\ell\big(f_{\bm{\pi}}(\bm{X}_{i})\big)-\frac{1}{2M^{2}n}\big\|\bar{\mathbf{Z}}(\widehat{\bm{\pi}}-\bm{\pi})\big\|_{2}^{2}.$$

$$\ell\big(f_{\bm{\pi}}(\bm{X}_{i})\big)={\rm KL}(f^{*}||f_{\bm{\pi}})-\int_{\mathcal{X}}f^{*}\log f^{*}\,d\nu+\varphi(\bm{\pi},\bm{X}_{i}).$$
Full text

Abstract

We study the maximum likelihood estimator of density of $n$ independent observations, under the assumption that it is well approximated by a mixture with a large number of components. The main focus is on statistical properties with respect to the Kullback-Leibler loss. We establish risk bounds taking the form of sharp oracle inequalities both in deviation and in expectation. A simple consequence of these bounds is that the maximum likelihood estimator attains the optimal rate $((\log K)/n)^{\nicefrac{{1}}{{2}}}$ , up to a possible logarithmic correction, in the problem of convex aggregation when the number $K$ of components is larger than $n^{\nicefrac{{1}}{{2}}}$ . More importantly, under the additional assumption that the Gram matrix of the components satisfies the compatibility condition, the obtained oracle inequalities yield the optimal rate in the sparsity scenario. That is, if the weight vector is (nearly) $D$ -sparse, we get the rate $(D\log K)/n$ . As a natural complement to our oracle inequalities, we introduce the notion of nearly- $D$ -sparse aggregation and establish matching lower bounds for this type of aggregation.

1 Introduction

Assume that we observe $n$ independent random vectors $\bm{X}_{1},\ldots,\bm{X}_{n}\in\mathcal{X}$ drawn from a probability distribution $P^{*}$ that admits a density function $f^{*}$ with respect to some reference measure $\nu$ . The goal is to estimate the unknown density by a mixture density. More precisely, we assume that for a given family of mixture components $f_{1},\ldots,f_{K}$ , the unknown density of the observations $f^{*}$ is well approximated by a convex combination $f_{\bm{\pi}}$ of these components, where

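$$f_{\bm{\pi}}(\bm{x})=\sum_{j=1}^{K}\pi_{j}f_{j}(\bm{x}),\quad\bm{\pi}\in\mathbb{B}_{+}^{K}=\Big\{\bm{\pi}\in[0,1]^{K}:\sum_{j=1}^{K}\pi_{j}=1\Big\}.$$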

The assumption that the component densities $\mathcal{F}=\{f_{j}:j\in[K]\}$ are known essentially means that they are chosen from a dictionary obtained on the basis of previous experiments or expert knowledge.

We focus on the problem of estimation of the density function $f_{\bm{\pi}}$ and the weight vector $\bm{\pi}$ from the simplex ${\mathbb{B}}_{+}^{K}$ under the sparsity scenario: the ambient dimension $K$ can be large, possibly larger than the sample size $n$ , but most entries of $\bm{\pi}$ are either equal to zero or very small.

Our goal is to investigate the statistical properties of the Maximum Likelihood Estimator (MLE), defined by

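$$\widehat{\bm{\pi}}\in\operatorname*{arg\,min}_{\bm{\pi}\in\Pi}\Big\{-\frac{1}{n}\sum_{i=1}^{n}\log f_{\bm{\pi}}(\bm{X}_{i})\Big\},\tag{3}$$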

where the minimum is computed over a suitably chosen subset $\Pi$ of ${\mathbb{B}}_{+}^{K}$ . In the present work, we will consider sets $\Pi=\Pi_{n}(\mu)$ , depending on a parameter $\mu>0$ and the sample $\{\bm{X}_{1},\ldots,\bm{X}_{n}\}$ , defined by

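$$\Pi_{n}(\mu)=\bigg\{\bm{\pi}\in\mathbb{B}_{+}^{K}:\min_{i\in[n]}\sum_{j=1}^{K}\pi_{j}f_{j}(\bm{X}_{i})\geq\mu\bigg\}.\tag{4}$$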

Note that the objective function in (3) is convex and the same is true for set (4). Therefore, the MLE $\widehat{\bm{\pi}}$ can be efficiently computed even for large $K$ by solving a problem of convex programming. To ease notation, very often, we will omit the dependence of $\Pi_{n}(\mu)$ on $\mu$ and write $\Pi_{n}$ instead of $\Pi_{n}(\mu)$ .
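To make the computational claim concrete, here is a minimal sketch of one way to approximate the minimizer of the objective in (3). It does not use a generic convex solver; it swaps in the classical EM fixed-point update for mixture weights with known components, which stays on the simplex and never increases the negative log-likelihood. The sine dictionary, the sample size, the rejection sampler and all function names are assumptions made for this illustration, and the constraint set is taken to be the whole simplex, which is harmless here because the illustrative densities are bounded below by 1/2.

```python
import math
import random


def mixture_mle_weights(Z, n_iter=500):
    """Approximate the MLE of mixture weights for known components.

    Z[i][j] holds f_j(X_i). Uses the EM fixed-point update
    pi_j <- pi_j * mean_i( Z[i][j] / (Z[i] . pi) ),
    which preserves the simplex constraint at every iteration.
    """
    n, K = len(Z), len(Z[0])
    pi = [1.0 / K] * K  # start from the barycenter of the simplex
    for _ in range(n_iter):
        mix = [sum(p * z for p, z in zip(pi, row)) for row in Z]
        pi = [
            pi[j] * sum(Z[i][j] / mix[i] for i in range(n)) / n
            for j in range(K)
        ]
    return pi


def neg_log_likelihood(Z, pi):
    """Empirical objective: -(1/n) * sum_i log(Z_i . pi)."""
    total = sum(math.log(sum(p * z for p, z in zip(pi, row))) for row in Z)
    return -total / len(Z)


def dictionary(x, K):
    """Hypothetical dictionary: f_k(x) = 1 + 0.5 * sin(2 * pi * k * x) on [0, 1]."""
    return [1.0 + 0.5 * math.sin(2.0 * math.pi * k * x) for k in range(1, K + 1)]


rng = random.Random(0)
K, n = 5, 400

# Draw the sample from f_1 by rejection sampling with a uniform proposal
# (valid because f_1 is bounded above by 1.5).
sample = []
while len(sample) < n:
    x, u = rng.random(), rng.random()
    if 1.5 * u <= dictionary(x, K)[0]:
        sample.append(x)

Z = [dictionary(x, K) for x in sample]
pi_hat = mixture_mle_weights(Z)
print([round(p, 3) for p in pi_hat])
```

The update needs no step size and no projection, which is why it is convenient for a sketch; for large $K$ a dedicated convex solver may well be preferable.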

The quality of an estimator $\widehat{\bm{\pi}}$ can be measured in various ways. For instance, one can consider the Kullback-Leibler divergence

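$${\rm KL}(f^{*}||f_{\widehat{\bm{\pi}}})=\begin{cases}\int_{\mathcal{X}}f^{*}(\bm{x})\,\log\frac{f^{*}(\bm{x})}{f_{\widehat{\bm{\pi}}}(\bm{x})}\,\nu(d\bm{x}),&\text{if }P^{*}\big(f^{*}(\bm{X})=0\text{ and }f_{\widehat{\bm{\pi}}}(\bm{X})>0\big)=0,\\ +\infty,&\text{otherwise},\end{cases}\tag{5}$$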

which has the advantage of bypassing identifiability issues. One can also consider the (well-specified) setting where $f^{*}=f_{\bm{\pi}^{*}}$ for some $\bm{\pi}^{*}\in{\mathbb{B}}^{K}_{+}$ and measure the quality of estimation through a distance between the vectors $\widehat{\bm{\pi}}$ and $\bm{\pi}^{*}$ (such as the $\ell_{1}$-norm $\|\widehat{\bm{\pi}}-\bm{\pi}^{*}\|_{1}$ or the Euclidean norm $\|\widehat{\bm{\pi}}-\bm{\pi}^{*}\|_{2}$).

The main contributions of the present work are the following:

(a) We demonstrate that in the mixture model there is no need to introduce a sparsity-favoring penalty in order to get optimal rates of estimation under the Kullback-Leibler loss in the sparsity scenario. In fact, the constraint that the weight vector belongs to the simplex acts as a sparsity-inducing penalty. As a consequence, there is no need to tune a parameter accounting for the magnitude of the penalty.

(b) We show that the maximum likelihood estimator of the mixture density simultaneously attains the optimal rate of aggregation for the Kullback-Leibler loss for at least three types of aggregation: model-selection, convex and $D$-sparse aggregation.

(c) We introduce a new type of aggregation, termed nearly $D$-sparse aggregation, that extends and unifies the notions of convex and $D$-sparse aggregation. We establish strong lower bounds for the nearly $D$-sparse aggregation and demonstrate that the maximum likelihood estimator attains this lower bound up to logarithmic factors.

1.1 Related work

The results developed in the present work aim to gain a better understanding (a) of the statistical properties of the maximum likelihood estimator over a high-dimensional simplex and (b) of the problem of aggregation of density estimators under the Kullback-Leibler loss. Various procedures of aggregation for density estimation have been studied in the literature with respect to different loss functions (we refer the interested reader to (Tsybakov, 2014) for an up-to-date introduction to aggregation of statistical procedures). Catoni (1997), Yang (2000) and Juditsky et al. (2008) investigated different variants of the progressive mixture rules, also known as mirror averaging (Yuditskiĭ et al., 2005; Dalalyan and Tsybakov, 2012), with respect to the Kullback-Leibler loss and established model-selection type oracle inequalities in expectation; that is, they proved that the expected loss of the aggregate is almost as small as the loss of the best element of the dictionary $\{f_{1},\ldots,f_{K}\}$. The same type of guarantees, but holding with high probability, were recently obtained in (Bellec, 2014; Butucea et al., 2016) for the procedure termed $Q$-aggregation, introduced in other contexts by (Dai et al., 2012; Rigollet, 2012).

Aggregation of estimators of a probability density function under the $L_{2}$-loss was considered in (Rigollet and Tsybakov, 2007), where it was shown that a suitably chosen unbiased risk estimate minimizer is optimal both for convex and linear aggregation. The goal in the present work is to go beyond the settings of the aforementioned papers: we want to do simultaneously as well as the best element of the dictionary, the best convex combination of the dictionary elements and the best sparse convex combination. Note that the latter task was coined $D$-aggregation in (Lounici, 2007) (see also (Bunea et al., 2007)). In the present work, we rename it $D$-sparse aggregation, in order to make explicit its relation to sparsity.

Key differences between the latter work and ours are that we do not assume the sparsity index to be known and that we analyze an aggregation strategy that is computationally tractable even for large $K$. This is also the case of (Bunea et al., 2010; Bertin et al., 2011), which are perhaps the most relevant references to the present work. These papers deal with the $L_{2}$-loss and investigate the lasso and the Dantzig estimators, respectively, suitably adapted to the problem of density estimation. Their methods handle dictionary elements $\{f_{j}\}$ which are not necessarily probability density functions, but have the drawback of requiring the choice of a tuning parameter. This choice is a nontrivial problem in practice. Instead, we show here that the optimal rates of sparse aggregation with respect to the Kullback-Leibler loss can be attained by a procedure that is free of tuning parameters.

Risk bounds for the maximum likelihood and other related estimators in the mixture model have a long history (Li and Barron, 1999; Li, 1999; Rakhlin et al., 2005). For the sake of comparison we recall here two elegant results providing non-asymptotic guarantees for the Kullback-Leibler loss.

Theorem 1.1 (Theorem 5.1 in (Li, 1999)).

Let $\mathcal{F}$ be a finite dictionary of cardinality $K$ of density functions such that $\max_{f\in\mathcal{F}}\|f^{*}/f\|_{\infty}\leq V$ . Then, the maximum likelihood estimator over $\mathcal{F}$ , $\widehat{f}^{\rm ML}_{\mathcal{F}}\in{\rm arg}\max_{f\in\mathcal{F}}\sum_{i=1}^{n}\log f(\bm{X}_{i})$ , satisfies the inequality

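$$\mathbf{E}_{f^{*}}\big[{\rm KL}\big(f^{*}||\widehat{f}^{\rm ML}_{\mathcal{F}}\big)\big]\leq\big(2+\log V\big)\bigg(\min_{f\in\mathcal{F}}{\rm KL}(f^{*}||f)+\frac{2\log K}{n}\bigg).\tag{6}$$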

Inequality (6) is an inexact oracle inequality in expectation that quantifies the ability of $\widehat{f}^{\rm ML}_{\mathcal{F}}$ to solve the problem of model-selection aggregation. The adjective inexact refers to the fact that the “bias term” $\min_{f\in\mathcal{F}}{\rm KL}(f^{*}||f)$ is multiplied by a factor strictly larger than one. It is noteworthy that the remainder term $\frac{2\log K}{n}$ corresponds to the optimal rate of model-selection aggregation (Juditsky and Nemirovski, 2000; Tsybakov, 2003). In relation with Theorem 1.1, it is worth mentioning a result of (Yang, 2000) and (Catoni, 1997), see also Theorem 5 in (Lecué, 2006) and Corollary 5.4 in (Juditsky et al., 2008), establishing a risk bound similar to (6) without the extra factor $2+\log V$ for the so-called mirror averaging aggregate.

Theorem 1.2 (page 226 in (Rakhlin et al., 2005)).

Let $\mathcal{F}$ be a finite dictionary of cardinality $K$ of density functions and let $\mathcal{C}_{k}=\big{\{}f_{\bm{\pi}}:\|\bm{\pi}\|_{0}\leq k\big{\}}$ be the set of all the mixtures of at most $k$ elements of $\mathcal{F}$ ( $k\in[K]$ ). Assume that $f^{*}$ and the densities $f_{k}$ from $\mathcal{F}$ are bounded from below and above by some positive constants $m$ and $M$ , respectively. Then, there is a constant $C$ depending only on $m$ and $M$ such that, for any tolerance level $\delta\in(0,1)$ , the maximum likelihood estimator over $\mathcal{C}_{k}$ , $\widehat{f}^{\rm ML}_{\mathcal{C}_{k}}\in{\rm arg}\max_{f\in\mathcal{C}_{k}}\sum_{i=1}^{n}\log f(\bm{X}_{i})$ , satisfies the inequality

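$${\rm KL}\big(f^{*}||\widehat{f}^{\rm ML}_{\mathcal{C}_{k}}\big)\leq\min_{f\in\mathcal{C}_{k}}{\rm KL}(f^{*}||f)+C\Big(\frac{\log(K/\delta)}{n}\Big)^{\nicefrac{1}{2}}\tag{7}$$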

with probability at least $1-\delta$ .

This result is remarkably elegant and can be seen as an exact oracle inequality in deviation for $D$ -sparse aggregation (for $D=k$ ). Furthermore, if we choose $k=K$ in Theorem 1.2, then we get an exact oracle inequality for convex aggregation with a rate-optimal remainder term (Tsybakov, 2003). However, it fails to provide the optimal rate for $D$ -sparse aggregation.

Closing this section, we would like to mention the recent work (Xia and Koltchinskii, 2016), where oracle inequalities for estimators of low rank density matrices are obtained. They share a common feature with those obtained in this work: the adaptation to the unknown sparsity or rank is achieved without any additional penalty term. The constraint that the unknown parameter belongs to the simplex acts as a sparsity inducing penalty.

1.2 Additional notation

In what follows, for any $i\in[n]$ , we denote by $\bm{Z}_{i}$ the vector $[f_{1}(\bm{X}_{i}),\ldots,f_{K}(\bm{X}_{i})]^{\top}$ and by $\mathbf{Z}$ the $n\times K$ matrix $[\bm{Z}^{\top}_{1},\ldots,\bm{Z}^{\top}_{n}]^{\top}$ . We also define $\ell(u)=-\log u$ , $u\in(0,+\infty)$ , so that the MLE $\widehat{\bm{\pi}}$ is the minimizer of the function

$$L_{n}(\bm{\pi})=\frac{1}{n}\sum_{i=1}^{n}\ell\big(\bm{Z}_{i}^{\top}\bm{\pi}\big).\tag{8}$$

For any set of indices $J\subseteq[K]$ and any $\bm{\pi}=(\pi_{1},\dots,\pi_{K})^{\top}\in{\mathbb{R}}^{K}$, we define $\bm{\pi}_{J}$ as the $K$-dimensional vector whose $j$-th coordinate equals $\pi_{j}$ if $j\in J$ and $0$ otherwise. We denote the cardinality of any $J\subseteq[K]$ by $|J|$. For any set $J\subset\{1,\dots,K\}$ and any constant $c\geq 0$, we introduce the compatibility constants (van de Geer and Bühlmann, 2009) of a $K\times K$ positive semidefinite matrix $\mathbf{A}$,

[TABLE]

The risk bounds established in the present work involve the factors $\kappa_{\mathbf{A}}(J,3)$ and $\bar{\kappa}_{\mathbf{A}}(J,1)$ . One can easily check that $\bar{\kappa}_{\mathbf{A}}(J,3)\leq\kappa_{\mathbf{A}}(J,3)\leq\frac{9}{4}\bar{\kappa}_{\mathbf{A}}(J,1)$ . We also recall that the compatibility constants of a matrix $\mathbf{A}$ are bounded from below by the smallest eigenvalue of $\mathbf{A}$ .

Let us fix a function $f_{0}:\mathcal{X}\to{\mathbb{R}}$ and denote $\bar{f}_{k}=f_{k}-f_{0}$ and $\bar{\bm{Z}}_{i}=[\bar{f}_{1}(\bm{X}_{i}),\ldots,\bar{f}_{K}(\bm{X}_{i})]^{\top}$ for $i\in[n]$. In the results of this work, the compatibility factors are used for the empirical and population Gram matrices of the vectors $\bar{\bm{Z}}_{i}$, that is, when $\mathbf{A}=\widehat{\bm{\Sigma}}_{n}$ and $\mathbf{A}=\bm{\Sigma}$ with

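$$\widehat{\bm{\Sigma}}_{n}=\frac{1}{n}\sum_{i=1}^{n}\bar{\bm{Z}}_{i}\bar{\bm{Z}}_{i}^{\top},\qquad\bm{\Sigma}=\mathbf{E}\big[\bar{\bm{Z}}_{1}\bar{\bm{Z}}_{1}^{\top}\big].$$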

The general entries of these matrices are respectively $(\widehat{\bm{\Sigma}}_{n})_{k,l}=\nicefrac{{1}}{{n}}\sum_{i=1}^{n}\bar{f}_{k}(\bm{X}_{i})\bar{f}_{l}(\bm{X}_{i})$ and $(\bm{\Sigma})_{k,l}=\mathbf{E}[\bar{f}_{k}(\bm{X}_{1})\bar{f}_{l}(\bm{X}_{1})]$ .

We assume that there exist positive constants $m$ and $M$ such that for all densities $f_{k}$ with $k\in[K]$ , we have

$$\forall\,\bm{x}\in\mathcal{X},\qquad m\leq f_{k}(\bm{x})\leq M.\tag{12}$$

We use the notation $V=M/m$ . It is worth mentioning that the set of dictionaries satisfying simultaneously this boundedness assumption and the aforementioned compatibility condition is not empty. For instance, one can consider the functions $f_{k}(x)=1+\nicefrac{{1}}{{2}}\sin(2\pi kx)$ for $k\in[K]$ . These functions are probability densities w.r.t. the Lebesgue measure on $\mathcal{X}=[0,1]$ . They are bounded from below and from above by $\nicefrac{{1}}{{2}}$ and $\nicefrac{{3}}{{2}}$ , respectively. Taking $f_{0}(x)=1$ , the corresponding Gram matrix is $\bm{\Sigma}=\nicefrac{{1}}{{8}}\,\mathbf{I}_{K}$ , which has all eigenvalues equal to $\nicefrac{{1}}{{8}}$ .
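The value of the Gram matrix in this example can be checked numerically. The sketch below approximates the integrals of $\bar{f}_{k}\bar{f}_{l}$ over $[0,1]$ with a midpoint rule, under the assumption that the expectation defining $\bm{\Sigma}$ is taken with respect to the uniform density on $[0,1]$; the function name and the grid size are illustrative choices.

```python
import math


def gram_matrix(K, grid=2000):
    """Midpoint-rule approximation of the integrals of bar_f_k * bar_f_l
    over [0, 1], where bar_f_k(x) = f_k(x) - 1 = 0.5 * sin(2 * pi * k * x).

    Assumption: the expectation is taken under the uniform density on [0, 1].
    """
    xs = [(i + 0.5) / grid for i in range(grid)]
    bar_f = [
        [0.5 * math.sin(2.0 * math.pi * k * x) for x in xs]
        for k in range(1, K + 1)
    ]
    return [
        [sum(a * b for a, b in zip(bar_f[k], bar_f[l])) / grid for l in range(K)]
        for k in range(K)
    ]


G = gram_matrix(4)
for row in G:
    print([round(v, 6) for v in row])
```

Up to rounding, the printed matrix has $\nicefrac{1}{8}$ on the diagonal and zeros elsewhere, in agreement with the orthogonality of the functions $\sin(2\pi kx)$ on $[0,1]$.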

1.3 Agenda

The rest of the paper is organized as follows. In Section 2, we state our main theoretical contributions and discuss their consequences. Possible relaxations of the conditions, as well as lower bounds showing the tightness of the established risk bounds, are considered in Section 3. A brief summary of the paper and some future directions of research are presented in Section 4. The proofs of all theoretical results are postponed to Section 5 and Section 6.

2 Oracle inequalities in deviation and in expectation

In this work, we prove several non-asymptotic risk bounds that imply, in particular, that the maximum likelihood estimator is optimal in model-selection aggregation, convex aggregation and $D$ -sparse aggregation (up to $\log$ -factors). In all the results of this section we assume the parameter $\mu$ in (4) to be equal to [math].

Theorem 2.1.

Let $\cal F$ be a set of $K\geq 4$ densities satisfying the boundedness condition (12). Denote by $f_{\widehat{\bm{\pi}}}$ the mixture density corresponding to the maximum likelihood estimator $\widehat{\bm{\pi}}$ over $\Pi_{n}$ defined in (8). There are constants $c_{1}\leq 32V^{3}$ , $c_{2}\leq 288M^{2}V^{6}$ and $c_{3}\leq 128M^{2}V^{6}$ such that, for any $\delta\in(0,\nicefrac{{1}}{{2}})$ , the following inequalities hold

[TABLE]

with probability at least $1-\delta$ .

The proofs of this and the subsequent results stated in this section are postponed to Section 5. Comparing the two inequalities of the above theorem, one can notice two differences. First, the term proportional to $\|\bm{\pi}_{J^{c}}\|_{1}$ is absent in the second risk bound, which means that the risk of the MLE is compared to that of the best mixture with a weight sequence supported by $J$. Hence, this risk bound is weaker than the first one provided by (13). Second, the compatibility factor $\bar{\kappa}_{\widehat{\bm{\Sigma}}_{n}}(J,1)$ in (14) is larger than its counterpart $\kappa_{\widehat{\bm{\Sigma}}_{n}}(J,3)$ in (13). This entails that in the cases where the oracle is expected to be sparse, the remainder term of the bound in (13) is slightly looser than that of (14).

A first and simple consequence of Theorem 2.1 is obtained by taking $J=\varnothing$ in the right-hand side of the first inequality. Then, $\|\bm{\pi}_{J^{c}}\|_{1}=\|\bm{\pi}\|_{1}=1$ and we get

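$${\rm KL}(f^{*}||f_{\widehat{\bm{\pi}}})\leq\inf_{\bm{\pi}\in\mathbb{B}_{+}^{K}}{\rm KL}(f^{*}||f_{\bm{\pi}})+c_{1}\Big(\frac{\log(K/\delta)}{n}\Big)^{\nicefrac{1}{2}}.\tag{15}$$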

This implies that for every dictionary $\cal F$, without any assumption on the smallness of the coherence between its elements, the maximum likelihood estimator achieves the optimal rate of convex aggregation, up to a possible logarithmic correction, in the high-dimensional regime $K\geq n^{\nicefrac{1}{2}}$. (In fact, the optimal rate of convex aggregation when $K\geq n^{\nicefrac{1}{2}}$ is of order $\big(\log(K/n^{\nicefrac{1}{2}})/n\big)^{\nicefrac{1}{2}}$. Therefore, even the $\log K$ term is optimal whenever $K\geq Cn^{\nicefrac{1}{2}+\alpha}$ for some $\alpha>0$.) In the case of regression with random design, an analogous result has been proved by Lecué and Mendelson (2013) and Lecué (2013). One can also remark that the upper bound in (15) is of the same form as the one of Theorem 1.2 stated in Section 1.1 above.

The main compelling feature of our results is that they show that the MLE adaptively achieves the optimal rate of aggregation not only in the case of convex aggregation, but also for the model-selection aggregation and $D$ -(convex) aggregation. For handling these two cases, it is more convenient to get rid of the presence of the compatibility factor of the empirical Gram matrix $\widehat{\bm{\Sigma}}_{n}$ . The latter can be replaced by the compatibility factor of the population Gram matrix, as stated in the next result.

Theorem 2.2.

Let $\cal F$ be a set of $K$ densities satisfying the boundedness condition (12). Denote by $f_{\widehat{\bm{\pi}}}$ the mixture density corresponding to the maximum likelihood estimator $\widehat{\bm{\pi}}$ over $\Pi_{n}$ defined in (8). There are constants $c_{4}\leq 32V^{3}+4$ , $c_{5}\leq 4.5M^{2}(8\,V^{3}+1)^{2}$ and $c_{6}\leq 2M^{2}(8\,V^{3}+1)^{2}$ such that, for any $\delta\in(0,\nicefrac{{1}}{{2}})$ , the following inequalities hold

[TABLE]

with probability at least $1-2\delta$ .

The main advantage of the upper bounds provided by Theorem 2.2 as compared with those of Theorem 2.1 is that the former are deterministic, whereas the latter involve the compatibility factor of the empirical Gram matrix, which is random. The price to pay for getting rid of randomness in the risk bound is the increased values of the constants $c_{4}$, $c_{5}$ and $c_{6}$. Note, however, that this price is not too high, since obviously $1\leq M\leq V$ and, therefore, $c_{4}\leq 1.25c_{1}$, $c_{5}\leq 1.56c_{2}$ and $c_{6}\leq 1.56c_{3}$. In addition, the absence of randomness in the risk bound allows us to integrate it and to convert the bound in deviation into a bound in expectation.

Theorem 2.3 (Bound in Expectation).

Let $\cal F$ be a set of $K$ densities satisfying the boundedness condition (12). Denote by $f_{\widehat{\bm{\pi}}}$ the mixture density corresponding to the maximum likelihood estimator $\widehat{\bm{\pi}}$ over $\Pi_{n}$ defined in (8). There are constants $c_{7}\leq 20V^{3}+8$ , $c_{8}\leq M^{2}(22V^{3}+3)^{2}$ and $c_{9}\leq M^{2}(15V^{3}+2)^{2}$ such that

[TABLE]

In inequality (19), upper bounding the infimum over all sets $J$ by the infimum over the singletons, we get

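$$\mathbf{E}[{\rm KL}(f^{*}||f_{\widehat{\bm{\pi}}})]\leq\inf_{j\in[K]}\bigg\{{\rm KL}(f^{*}||f_{j})+\frac{c_{9}\log K}{n\bar{\kappa}_{\bm{\Sigma}}(J,1)}\bigg\}.\tag{20}$$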

This implies that the maximum likelihood estimator $f_{\widehat{\pi}}$ achieves the rate $\frac{\log K}{n}$ in model-selection type aggregation. This rate is known to be optimal in the model of regression (Rigollet, 2012). If we compare this result with Theorem 1.1 stated in Section 1.1, we see that the remainder terms of these two oracle inequalities are of the same order (provided that the compatibility factor is bounded away from zero), but inequality (20) has the advantage of being exact.

We can also apply (19) to the problem of convex aggregation with a small dictionary, that is, for $K$ smaller than $n^{\nicefrac{1}{2}}$. Upper bounding $|J|$ by $K$, we get

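$$\mathbf{E}[{\rm KL}(f^{*}||f_{\widehat{\bm{\pi}}})]\leq\inf_{\bm{\pi}\in\mathbb{B}_{+}^{K}}{\rm KL}(f^{*}||f_{\bm{\pi}})+\frac{c_{9}K\log K}{n\bar{\kappa}_{\bm{\Sigma}}([K],1)}.\tag{21}$$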

Assuming, for instance, the smallest eigenvalue of $\bm{\Sigma}$ bounded away from zero (which is a quite reasonable assumption in the context of low dimensionality), the above upper bound provides a rate of convex aggregation of the order of $\frac{K\log K}{n}$ . Up to a logarithmic term, this rate is known to be optimal for convex aggregation in the model of regression.

Finally, considering all the sets $J$ of cardinality at most $D$ (with $D\leq K$) and setting $\bar{\kappa}_{\bm{\Sigma}}(D,1)=\inf_{J:|J|\leq D}\bar{\kappa}_{\bm{\Sigma}}(J,1)$, we deduce from (19) that

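$$\mathbf{E}[{\rm KL}(f^{*}||f_{\widehat{\bm{\pi}}})]\leq\inf_{\bm{\pi}\in\mathbb{B}_{+}^{K}:\|\bm{\pi}\|_{0}\leq D}{\rm KL}(f^{*}||f_{\bm{\pi}})+\frac{c_{9}D\log K}{n\bar{\kappa}_{\bm{\Sigma}}(D,1)}.\tag{22}$$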

According to (Rigollet and Tsybakov, 2011, Theorem 5.3), in the regression model, the optimal rate of $D$ -sparse aggregation is of order $(D/n)\log(K/D)$ , whenever $D=o(n^{\nicefrac{{1}}{{2}}})$ . Inequality (22) shows that the maximum likelihood estimator over the simplex achieves this rate up to a logarithmic factor. Furthermore, this logarithmic inflation disappears when the sparsity $D$ is such that, asymptotically, the ratio $\frac{\log D}{\log K}$ is bounded from above by a constant $\alpha<1$ . Indeed, in such a situation the optimal rate $\frac{D\log(K/D)}{n}=\frac{D\log K}{n}(1-\frac{\log D}{\log K})$ is of the same order as the remainder term in (22), that is $\frac{D\log K}{n}$ .
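The size of this logarithmic inflation is easy to tabulate. The sketch below compares the remainder term $\frac{D\log K}{n}$ with the optimal rate $\frac{D\log(K/D)}{n}$, with all constants set to one; the function name and the values of $n$, $K$ and $D$ are illustrative assumptions, not quantities taken from the paper.

```python
import math


def sparse_rates(n, K, D):
    """Return (D * log(K) / n, D * log(K / D) / n), constants omitted."""
    upper = D * math.log(K) / n
    optimal = D * math.log(K / D) / n
    return upper, optimal


n, K = 10**6, 10**6
for D in (1, 10, 100):
    upper, optimal = sparse_rates(n, K, D)
    ratio = optimal / upper  # algebraically equal to 1 - log(D) / log(K)
    print(f"D={D:4d}  upper={upper:.3e}  optimal={optimal:.3e}  ratio={ratio:.3f}")
```

As long as $\frac{\log D}{\log K}$ stays bounded away from one, the ratio stays bounded away from zero, so the two rates differ only by a constant factor.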

3 Discussion of the conditions and possible extensions

In this section, we start by stating lower bounds for the Kullback-Leibler aggregation in the problem of density estimation. Then we discuss the implications of the risk bounds of the previous section for the case where the target is the weight vector $\bm{\pi}$ rather than the mixture density $f_{\bm{\pi}}$. Finally, we present some extensions to the case where the boundedness assumption is violated.

3.1 Lower bounds for nearly- $D$ -sparse aggregation

As mentioned in the previous section, the literature is replete with lower bounds on the minimax risk for various types of aggregation. However, most of them concern the regression setting, either with random or with deterministic design. Lower bounds of aggregation for density estimation were first established by Rigollet (2006) for the $L_{2}$-loss. In the case of Kullback-Leibler aggregation in density estimation, the only lower bounds we are aware of are those established by Lecué (2006) for model-selection type aggregation. It is worth emphasizing here that the results of the aforementioned two papers provide weak lower bounds. Indeed, they establish the existence of a dictionary for which the minimax excess risk is lower bounded by the suitable quantity. In contrast with this, we establish here strong lower bounds that hold for every dictionary satisfying the boundedness and the compatibility conditions.

Let $\mathcal{F}=\{f_{1},\ldots,f_{K}\}$ be a dictionary of density functions on $\mathcal{X}=[0,1]$ . We say that the dictionary $\mathcal{F}$ satisfies the boundedness and the compatibility assumptions if for some positive constants $m,M$ and $\kappa$ , we have $m\leq f_{j}(x)\leq M$ for all $j\in[K]$ , $x\in\mathcal{X}$ . In addition, we assume in this subsection that all the eigenvalues of the Gram matrix $\bm{\Sigma}$ belong to the interval $[\varkappa_{*},\varkappa^{*}]$ , with $\varkappa_{*}>0$ and $\varkappa^{*}<\infty$ .

For every $\gamma\in(0,1)$ and any $D\in[K]$ , we define the set of nearly- $D$ -sparse convex combinations of the dictionary elements $f_{j}\in\mathcal{F}$ by

[TABLE]

In simple words, $f_{\bm{\pi}}$ belongs to $\mathcal{H}_{\mathcal{F}}(\gamma,D)$ if it admits a $\gamma$ -approximately $D$ -sparse representation in the dictionary $\mathcal{F}$ . We are interested in bounding from below the minimax excess risk

[TABLE]

where the $\inf$ is over all possible estimators of $f^{*}$ and the $\sup$ is over all density functions over $[0,1]$ . Note that the estimator $\widehat{f}$ is not necessarily a convex combination of the dictionary elements. Furthermore, it is allowed to depend on the parameters $\gamma$ and $D$ characterizing the class $\mathcal{H}_{\mathcal{F}}(\gamma,D)$ . It follows from (18) that if the dictionary satisfies the boundedness and the compatibility conditions, then

[TABLE]

for some constant $C$ depending only on $m,M$ and $\varkappa_{*}$ . Note that the last term accounts for the following phenomenon: If the sparsity index $D$ is larger than a multiple of $\sqrt{n}$ , then the sparsity bears no advantage as compared to the $\ell_{1}$ constraint. The next result implies that this upper bound is optimal, at least up to logarithmic factors.

Theorem 3.1.

Assume that $\log(1+eK)\leq n$ . Let $\gamma\in(0,1)$ and $D\in[K]$ be fixed. There exists a constant $A$ depending only on $m$ , $M$ , $\varkappa_{*}$ and $\varkappa^{*}$ such that

[TABLE]

This is the first result providing lower bounds on the minimax risk of aggregation over nearly- $D$ -sparse aggregates. To the best of our knowledge, even in the Gaussian sequence model, such a result has not been established to date. It has the advantage of unifying the results on convex and $D$ -sparse aggregation, as well as extending them to a more general class. Let us also stress that the condition $\log(1+eK)\leq n$ is natural and unavoidable, since it ensures that the right hand side of (25) is smaller than the trivial bound $\log V$ .

3.2 Weight vector estimation

The risk bounds carried out in the previous section for the problem of density estimation in the Kullback-Leibler loss imply risk bounds for the problem of weight vector estimation. Indeed, under the boundedness assumption (12), the Kullback-Leibler divergence between two mixture densities can be shown to be equivalent to the squared Mahalanobis distance between the weight vectors of these mixtures with respect to the Gram matrix. In order to go from the Mahalanobis distance to the Euclidean one, we make use of the restricted eigenvalue

[TABLE]

This strategy leads to the next result.

Proposition 1.

Let $\cal F$ be a set of $K\geq 4$ densities satisfying condition (12). Denote by $f_{\widehat{\bm{\pi}}}$ the mixture density corresponding to the maximum likelihood estimator $\widehat{\bm{\pi}}$ over $\Pi_{n}$ defined in (8). Let $\bm{\pi}^{*}$ be the weight vector of the best mixture density: $\bm{\pi}^{*}\in\text{\rm arg}\min_{\bm{\pi}}{\rm KL}(f^{*}||f_{\bm{\pi}})$ , and let $J^{*}$ be the support of $\bm{\pi}^{*}$ . There are constants $c_{10}\leq M^{2}(64V^{3}+8)$ and $c_{11}\leq 4M^{2}(8V^{3}+1)$ such that, for any $\delta\in(0,\nicefrac{{1}}{{2}})$ , the following inequalities hold

[TABLE]

with probability at least $1-2\delta$ .

In simple words, this result tells us that the weight estimator $\widehat{\bm{\pi}}$ attains the minimax rate of estimation $|J^{*}|(\frac{\log(K)}{n})^{\nicefrac{{1}}{{2}}}$ over the intersection of the $\ell_{1}$ and $\ell_{0}$ balls, when the error is measured by the $\ell_{1}$ -norm, provided that the compatibility factor of the dictionary $\mathcal{F}$ is bounded away from zero. The optimality of this rate, up to logarithmic factors, follows from the fact that the error of estimation of each nonzero coefficient of $\bm{\pi}^{*}$ is at least $cn^{-\nicefrac{{1}}{{2}}}$ (for some $c>0$ ), leading to a sum of the absolute values of the errors at least of the order $|J^{*}|n^{-\nicefrac{{1}}{{2}}}$ . The logarithmic inflation of the rate is the price to pay for not knowing the support $J^{*}$ . It is clear that this reasoning is valid only when the sparsity $|J^{*}|$ is of smaller order than $n^{\nicefrac{{1}}{{2}}}$ . Indeed, in the case $|J^{*}|\geq cn^{\nicefrac{{1}}{{2}}}$ , the trivial bound $\|\widehat{\bm{\pi}}-\bm{\pi}^{*}\|_{1}\leq 2$ is tighter than the one in (28).

Concerning the risk measured by the Euclidean norm, we underline that there are two regimes characterized by the order between the upper bounds in (29) and (30). Roughly speaking, when the signal is highly sparse in the sense that $|J^{*}|$ is smaller than $(n/\log K)^{\nicefrac{{1}}{{2}}}$ , then the smallest bound is given by (29) and is of the order $\frac{|J^{*}|\log(K)}{n}$ . This rate can be compared to the rate $\frac{|J^{*}|\log(K/|J^{*}|)}{n}$ , known to be optimal in the Gaussian sequence model. In the second regime corresponding to mild sparsity, $|J^{*}|>(n/\log K)^{\nicefrac{{1}}{{2}}}$ , the smallest bound is the one in (30). The latter is of order $(\frac{\log(K)}{n})^{\nicefrac{{1}}{{2}}}$ , which is known to be optimal in the Gaussian sequence model. For various results providing lower bounds in the regression framework we refer the interested reader to (Raskutti et al., 2011; Rigollet and Tsybakov, 2011; Wang et al., 2014).
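The two regimes can be made concrete with a small sketch. It compares only the orders of magnitude quoted in the text for (29) and (30), with all constants dropped; the function names and the sample values of $n$ and $K$ are hypothetical.

```python
import math

def l2_bound_orders(sparsity, n, K):
    """Orders of the two Euclidean-norm bounds, with all constants dropped."""
    sparse_bound = sparsity * math.log(K) / n    # order quoted for (29)
    dense_bound = math.sqrt(math.log(K) / n)     # order quoted for (30)
    return sparse_bound, dense_bound

def regime(sparsity, n, K):
    """Name the regime according to the threshold (n / log K)^(1/2)."""
    threshold = math.sqrt(n / math.log(K))
    return "high sparsity" if sparsity < threshold else "mild sparsity"

n, K = 10_000, 1_000
for s in (10, 100):
    print(s, regime(s, n, K), l2_bound_orders(s, n, K))
```

The threshold $(n/\log K)^{\nicefrac{{1}}{{2}}}$ is exactly the sparsity level at which the two orders coincide.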

3.3 Extensions to the case of vanishing components

In the previous sections we have deliberately avoided any discussion of the role of the parameter $\mu$ , present in the search space $\Pi_{n}(\mu)$ of the problem (3)-(4). In fact, when all the dictionary elements are separated from zero by a constant $m$ , a condition assumed throughout the previous sections, choosing any value of $\mu\leq m$ is equivalent to choosing $\mu=0$ . Therefore, the choice of this parameter does not impact the quality of estimation. However, this parameter might have a strong influence in practice on both the statistical and the computational complexity of the maximum likelihood estimator. A first step in understanding the influence of $\mu$ on the statistical complexity is made in the next paragraphs.

Let us consider the case where the condition $\min_{x}\min_{j}f_{j}(x)\geq m>0$ fails, but the upper-boundedness condition $\max_{x}\max_{j}f_{j}(x)\leq M$ holds true. In such a situation, we replace the definition $V=M/m$ by $V=M/\mu$ . We also define the set $\Pi^{*}(\mu)=\big{\{}\bm{\pi}\in{\mathbb{B}}^{K}_{+}:P^{*}\big{(}f_{\bm{\pi}}(\bm{X})\geq\mu\big{)}=1\big{\}}$ . In order to keep mathematical formulae simple, we will only state the equivalent of (14) in the case of $m=0$ . All the other results of the previous section can be extended in a similar way.

Proposition 2.

Let $\cal F$ be a set of $K\geq 2$ densities satisfying the boundedness condition $\sup_{\bm{x}\in\mathcal{X}}f_{j}(\bm{x})\leq M$ . Denote by $f_{\widehat{\bm{\pi}}}$ the mixture density corresponding to the maximum likelihood estimator $\widehat{\bm{\pi}}$ over $\Pi_{n}(\mu)$ defined in (8). There is a constant $\bar{c}\leq 128M^{2}V^{4}$ such that, for any $\delta\in(0,\nicefrac{{1}}{{2}})$ ,

[TABLE]

on an event of probability at least $1-\delta$ . Furthermore, if $\inf_{\bm{x}\in\mathcal{X}}f^{*}(\bm{x})\geq\mu$ , then, on the same event, we have

[TABLE]

The last term present in the first upper bound, $\int_{\mathcal{X}}(\log\mu-\log f_{\widehat{\bm{\pi}}})_{+}f^{*}d\nu$ , is the price we pay for considering densities that are not lower bounded by a given constant. A simple, non-random upper bound on this term is $\int_{\mathcal{X}}\max_{k\in[K]}(\log\mu-\log f_{k})_{+}f^{*}d\nu$ . Providing a tight upper bound on this kind of remainder term is an important problem which lies beyond the scope of the present work.
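One way to verify the non-random bound, using only the notation above: since $f_{\widehat{\bm{\pi}}}$ is a convex combination of the dictionary elements, we have $f_{\widehat{\bm{\pi}}}(\bm{x})\geq\min_{k\in[K]}f_{k}(\bm{x})$ for every $\bm{x}$ , and the map $t\mapsto(\log\mu-\log t)_{+}$ is nonincreasing. Hence, pointwise,

$$\big(\log\mu-\log f_{\widehat{\bm{\pi}}}(\bm{x})\big)_{+}\leq\Big(\log\mu-\log\min_{k\in[K]}f_{k}(\bm{x})\Big)_{+}=\max_{k\in[K]}\big(\log\mu-\log f_{k}(\bm{x})\big)_{+}.$$

Multiplying both sides by $f^{*}$ and integrating over $\mathcal{X}$ gives the stated bound.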

4 Conclusion

In this paper, we have established exact oracle inequalities for the maximum likelihood estimator of a mixture density. These oracle inequalities clearly highlight the interplay of three sources of error: misspecification of the model of mixture, departure from $D$ -sparsity and stochastic error of estimating $D$ nonzero coefficients. We have also proved a lower bound that shows that the remainder terms of our upper bounds are optimal, up to logarithmic terms. This lower bound is valid not only for the maximum likelihood estimator, but for any estimator of the density function. As a consequence, the maximum likelihood estimator has a nearly optimal excess risk in the minimax sense.

In all the results of the present paper, we have assumed that the components of the mixture model are deterministic. From a practical point of view, it might be reasonable to choose these components in a data driven way, using, for instance, a hold-out sample. This question, as well as the problem of tuning the parameter $\mu$ , constitute interesting and challenging avenues for future research.

5 Proofs of results stated in previous sections

This section collects the proofs of the theorems and claims stated in previous sections.

5.1 Proof of Theorem 2.1

The main technical ingredients of the proof are a strong convexity argument and a control of the maximum of an empirical process. The corresponding results are stated in Lemma 5.2 and Proposition 5.1, respectively, deferred to Section 5.6. We denote by $\bar{\mathbf{Z}}$ the $n\times K$ matrix $[\bar{\bm{Z}}_{1},\ldots,\bar{\bm{Z}}_{K}]$ .

Since $\widehat{\bm{\pi}}$ is a minimizer of $L_{n}(\cdot)$ , see (3) and (8), we know that $L_{n}(\widehat{\bm{\pi}})\leq L_{n}(\bm{\pi})$ for every $\bm{\pi}$ . However, this inequality can be made sharper using the (local) strong convexity of the function $\ell(u)=-\log(u)$ . Indeed, Lemma 5.2 below shows that

[TABLE]

On the other hand, if we set $\varphi(\pi,\bm{x})=\int(\log f_{\bm{\pi}})f^{*}d\nu-\log f_{\bm{\pi}}(\bm{x})$ , we have $\mathbf{E}_{f^{*}}[\varphi(\bm{\pi},\bm{X}_{i})]=0$ and

[TABLE]

Combining inequalities (33) and (34), we get

[TABLE]

The next step of the proof consists in establishing a suitable upper bound on the noise term $\Phi_{n}(\bm{\pi})-\Phi_{n}(\widehat{\bm{\pi}})$ where

[TABLE]

According to the mean value theorem, setting $\zeta_{n}:=\sup_{\bar{\bm{\pi}}\in\bm{\Pi}_{n}}\big{\|}\nabla\Phi_{n}(\bar{\bm{\pi}})\big{\|}_{\infty}$ , for every vector $\bm{\pi}\in\bm{\Pi}_{n}$ , it holds that

[TABLE]

This inequality, combined with (35), yields

[TABLE]

Using the Gram matrix $\widehat{\bm{\Sigma}}_{n}=\nicefrac{{1}}{{n}}\bar{\mathbf{Z}}^{\top}\bar{\mathbf{Z}}$ , the quantity $\|\bar{\mathbf{Z}}(\widehat{\bm{\pi}}-\bm{\pi})\|_{2}$ can be rewritten as

[TABLE]

We proceed with applying the following result (Bellec et al., 2016, Lemma 2).

Lemma 5.1 (Bellec et al. (2016), Lemma 2).

For any pair of vectors $\bm{\pi},\bm{\pi}^{\prime}\in{\mathbb{R}}^{K}$ , for any pair of scalars $\mu>0$ and $\gamma>1$ , for any $K\times K$ symmetric matrix $\mathbf{A}$ and for any set $J\subset[K]$ , the following inequality is true

[TABLE]

where $c_{\gamma}=(\gamma+1)/(\gamma-1)$ .

Choosing $\mathbf{A}=\widehat{\bm{\Sigma}}_{n}^{\nicefrac{{1}}{{2}}}/(\sqrt{2}\,M)$ , $\mu=\zeta_{n}$ and $\gamma=2$ (thus $c_{\gamma}=3$ ) we get the inequality

[TABLE]

One can check that $\kappa_{\mathbf{A}^{2}}(J,3)=\kappa_{\widehat{\bm{\Sigma}}_{n}}(J,3)/(2M^{2})$ . Combining the last inequality with (38), we arrive at

[TABLE]

Since the last inequality holds for every $\bm{\pi}$ , we can insert an $\inf_{\bm{\pi}}$ in the right hand side. Furthermore, in view of Proposition 5.1 below, with probability larger than $1-\delta$ , $\zeta_{n}$ is bounded from above by $8V^{3}(\frac{\log(K/\delta)}{n})^{\nicefrac{{1}}{{2}}}$ . This completes the proof of (13).

To prove (14), we follow the same steps as above up to inequality (38). Then, we remark that for every $\bm{\pi}$ in the simplex satisfying $\bm{\pi}_{J^{c}}=0$ , it holds

[TABLE]

Therefore, we have with probability at least $1-\delta$

[TABLE]

Replacing the right hand term in (38) and taking the infimum, we get the claim of the corollary. Since, in view of Proposition 5.1 below, with probability larger than $1-\delta$ , $\zeta_{n}$ is bounded from above by $8V^{3}(\frac{\log(K/\delta)}{n})^{\nicefrac{{1}}{{2}}}$ , we get the claim of (14).

5.2 Proof of Theorem 2.2

Let us denote $\bm{v}=\widehat{\bm{\pi}}-\bm{\pi}$ . According to (38) and (39), we have

[TABLE]

As $\bm{v}$ is the difference of two vectors lying on the simplex, we have $\|\bm{v}\|_{1}\leq 2$ . Let $\|\bm{\Sigma}-\widehat{\bm{\Sigma}}_{n}\|_{\infty}=\max_{j,j^{\prime}}|(\bm{\Sigma}-\widehat{\bm{\Sigma}}_{n})_{j,j^{\prime}}|$ stand for the largest (in absolute values) element of the matrix $\bm{\Sigma}-\widehat{\bm{\Sigma}}_{n}$ . We have

[TABLE]

Setting $\bar{\zeta}_{n}=\zeta_{n}+M^{-2}\|\bm{\Sigma}-\widehat{\bm{\Sigma}}_{n}\|_{\infty}$ , we get

[TABLE]

Following the same steps as those used for obtaining (42), we arrive at

[TABLE]

The last step consists in evaluating the quantiles of the random variable $\bar{\zeta}_{n}$ . To this end, one checks that the Hoeffding inequality combined with the union bound yields

[TABLE]

In other terms, for every $\delta\in(0,1)$ , we have

[TABLE]

Note that for $\delta\leq 1$ , we have $\log(K^{2}/\delta)\leq 2\log(K/\delta)$ . Combining with Proposition 5.1, this implies that $\bar{\zeta}_{n}\leq(8V^{3}+1)\big{(}\frac{\log(K/\delta)}{n}\big{)}^{\nicefrac{{1}}{{2}}}$ with probability larger than $1-2\delta$ . This completes the proof of (16). The proof of (17) is omitted since it repeats the same arguments as those used for proving (14).
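The elementary inequality on the logarithms used above can be checked in one line. Since $\log(1/\delta)\geq 0$ for $\delta\leq 1$ ,

$$\log(K^{2}/\delta)=2\log K+\log(1/\delta)\leq 2\log K+2\log(1/\delta)=2\log(K/\delta).$$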

5.3 Proof of Theorem 2.3

According to (51), for any $\bm{\pi}\in\Pi$ and any $J\subset\{1,\dots,K\}$ , we have

[TABLE]

Recall now that $\bar{\zeta}_{n}=\zeta_{n}+M^{-2}\|\widehat{\bm{\Sigma}}_{n}-\bm{\Sigma}\|_{\infty}$ and, according to Proposition 5.1, we have

[TABLE]

Using Theorem 6.2, one easily checks that

[TABLE]

This implies that

[TABLE]

Similarly, in view of the Efron-Stein inequality, we have ${\bf Var}[\|\widehat{\bm{\Sigma}}_{n}-\bm{\Sigma}\|_{\infty}]\leq\frac{M^{4}}{2n}$ . This implies that

[TABLE]

Combining (57), (60) and (54), we get the desired result.

5.4 Proof of Proposition 1

Using the strong convexity of the function $u\mapsto-\log u$ over the interval $[m,M]$ and the fact that $\bm{\pi}^{*}$ minimizes the convex function $\bm{\pi}\mapsto{\rm KL}(f^{*}||f_{\bm{\pi}})$ , we get

[TABLE]

Combining with (50), in which we replace $\bm{\pi}$ by $\bm{\pi}^{*}$ , we get

[TABLE]

Let us set $\bm{v}=\widehat{\bm{\pi}}-\bm{\pi}^{*}$ . If $\bm{v}=0$ , then the claims are trivial. In the rest of this proof, we assume $\|\bm{v}\|_{1}>0$ . In view of (43), we have $\|\bm{v}\|_{1}\leq 2\|\bm{v}_{J^{*}}\|_{1}$ . Therefore, using the definition of the compatibility factor, we get

[TABLE]

We have already checked that $\bar{\zeta}_{n}\leq(8V^{3}+1)\big{(}\frac{\log(K/\delta)}{n}\big{)}^{\nicefrac{{1}}{{2}}}$ with probability larger than $1-2\delta$ . Dividing both sides of inequality (63) by $\|\bm{v}\|_{1}$ and using the aforementioned upper bound on $\bar{\zeta}_{n}$ , we get the desired bound on $\|\bm{v}\|_{1}=\|\widehat{\bm{\pi}}-\bm{\pi}^{*}\|_{1}$ .

In order to bound the error $\bm{v}=\widehat{\bm{\pi}}-\bm{\pi}^{*}$ in the Euclidean norm, we denote by $\widehat{J}$ the set of $D=|J^{*}|$ indices corresponding to $D$ largest entries of the vector $(|v_{1}|,\ldots,|v_{K}|)$ . Since $\|\bm{v}\|_{1}\leq 2\|\bm{v}_{J^{*}}\|_{1}$ , we clearly have $\|\bm{v}\|_{1}\leq 2\|\bm{v}_{\widehat{J}}\|_{1}$ . Therefore,

[TABLE]

Combining this inequality with the definition of the restricted eigenvalue and inequality (62) above, we arrive at

[TABLE]

Dividing both sides by $\|\bm{v}_{\widehat{J}}\|_{2}$ , taking the square and using (67), we get

[TABLE]

This inequality, in conjunction with the upper bound on $\bar{\zeta}_{n}$ used above, completes the proof of the second claim.

5.5 Proof of Proposition 2

We repeat the proof of Theorem 2.1 with some small modifications. First of all, we replace the function $\ell(u)=-\log(u)$ by the function

[TABLE]

One easily checks that this function is twice continuously differentiable with a second derivative satisfying $M^{-2}\leq\bar{\ell}^{\prime\prime}(u)\leq\mu^{-2}$ for every $u\in(0,M)$ . Furthermore, since $\bar{\ell}(u)=\ell(u/\mu)$ for every $u\geq\mu$ , we have $\bar{L}_{n}(\widehat{\bm{\pi}})=L_{n}(\widehat{\bm{\pi}})$ , where we have used the notation $\bar{L}_{n}(\bm{\pi})=\frac{1}{n}\sum_{i=1}^{n}\bar{\ell}(f_{\bm{\pi}}(\bm{X}_{i}))$ . Therefore, similarly to (33), we get

[TABLE]

for every $\bm{\pi}\in\Pi^{*}(\mu)$ . Let us define $\bar{\varphi}(\bm{\pi},\bm{x})=\bar{\ell}(f_{\bm{\pi}}(\bm{x}))-\int\bar{\ell}(f_{\bm{\pi}})f^{*}d\nu$ and $\bar{\Phi}_{n}(\bm{\pi})=\frac{1}{n}\sum_{i=1}^{n}\bar{\varphi}(\bm{\pi},\bm{X}_{i})$ . We have

[TABLE]

Notice that $\bm{\pi}\in\Pi^{*}(\mu)$ implies that $\bar{\ell}(f_{\bm{\pi}})=\log\mu-\log f_{\bm{\pi}}$ and that $\bar{\ell}(f_{\widehat{\bm{\pi}}})\geq\log\mu-\log f_{\widehat{\bm{\pi}}}-(\log\mu-\log f_{\widehat{\bm{\pi}}})_{+}$ . Therefore, along the lines of the proof of (14) (see, namely, (46)), we get

[TABLE]

We can now repeat the arguments of Proposition 5.1 with some minor modifications. We first rewrite $\xi_{n}$ as $\xi_{n}=\max_{l=1,\ldots,K}\xi_{l,n}$ with $\xi_{l,n}=\sup_{\bm{\pi}\in\Pi_{n}(0)}|\partial_{l}\bar{\Phi}_{n}(\bm{\pi})|$ . One checks that the bounded difference inequality and the Efron-Stein inequality can be applied with an additional factor 2, since for $F_{l}(\mathbf{X})=\sup_{\bm{\pi}\in\Pi_{n}(0)}|\partial_{l}\bar{\Phi}_{n}(\bm{\pi})|$ , we have

[TABLE]

Therefore, for every $l\in[K]$ , with probability larger than $1-(\delta/K)$ , we have $\xi_{l,n}\leq\mathbf{E}[\xi_{l,n}]+V(\frac{2\log(K/\delta)}{n})^{\nicefrac{{1}}{{2}}}$ and ${\bf Var}[\xi_{n}]\leq(2V)^{2}/n$ . By the union bound, we obtain that with probability larger than $1-\delta$ , $\xi_{n}\leq\max_{l}\mathbf{E}[\xi_{l,n}]+V(\frac{2\log(K/\delta)}{n})^{\nicefrac{{1}}{{2}}}$ . Thus, to upper bound $\mathbf{E}[\xi_{l,n}]$ , we use the symmetrization argument:

[TABLE]

Note that the function $\bar{\ell}^{\prime}$ , the derivative of $\bar{\ell}$ defined in (70), is by construction Lipschitz with constant $1/\mu^{2}$ . Therefore, in view of the contraction principle,

[TABLE]

As a consequence, we proved that with probability larger than $1-\delta$ , we have $\xi_{n}\leq 8V^{2}(\frac{\log K}{n})^{\nicefrac{{1}}{{2}}}$ . This completes the proof of the first inequality. In order to prove the second one, we simply change the way we have evaluated the term $\int\bar{\ell}(f_{\widehat{\bm{\pi}}})f^{*}$ in the left hand side of (72). Since $\bar{\ell}$ is strongly convex with a second order derivative bounded from below by $1/M^{2}$ , we have $\bar{\ell}(f_{\widehat{\bm{\pi}}})\geq\bar{\ell}(f^{*})+\bar{\ell}^{\prime}(f^{*})(f_{\widehat{\bm{\pi}}}-f^{*})+\frac{1}{2M^{2}}(f_{\widehat{\bm{\pi}}}-f^{*})^{2}$ . Since $f^{*}$ is always larger than $\mu$ , the derivative $\bar{\ell}^{\prime}(f^{*})$ equals $1/f^{*}$ . Integrating over $\mathcal{X}$ , we get the second inequality of the proposition.

5.6 Auxiliary results

We start with a general convexity result, based on the strong convexity of the $-\log$ function, to derive a bound on the estimated log-likelihood.

Lemma 5.2.

Let us assume that $M=\max_{j\in[K]}\|f_{j}\|_{\infty}<\infty$ . Then, for any $\bm{\pi}\in{\mathbb{B}}^{K}_{+}$ , it holds that

[TABLE]

Proof.

Recall that $\widehat{\bm{\pi}}$ minimizes the function $L_{n}$ defined in (8) over $\Pi_{n}$ . Furthermore, the function $u\mapsto\ell(u)$ is clearly strongly convex with a second order derivative bounded from below by $1/M^{2}$ over the set $u\in(0,M]$ . Therefore, for every $\widehat{u}\in(0,M]$ , the function $\widetilde{\ell}$ given by:

[TABLE]

is convex. This implies that the mapping

[TABLE]

is convex over the set $\bm{\pi}\in{\mathbb{B}}^{K}_{+}$ . This yields (we denote by $\partial g$ the sub-differential of a convex function $g$ )

[TABLE]

Using the Karush-Kuhn-Tucker conditions and the fact that $\widehat{\bm{\pi}}$ minimizes $L_{n}$ , we get $\mathbf{0}_{K}\in\partial\,L_{n}(\widehat{\bm{\pi}})=\partial\,\widetilde{L}_{n}(\widehat{\bm{\pi}})$ . This readily gives $\widetilde{L}_{n}(\bm{\pi})-\widetilde{L}_{n}(\widehat{\bm{\pi}})\geq 0$ , for any $\bm{\pi}\in{\mathbb{B}}^{K}_{+}$ . The last step is to remark that $\mathbf{Z}(\widehat{\bm{\pi}}-\bm{\pi})=\bar{\mathbf{Z}}(\widehat{\bm{\pi}}-\bm{\pi})$ , since both $\widehat{\bm{\pi}}$ and $\bm{\pi}$ have entries summing to one. ∎
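The optimality condition used in this proof can be explored numerically. The sketch below is an illustration under explicit assumptions, not the procedure analysed in the paper: it takes the search set to be the whole simplex, uses a hypothetical three-element dictionary on $[0,1]$ and a deterministic toy sample, and computes the weights with the classical EM fixed-point iteration for mixture weights (an algorithm the paper does not discuss). For the simplex constraint, the Karush-Kuhn-Tucker conditions say that the scores $s_{j}=\frac{1}{n}\sum_{i}f_{j}(\bm{X}_{i})/f_{\widehat{\bm{\pi}}}(\bm{X}_{i})$ are at most one, with equality whenever $\widehat{\pi}_{j}>0$ .

```python
import math

def neg_log_lik(weights, Z):
    """L_n(pi) = -(1/n) * sum_i log f_pi(X_i), where Z[i][j] = f_j(X_i)."""
    total = sum(math.log(sum(w * z for w, z in zip(weights, row))) for row in Z)
    return -total / len(Z)

def scores(weights, Z):
    """s_j = (1/n) * sum_i f_j(X_i) / f_pi(X_i), i.e. minus the gradient of L_n."""
    n, K = len(Z), len(weights)
    out = [0.0] * K
    for row in Z:
        mix = sum(w * z for w, z in zip(weights, row))
        for j in range(K):
            out[j] += row[j] / (n * mix)
    return out

def em_weights(Z, n_iter=500):
    """EM fixed-point iteration pi_j <- pi_j * s_j(pi), started at uniform weights."""
    K = len(Z[0])
    weights = [1.0 / K] * K
    for _ in range(n_iter):
        weights = [w * s for w, s in zip(weights, scores(weights, Z))]
    return weights

# Hypothetical dictionary of three densities on [0, 1], each bounded away from zero.
dictionary = [lambda x: 1.0, lambda x: 0.5 + x, lambda x: 0.5 + 1.5 * x * x]
n = 200
sample = [math.sqrt((i + 0.5) / n) for i in range(n)]  # deterministic toy sample
Z = [[f(x) for f in dictionary] for x in sample]

pi_hat = em_weights(Z)
print([round(w, 4) for w in pi_hat])
print([round(s, 4) for s in scores(pi_hat, Z)])  # KKT: close to 1 on the support, at most 1 elsewhere
```

Each EM step keeps the weights on the simplex and does not increase $L_{n}$ , so the printed scores give a simple diagnostic of how close the iterate is to satisfying the optimality conditions.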

The core of our results lies in the following proposition, which bounds the deviations of the empirical process part.

Proposition 5.1 (Supremum of Empirical Process).

For any $\bm{\pi}\in{\mathbb{B}}^{K}_{+}$ and $\bm{x}\in\mathcal{X}$ , define $\varphi(\pi,\bm{x})=\int(\log f_{\bm{\pi}})f^{*}-\log f_{\bm{\pi}}(\bm{x})$ and consider $\Phi_{n}(\bm{\pi})=\frac{1}{n}\sum_{i=1}^{n}\varphi(\bm{\pi},\bm{X}_{i})$ . If $K\geq 2$ , then for any $\delta\in(0,1)$ , with probability at least $1-\delta$ , we have

[TABLE]

Furthermore, we have $\mathbf{E}[\zeta_{n}]\leq 4V^{3}\big{(}\frac{2\log(2K^{2})}{n}\big{)}^{\nicefrac{{1}}{{2}}}$ and ${\bf Var}[\zeta_{n}]\leq V^{2}/(2n)$ .

Proof.

To ease notation, let us denote $g_{\bm{\pi},l}(x)=\frac{f_{l}(x)}{f_{\bm{\pi}}(x)}-\mathbf{E}\big{[}\frac{f_{l}(\bm{X})}{f_{\bm{\pi}}(\bm{X})}\big{]}$ and

[TABLE]

where $\mathbf{X}=(\bm{X}_{1},\dots,\bm{X}_{n})$ . To derive a bound on $F$ , we will use the McDiarmid concentration inequality that requires the bounded difference condition to hold for $F$ . For some $i_{0}\in[n]$ , let $\mathbf{X}^{\prime}=(\bm{X}_{1},\dots,\bm{X}^{\prime}_{i_{0}},\dots,\bm{X}_{n})$ be a new sample obtained from $\mathbf{X}$ by modifying the $i_{0}$ -th element $\bm{X}_{i_{0}}$ and by leaving all the others unchanged. Then, we have

[TABLE]

where the last inequality is a direct consequence of assumption (12). Therefore, using the McDiarmid concentration inequality recalled in Theorem 6.3 below, we check that the inequality

[TABLE]

holds with probability at least $1-\delta$ . Furthermore, in view of the Efron-Stein inequality, we have

[TABLE]

Let us denote $\mathcal{G}:=\{(f_{l}/f_{\bm{\pi}})-1,(\bm{\pi},l)\in\Pi_{n}\times[K]\}$ and $\mathfrak{R}_{n,q}(\mathcal{G})$ the Rademacher complexity of $\mathcal{G}$ given by

[TABLE]

with $\epsilon_{1},\dots,\epsilon_{n}$ independent and identically distributed Rademacher random variables independent of $\bm{X}_{1},\dots,\bm{X}_{n}$ . Using the symmetrization inequality (see, for instance, Theorem 2.1 in Koltchinskii (2011)) we have

[TABLE]

Lemma 5.3.

The Rademacher complexity defined in (93) satisfies

[TABLE]

Proof.

The proof relies on the contraction principle of Ledoux and Talagrand (1991) that we recall, for convenience, in Appendix A: Concentration inequalities. We apply this principle to the random variables $X_{i,(\bm{\pi},l)}=f_{\bm{\pi}}(\bm{X}_{i})/f_{l}(\bm{X}_{i})-1$ and to the function $\psi(x)=(1+x)^{-1}-1$ . Clearly $\psi$ is Lipschitz on $[\frac{1}{V}-1,V-1]$ with the Lipschitz constant equal to $V^{2}$ and $\psi(0)=0$ . Therefore

[TABLE]

Expanding $f_{\bm{\pi}}(\bm{X}_{i})$ we obtain

[TABLE]

We apply now Theorem 6.2 with $s=(k,l)$ , $N=K^{2}$ , $a=-V$ , $b=V$ and $Y_{i,s}=\epsilon_{i}\big{(}\frac{f_{k}(\bm{X}_{i})}{f_{l}(\bm{X}_{i})}-1\big{)}$ . This yields

[TABLE]

This completes the proof of the lemma. ∎
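The Lipschitz constant of $\psi$ used in the proof of Lemma 5.3 can be verified directly. Under the boundedness assumption (12) with $V=M/m$ , the ratio $f_{\bm{\pi}}(\bm{X}_{i})/f_{l}(\bm{X}_{i})$ lies in $[1/V,V]$ , so that $X_{i,(\bm{\pi},l)}\in[\frac{1}{V}-1,V-1]$ . On this interval, $1+x\in[1/V,V]$ and therefore

$$|\psi^{\prime}(x)|=\frac{1}{(1+x)^{2}}\leq V^{2},$$

with equality at the left endpoint $x=\frac{1}{V}-1$ .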

Combining inequalities (91), (94) and Lemma 5.3, we get that the inequality

[TABLE]

holds with probability at least $1-\delta$ . Noticing that $V\geq 1$ and, for $K\geq 2$ , $\delta\in(0,K^{-1/31})$ we have $8\sqrt{\log K}+\sqrt{(\nicefrac{{1}}{{2}}){\log(1/\delta)}}\leq 8\sqrt{\log(K/\delta)}$ , we get the first claim of the proposition. The second claim is a direct consequence of Lemma 5.3 and (94). ∎
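The numerical inequality invoked at the end of the proof can be checked on a grid. The sketch below is only a sanity check with hypothetical function names: it parametrizes $\delta=K^{-t}$ , so that the condition $\delta\in(0,K^{-1/31})$ corresponds to $t>1/31$ , and it also evaluates one value of $t$ below this threshold for comparison.

```python
import math

def inequality_holds(K, t):
    """Check 8*sqrt(log K) + sqrt(0.5*log(1/delta)) <= 8*sqrt(log(K/delta)) for delta = K**(-t)."""
    a = math.log(K)      # log K
    b = t * a            # log(1/delta)
    return 8 * math.sqrt(a) + math.sqrt(0.5 * b) <= 8 * math.sqrt(a + b)

grid_K = [2, 10, 10**3, 10**6]
grid_t = [1 / 31, 0.1, 1.0, 5.0]
print(all(inequality_holds(K, t) for K in grid_K for t in grid_t))
print(inequality_holds(10, 1 / 40))  # a value of t below 1/31
```

Squaring both sides shows that the inequality is equivalent to $128\log K\leq 4032.25\log(1/\delta)$ , which explains why the exponent $1/31$ suffices.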

6 Proof of the lower bound for nearly- $D$ -sparse aggregation

We prove the minimax lower bound for estimation in Kullback-Leibler risk using the following slightly adapted version of Theorem 2.5 from Tsybakov (2009). Throughout this section, we denote by $\lambda_{\min,\bm{\Sigma}}(k)$ and $\lambda_{\max,\bm{\Sigma}}(k)$ , respectively, the smallest and the largest eigenvalue of all $k\times k$ principal minors of the matrix $\bm{\Sigma}$ .

Theorem 6.1.

For some integer $L\geq 4$ assume that $\mathcal{H}_{\mathcal{F}}(\gamma,D)$ contains $L$ elements $f_{\bm{\pi}^{(1)}},\dots,f_{\bm{\pi}^{(L)}}$ satisfying the following two conditions.

(i) ${\rm KL}(f_{\bm{\pi}^{(j)}}||f_{\bm{\pi}^{(k)}})\geq 2s>0$ , for all pairs $(j,k)$ such that $1\leq j<k\leq L$ .

(ii) For product densities $f_{\ell}^{n}$ defined on $\mathcal{X}^{n}$ by $f_{\ell}^{n}(\bm{x}_{1},\ldots,\bm{x}_{n})=f_{\bm{\pi}^{(\ell)}}(\bm{x}_{1})\times\ldots\times f_{\bm{\pi}^{(\ell)}}(\bm{x}_{n})$ it holds

[TABLE]

Then

[TABLE]

To establish the bound claimed in Theorem 3.1, we will split the problem into two parts, corresponding to the following two subsets of $\mathcal{H}_{\mathcal{F}}(\gamma,D)$

[TABLE]

We will show that over $\mathcal{H}_{\mathcal{F}}(0,D)$ , we have a lower bound of order $(D/n)\log(1+K/D)$ while over $\mathcal{H}_{\mathcal{F}}(\gamma,1)$ , a lower bound of order $\big{[}\frac{\gamma^{2}}{n}\log\big{(}1+K/(\gamma\sqrt{n})\big{)}\big{]}^{\nicefrac{{1}}{{2}}}$ holds true. Therefore, the lower bound over $\mathcal{H}_{\mathcal{F}}(\gamma,D)$ is larger than the average of these bounds.
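The averaging step loses at most a factor of two: both subsets are contained in $\mathcal{H}_{\mathcal{F}}(\gamma,D)$ , so the worst case over the larger class is at least the maximum of the two lower bounds, and the maximum of two nonnegative numbers lies between their average and twice their average. The sketch below evaluates the two terms with all constants dropped. The first term follows the expression stated at the beginning of Section 6.1; the function names and the sample values of $n$ , $K$ , $D$ and $\gamma$ are hypothetical.

```python
import math

def sparse_term(n, K, D):
    """Order of the lower bound over H_F(0, D), as stated in Section 6.1 (constants dropped)."""
    return min((D / n) * math.log(1 + math.e * K / D),
               math.sqrt(math.log(1 + K / math.sqrt(n)) / n))

def l1_term(n, K, gamma):
    """Order of the lower bound over H_F(gamma, 1) (constants dropped)."""
    return math.sqrt(gamma**2 / n * math.log(1 + K / (gamma * math.sqrt(n))))

n, K, D, gamma = 1000, 500, 5, 0.1
a = sparse_term(n, K, D)
b = l1_term(n, K, gamma)
average = (a + b) / 2
print(a, b, average)
```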

For any $M\geq 1$ and $k\in[M-1]$ , let $\Omega_{k}^{M}$ be the subset of $\{0,1\}^{M}$ defined by

[TABLE]

Before starting, we recall here a version of the Varshamov-Gilbert lemma (see, for instance, (Rigollet and Tsybakov, 2011, Lemma 8.3)) which will be helpful for deriving our lower bounds.

Lemma 6.1.

Let $M\geq 4$ and $k\in[M/2]$ be two integers. Then there exist a subset $\Omega\subset\Omega_{k}^{M}$ and an absolute constant $C_{1}$ such that

[TABLE]

and $L=|\Omega|$ satisfies $L\geq 4$ and

[TABLE]
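To make the object in Lemma 6.1 concrete, the following sketch builds a well-separated family of binary vectors by a greedy search. It is an illustration only and rests on assumptions: it takes $\Omega_{k}^{M}$ to be the set of binary vectors of length $M$ with exactly $k$ ones, it treats the separation threshold `min_dist` as a free parameter, and the greedy search is not the argument behind the lemma, so it does not reproduce the guaranteed cardinality.

```python
from itertools import combinations

def greedy_packing(M, k, min_dist):
    """Greedily select supports of binary vectors of length M with exactly k ones
    whose pairwise Hamming distance is at least min_dist."""
    selected = []
    for support in combinations(range(M), k):
        s = set(support)
        # Two vectors with k ones each are at Hamming distance 2 * (k - overlap).
        if all(2 * (k - len(s & t)) >= min_dist for t in selected):
            selected.append(s)
    return selected

packing = greedy_packing(M=10, k=4, min_dist=4)
print(len(packing))
```

For binary vectors the Hamming distance coincides with the $\ell_{1}$ distance $\|\bm{\omega}-\bm{\omega}^{\prime}\|_{1}$ that appears in the proofs below.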

We will also use the following lemma that allows us to relate the KL-divergence ${\rm KL}(f_{\bm{\pi}}||f_{\bm{\pi}^{\prime}})$ to the Euclidean distance between the weight vectors $\bm{\pi}$ and $\bm{\pi}^{\prime}$ .

Lemma 6.2.

If the dictionary $\mathcal{F}$ satisfies the boundedness assumption (12), then for any $f_{\bm{\pi}},f_{\bm{\pi}^{\prime}}\in\mathcal{H}_{\mathcal{F}}(\gamma,D)$ we have

[TABLE]

Proof.

Using the Taylor expansion, one can check that for any $u\in[1/V,V]$ , we have $(1-u)+\frac{1}{2V^{2}}(u-1)^{2}\leq-\log u\leq(1-u)+\frac{V^{2}}{2}(u-1)^{2}$ . Therefore,

[TABLE]

Since $\mathcal{F}$ satisfies the boundedness assumption, we get

[TABLE]

The claim of the lemma follows from these inequalities and the fact that $\int_{\mathcal{X}}\big{(}f_{\bm{\pi}^{\prime}}-f_{\bm{\pi}}\big{)}^{2}d\nu=\|\bm{\Sigma}^{\nicefrac{{1}}{{2}}}(\bm{\pi}^{\prime}-\bm{\pi})\|_{2}^{2}$ . ∎
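The two-sided Taylor bound on $-\log u$ that drives this proof can be checked numerically. The sketch below is a sanity check on a grid, assuming that the bound is applied for $u\in[1/V,V]$ ; the value $V=3$ and the small floating-point tolerance are arbitrary choices.

```python
import math

def taylor_sandwich_holds(u, V, tol=1e-12):
    """Check (1-u) + (u-1)^2/(2V^2) <= -log(u) <= (1-u) + V^2 (u-1)^2 / 2."""
    lower = (1 - u) + (u - 1) ** 2 / (2 * V**2)
    upper = (1 - u) + V**2 * (u - 1) ** 2 / 2
    value = -math.log(u)
    return lower <= value + tol and value <= upper + tol

V = 3.0
grid = [1 / V + i * (V - 1 / V) / 1000 for i in range(1001)]
print(all(taylor_sandwich_holds(u, V) for u in grid))
```

The bound follows from the second-order Taylor formula with a remainder of the form $(u-1)^{2}/(2\xi^{2})$ for some $\xi$ between $1$ and $u$ , hence $\xi\in[1/V,V]$ .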

6.1 Lower bound on $\mathcal{H}_{\mathcal{F}}(0,D)$

We show here that the lower bound ${(\nicefrac{{D}}{{n}})\log(1+\nicefrac{{eK}}{{D}})}\wedge\big{(}(\nicefrac{{1}}{{n}}){\log(1+\nicefrac{{K}}{{\sqrt{n}}})}\big{)}^{\nicefrac{{1}}{{2}}}$ holds when we consider the worst case error for $f^{*}$ belonging to the set $\mathcal{H}_{\mathcal{F}}(0,D)$ .

Proposition 3.

If $\log(1+eK)\leq n$ then, for the constant

[TABLE]

we have

[TABLE]

Proof.

We assume that $D\leq\nicefrac{{K}}{{2}}$ . The case $D>\nicefrac{{K}}{{2}}$ can be reduced to the case $D=\nicefrac{{K}}{{2}}$ by using the inclusion $\mathcal{H}_{\mathcal{F}}(0,\nicefrac{{K}}{{2}})\subset\mathcal{H}_{\mathcal{F}}(0,D)$ . Let us set $A_{1}=4\vee{16V^{2}\lambda_{\max,\bm{\Sigma}}(2D)}/{(C_{1}m)}$ and denote by $d$ the largest integer such that

[TABLE]

According to Lemma 6.1, there exists a subset $\Omega=\{\bm{\omega}^{(\ell)}:\ell\in[L]\}$ of $\Omega_{d}^{K}$ of cardinality $L\geq 4$ satisfying $\log L\geq{C_{1}d}\log(1+{eK}/{d})$ such that for any pair of distinct elements $\bm{\omega}^{(\ell)}$ , $\bm{\omega}^{(\ell^{\prime})}\in\Omega$ we have $\|\bm{\omega}^{(\ell)}-\bm{\omega}^{(\ell^{\prime})}\|_{1}\geq d/4$ . Using these binary vectors $\bm{\omega}^{(\ell)}$ , we define the set $\mathcal{D}=\{\bm{\pi}^{(1)},\dots,\bm{\pi}^{(L)}\}\subset{\mathbb{B}}_{+}^{K}$ as follows:

[TABLE]

Clearly, for every $\varepsilon\in[0,1]$ , the vectors $\bm{\pi}^{(\ell)}$ belong to ${\mathbb{B}}^{K}_{+}$ . Furthermore, for any pair of distinct values $\ell,\ell^{\prime}\in[L]$ , we have $\|\bm{\pi}^{(\ell)}-\bm{\pi}^{(\ell^{\prime})}\|_{q}^{q}=(\varepsilon/d)^{q}\|\bm{\omega}^{(\ell)}-\bm{\omega}^{(\ell^{\prime})}\|_{1}\geq(\varepsilon/d)^{q}d/4$ . In view of Lemma 6.2, this yields

[TABLE]

Let us choose

[TABLE]

It follows from (114) that $\varepsilon\leq 1$ . Inserting this value of $\varepsilon$ in (116), we get

[TABLE]

This shows that condition (i) of Theorem 6.1 is satisfied with $s=C_{2}\,(\nicefrac{{d}}{{n}})\log(1+eK/d)$ . For the second condition of the same theorem, we have

[TABLE]

since one can check that $\|\bm{\pi}^{(\ell)}-\bm{\pi}^{(1)}\|_{2}^{2}\leq(\varepsilon/d)^{2}\|\bm{\omega}^{(\ell)}-\bm{\omega}^{(1)}\|_{1}\leq 2\varepsilon^{2}/d$ . Therefore, using the definition of $\varepsilon$ , we get

[TABLE]

Theorem 6.1 implies that

[TABLE]

We use the fact that $d$ is the largest integer satisfying (114). Therefore, either $d+1>D$ or

[TABLE]

If $d\geq D$ , then the claim of the proposition follows from (124), since $d\log(1+eK/d)\geq D\log(1+eK/D)$ . On the other hand, if (125) is true, then

[TABLE]

In addition, $d^{2}\log(1+eK/d)\leq A_{1}n$ implies that $(d+1)^{2}\leq A_{1}n$ . Combining the last two inequalities, we get the inequality $d\log(1+eK/d)\geq\nicefrac{{1}}{{2}}\big{(}{A_{1}n}{\log(1+eK/\sqrt{A_{1}n})}\big{)}^{\nicefrac{{1}}{{2}}}\geq\big{(}{n}{\log(1+eK/\sqrt{n})}\big{)}^{\nicefrac{{1}}{{2}}}$ . Therefore, in view of (124), we get the claim of the proposition. ∎

6.2 Lower bound on $\mathcal{H}_{\mathcal{F}}(\gamma,1)$

The next result shows that the lower bound $\big{[}\frac{\gamma^{2}}{n}\log\big{(}1+\frac{K}{\gamma\sqrt{n}}\big{)}\big{]}^{\nicefrac{{1}}{{2}}}$ holds for the worst case error when $f^{*}$ belongs to the set $\mathcal{H}_{\mathcal{F}}(\gamma,1)$ .

Proposition 4.

Assume that

[TABLE]

Then, for the constant $C_{3}=\frac{C_{1}m\bar{\kappa}_{\bm{\Sigma}}(2D,0)}{2^{12}V^{4}M\lambda_{\max,\bm{\Sigma}}(2D)}$ , it holds that

[TABLE]

Proof.

Let $C>2$ be a constant the precise value of which will be specified later. Denote by $d$ the largest integer satisfying

[TABLE]

Note that $d\geq 1$ in view of the condition $(\frac{\log(1+eK)}{n})^{\nicefrac{{1}}{{2}}}\leq 2\gamma$ of the proposition. This readily implies that $d\leq C\gamma\sqrt{n}$ and, therefore,

[TABLE]

Let us first consider the case $d\leq(K-1)/2$ . According to Lemma 6.1, there exists a subset $\Omega\subset\Omega_{d}^{K-1}$ of cardinality $L$ satisfying $\log L\geq C_{1}d\log\big{(}1+\frac{e(K-1)}{d}\big{)}$ and $\|\bm{\omega}-\bm{\omega}^{\prime}\|_{1}\geq d/4$ for any pair of distinct elements $\bm{\omega},\bm{\omega}^{\prime}$ taken from $\Omega$ . With these binary vectors in hand, we define the set $\mathcal{D}\subset{\mathbb{B}}_{+}^{K}$ of cardinality $L$ as follows:

[TABLE]

It is clear that all the vectors of $\mathcal{D}$ belong to $\mathcal{H}_{\mathcal{F}}(\gamma,1)$ . Let us fix now an element of $\mathcal{D}$ and denote it by $\bm{\pi}^{1}$ , the corresponding element of $\Omega$ being denoted by $\bm{\omega}^{1}$ . We have

[TABLE]

The definition of $d$ yields $(d+1)\sqrt{\log(1+eK/(d+1))}>C\gamma\sqrt{n}$ , which implies that

[TABLE]

Combined with (134), this implies that

[TABLE]

Choosing

[TABLE]

we get that $\max_{\bm{\pi}\in\mathcal{D}}{\rm KL}(f^{n}_{\bm{\pi}}||f^{n}_{\bm{\pi}^{1}})\leq\frac{1}{16}C_{1}{d\log\big{(}1+e(K-1)/d\big{)}}\leq\frac{\log L}{16}$ .

Furthermore, for any $\bm{\pi},\bm{\pi}^{\prime}\in\mathcal{D}$ , in view of Lemma 6.2 and (130), we have

[TABLE]

Since $\frac{\bar{\kappa}_{\bm{\Sigma}}(2d,0)}{32V^{2}MC^{2}}=2C_{3}$ , this implies that Theorem 6.1 can be applied, which leads to the inequality

[TABLE]

To complete the proof of the proposition, we have to consider the case $d>(K-1)/2$ . In this case, we can repeat all the previous arguments for $d=K/2$ and get the desired inequality. ∎
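The packing set $\Omega$ used in the proof above can be illustrated numerically. The sketch below is not the construction behind Lemma 6.1, which is an existence result. It is a simple greedy random search, under the assumption that $\Omega_{d}^{K-1}$ denotes the binary vectors of length $K-1$ with exactly $d$ ones. The function name `greedy_packing` and the parameter values are hypothetical and chosen only for illustration.

```python
import math
import random


def greedy_packing(length, d, min_dist, attempts=300, seed=0):
    """Greedily collect supports of binary vectors with exactly d ones.

    A candidate is kept only if its l1 (Hamming) distance to every vector
    already kept is at least min_dist. Each vector is stored as the set of
    indices where it equals 1.
    """
    rng = random.Random(seed)
    packing = []
    for _ in range(attempts):
        support = frozenset(rng.sample(range(length), d))
        # For 0/1 vectors, the l1 distance equals the size of the
        # symmetric difference of the supports.
        if all(len(support ^ other) >= min_dist for other in packing):
            packing.append(support)
    return packing


# Illustrative values (assumptions, not taken from the paper).
K, d = 40, 4
packing = greedy_packing(K - 1, d, min_dist=d / 4)

# Quantities that appear in the proof: log L and d * log(1 + e(K-1)/d).
log_L = math.log(len(packing))
reference = d * math.log(1 + math.e * (K - 1) / d)
print(f"L = {len(packing)}, log L = {log_L:.2f}, d log(1+e(K-1)/d) = {reference:.2f}")
```

The ratio between the two printed quantities gives a rough feel for the constant $C_{1}$ on a small instance. It says nothing about the value of $C_{1}$ in the lemma itself.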

6.3 Lower bound holding for all densities

Now that we have lower bounds in probability for $\mathcal{H}_{\mathcal{F}}(0,D)$ and $\mathcal{H}_{\mathcal{F}}(\gamma,1)$ , we can derive a lower bound in expectation for $\mathcal{H}_{\mathcal{F}}(\gamma,D)$ . In particular, to prove Theorem 3.1, we will use the inequality

[TABLE]

Proof of Theorem 3.1.

To ease notation, let us define

[TABLE]

We first consider the case where the dominating term is the first one, that is

[TABLE]

On the one hand, since $D\geq 1$ , we have

[TABLE]

On the other hand, using the inequality $\log(1+x)\leq x$ , we get

[TABLE]

Combining (144), (145) and (148), we get

[TABLE]

This implies that we can apply Proposition 4, which yields

[TABLE]

In view of (144), this implies that

[TABLE]

We now consider the second case, where the dominating term in the rate is the second one, that is

[TABLE]

In view of Proposition 3, we have

[TABLE]

In view of (152), we get

[TABLE]

Thus, we have proved that $\log(1+eK)\leq n$ implies that $\inf_{\widehat{f}}\sup_{f\in\mathcal{H}_{\mathcal{F}}(\gamma,D)}\mathbf{P}_{f}\big({\rm KL}(f||\widehat{f})\geq C_{4}\,r(n,K,\gamma,D)\big)\geq 0.17$ for some constant $C_{4}>0$, whatever the relation between $\gamma$ and $D$. The desired lower bound now follows from Markov's inequality $\mathbf{E}\big[{\rm KL}(f||\widehat{f})\big]\geq C_{4}\,r(n,K,\gamma,D)\,\mathbf{P}_{f}\big({\rm KL}(f||\widehat{f})\geq C_{4}\,r(n,K,\gamma,D)\big)$. ∎
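For completeness, the passage from the bound in probability to the bound in expectation can be written out as a short chain. This is a sketch of the final step only. The subscript in $\mathbf{E}_{f}$ is notation assumed here to indicate that the expectation is taken under $f$, and the display uses that the Kullback-Leibler divergence is nonnegative.

```latex
\begin{align*}
\inf_{\widehat{f}}\sup_{f\in\mathcal{H}_{\mathcal{F}}(\gamma,D)}
  \mathbf{E}_{f}\big[\mathrm{KL}(f\|\widehat{f})\big]
&\geq C_{4}\,r(n,K,\gamma,D)\,
  \inf_{\widehat{f}}\sup_{f\in\mathcal{H}_{\mathcal{F}}(\gamma,D)}
  \mathbf{P}_{f}\big(\mathrm{KL}(f\|\widehat{f})\geq C_{4}\,r(n,K,\gamma,D)\big)\\
&\geq 0.17\,C_{4}\,r(n,K,\gamma,D).
\end{align*}
```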

Appendix A: Concentration inequalities

This section recalls some well-known results in order to keep the paper self-contained.

Theorem 6.2.

For each $s=1,\ldots,N$ , let $Y_{1,s},\ldots,Y_{n,s}$ be $n$ independent and zero mean random variables such that for some real numbers $a,b$ we have $\mathbf{P}(Y_{i,s}\in[a,b])=1$ for all $i\in[n]$ and $s\in[N]$ . Then, we have

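$$\mathbf{E}\bigg[\max_{s\in[N]}\Big|\frac{1}{n}\sum_{i=1}^{n}Y_{i,s}\Big|\bigg]\leq(b-a)\Big(\frac{\log(2N)}{2n}\Big)^{1/2}.$$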

Proof.

We denote $Z_{s}=\frac{1}{n}\sum_{i=1}^{n}Y_{i,s}$ for $s=1,\ldots,N$ and $Z_{s}=-Z_{s-N}$ for $s=N+1,\ldots,2N$, so that $\max_{s\in[N]}|Z_{s}|=\max_{s\in[2N]}Z_{s}$. For every $s\in[2N]$, the logarithmic moment generating function $\psi_{s}(\lambda)=\log\mathbf{E}[e^{\lambda Z_{s}}]$ satisfies

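$$\psi_{s}(\lambda)=\sum_{i=1}^{n}\log\mathbf{E}\big[e^{\pm\lambda Y_{i,s}/n}\big]\leq\frac{\lambda^{2}(b-a)^{2}}{8n},\qquad\lambda\in{\mathbb{R}},$$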

where the last inequality is a consequence of the Hoeffding lemma (see, for instance, Lemma 2.2 in (Boucheron et al., 2013)). This means that $Z_{s}$ is sub-Gaussian with variance-factor $\nu={(b-a)^{2}}/{4n}$ . Therefore, Theorem 2.5 from (Boucheron et al., 2013) yields $\mathbf{E}[\max_{s}Z_{s}]\leq\sqrt{2\nu\log(2N)}$ , which completes the proof. ∎
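The bound $\sqrt{2\nu\log(2N)}$ with $\nu=(b-a)^{2}/(4n)$ obtained in the proof can be checked by simulation. The following is a minimal sketch, assuming Rademacher variables $Y_{i,s}\in\{-1,+1\}$ (so $a=-1$, $b=1$) that are independent across both $i$ and $s$. The theorem only requires independence in $i$, so this is one convenient special case. The function names and the values of $n$, $N$ and the number of trials are hypothetical.

```python
import math
import random


def hoeffding_max_bound(a, b, n, N):
    """Return sqrt(2 * nu * log(2N)) with nu = (b - a)^2 / (4n)."""
    nu = (b - a) ** 2 / (4 * n)
    return math.sqrt(2 * nu * math.log(2 * N))


def simulate_expected_max(n, N, trials, seed=0):
    """Monte Carlo estimate of E[max_s |(1/n) sum_i Y_{i,s}|].

    The Y_{i,s} are simulated as independent Rademacher signs.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        worst = 0.0
        for _ in range(N):
            mean = sum(rng.choice((-1.0, 1.0)) for _ in range(n)) / n
            worst = max(worst, abs(mean))
        total += worst
    return total / trials


n, N = 100, 20
estimate = simulate_expected_max(n, N, trials=200)
bound = hoeffding_max_bound(-1.0, 1.0, n, N)
print(f"Monte Carlo estimate: {estimate:.4f}, theoretical bound: {bound:.4f}")
```

In this setting the estimate should fall below the bound, with a gap that reflects the slack in the sub-Gaussian maximal inequality for moderate $N$.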

We group and state together the bounded differences and the Efron-Stein inequalities (Boucheron et al. (2013), Theorems 6.2 and 3.1, respectively).

Theorem 6.3.

Assume that a function $f$ satisfies the bounded difference condition: there exist constants $c_{i}$, $i=1,\ldots,n$, such that for all $i=1,\ldots,n$ and all $X=(X_{1},\dots,X_{i},\dots,X_{n})$ and $X^{\prime}=(X_{1},\dots,X^{\prime}_{i},\dots,X_{n})$ that differ only in the $i$-th coordinate,

$$|f(X)-f(X^{\prime})|\leq c_{i}.$$

Denote

[TABLE]

Let $Z=f(X_{1},\dots,X_{n})$ where $X_{i}$ are independent. Then, for every $\delta\in(0,1)$ ,

[TABLE]

Next we state the contraction principle of (Ledoux and Talagrand, 1991); a proof can be found in (Boucheron et al. (2013), Theorem 11.6).

Theorem 6.4.

Let $x_{1},\dots,x_{n}$ be vectors whose real-valued components are indexed by $\bm{\mathcal{T}}$ , that is, $x_{i}=(x_{i,s})_{s\in\bm{\mathcal{T}}}$ . For each $i=1,\dots,n$ let $\varphi_{i}:{\mathbb{R}}\rightarrow{\mathbb{R}}$ be a $1$ -Lipschitz function such that $\varphi_{i}(0)=0$ . Let $\epsilon_{1},\dots,\epsilon_{n}$ be independent Rademacher random variables, and let $\Psi:[0,\infty)\rightarrow{\mathbb{R}}$ be a non-decreasing convex function. Then

[TABLE]

Acknowledgments

The work of M.S. was partially supported by the French “Agence Nationale de la Recherche”, CIFRE no 2014/0517, and by ARTEFACT (www.artefact.is). The work of A.D. was partially supported by the grant Investissements d’Avenir (ANR-11-IDEX-0003/Labex Ecodec/ANR-11-LABX-0047) and the chair “LCL/GENES/Fondation du risque, Nouveaux enjeux pour nouvelles données”.

Bibliography


Bellec [2014] P. C. Bellec. Optimal exponential bounds for aggregation of density estimators. Technical report, arXiv:1405.3907, May 2014.
Bellec et al. [2016] P. C. Bellec, A. S. Dalalyan, E. Grappin, and Q. Paris. On the prediction loss of the lasso in the partially labeled setting. Technical report, arXiv:1606.06179, June 2016.
Bertin et al. [2011] K. Bertin, E. Le Pennec, and V. Rivoirard. Adaptive Dantzig density estimation. Ann. Inst. Henri Poincaré Probab. Stat., 47(1):43–74, 2011.
Boucheron et al. [2013] S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. OUP Oxford, 2013. ISBN 9780199535255.
Bunea et al. [2007] F. Bunea, A. B. Tsybakov, and M. H. Wegkamp. Aggregation for Gaussian regression. Ann. Statist., 35(4):1674–1697, 2007.
Bunea et al. [2010] F. Bunea, A. B. Tsybakov, M. H. Wegkamp, and A. Barbu. Spades and mixture models. Ann. Statist., 38(4):2525–2558, 2010.
Butucea et al. [2016] C. Butucea, J.-F. Delmas, A. Dutfoy, and R. Fischer. Optimal exponential bounds for aggregation of estimators for the Kullback-Leibler loss. Technical report, arXiv:1601.05686, January 2016.
Catoni [1997] O. Catoni. The mixture approach to universal model selection. Technical report, 1997. URL http://cds.cern.ch/record/461892.
