Discrete minimax estimation with trees

Luc Devroye; Tommy Reddad

arXiv:1812.06063·math.ST·July 1, 2019

TL;DR

This paper introduces a simple recursive, data-based partitioning scheme for density estimation and uses it to determine the optimal $L_1$ minimax rate for the classes of discrete non-increasing densities and discrete non-increasing convex densities.

Contribution

The paper presents a novel recursive partitioning scheme that effectively estimates densities and attains optimal minimax rates in discrete settings.

Findings

01. Achieves optimal $L_1$ minimax rates for specific classes.

02. Provides a simple, recursive data-based partitioning approach.

03. Demonstrates effectiveness in discrete nonparametric density estimation.

Abstract

We propose a simple recursive data-based partitioning scheme which produces piecewise-constant or piecewise-linear density estimates on intervals, and show how this scheme can determine the optimal $L_{1}$ minimax rate for some discrete nonparametric classes.

Equations

$$\mathbf{P}\{X\in A\}=\int_{A}f\,\mathrm{d}\mu,\quad\text{for each }A\in\Sigma.$$

$$\mathbf{P}\{X\in A\}=\sum_{x\in A}f(x),\quad\text{for each }A\in\mathcal{P}(\mathcal{X}).$$

$$\hat{f}_{n}\colon\mathcal{X}^{n}\to\mathbb{R}^{\mathcal{X}},$$

$$\operatorname{TV}(\mu,\nu)=\sup_{A\in\Sigma}|\mu(A)-\nu(A)|.\tag{2.1}$$

$$\begin{aligned}\operatorname{TV}(\mu,\nu)&=\|f-g\|_{1}/2&&\text{(2.2)}\\ &=\inf_{(X,Y)\colon X\sim f,\,Y\sim g}\mathbf{P}\{X\neq Y\},&&\text{(2.3)}\end{aligned}$$

$$\|h\|_{1}=\sum_{x\in\mathcal{X}}|h(x)|.$$

$$\operatorname{TV}(f,g)=\|f-g\|_{1}/2.$$

$$\mathcal{R}_{n}(\hat{f}_{n},\mathcal{F})=\sup_{f\in\mathcal{F}}\mathbf{E}\bigl\{\operatorname{TV}(\hat{f}_{n}(X_{1},\dots,X_{n}),f)\bigr\},$$

$$\mathcal{R}_{n}(\mathcal{F})=\inf_{\hat{f}_{n}}\mathcal{R}_{n}(\hat{f}_{n},\mathcal{F}).$$

$$f(x+1)\leq f(x),\quad\text{for all }x\in\{1,\dots,k-1\}.\tag{2.4}$$

$$f(n,k)=\begin{cases}\sqrt{k/n}&\text{if }2\leq k<2n^{1/3},\\ \left(\frac{\log_{2}(k/n^{1/3})}{n}\right)^{1/3}&\text{if }2n^{1/3}\leq k<n^{1/3}2^{n},\\ 1&\text{if }n^{1/3}2^{n}\leq k.\end{cases}$$

$$\frac{1}{C}\leq\frac{\mathcal{R}_{n}(\mathcal{F}_{k})}{f(n,k)}\leq C.$$

$$f(x)-2f(x+1)+f(x+2)\geq 0,\quad\text{for all }x\in\{1,\dots,k-2\}.$$

$$g(n,k)=\begin{cases}\sqrt{k/n}&\text{if }2\leq k<3n^{1/5},\\ \left(\frac{\log_{3}(k/n^{1/5})}{n}\right)^{2/5}&\text{if }3n^{1/5}\leq k<n^{1/5}3^{n},\\ 1&\text{if }n^{1/5}3^{n}\leq k.\end{cases}$$

$$\frac{1}{C}\leq\frac{\mathcal{R}_{n}(\mathcal{G}_{k})}{g(n,k)}\leq C.$$

$$\mathcal{A}_{\Theta}=\Bigl\{\{x\in\mathcal{X}\colon f_{n,\theta}(x)>f_{n,\theta^{\prime}}(x)\}\colon\theta\neq\theta^{\prime}\in\Theta\Bigr\}.$$

$$\mu_{n}(A)=\frac{1}{n}\sum_{i=1}^{n}\mathbf{1}\{X_{i}\in A\},\quad\text{for }A\in\mathcal{A}_{\Theta}.$$

$$\operatorname{TV}(\psi_{n},f)\leq 3\inf_{\theta\in\Theta}\operatorname{TV}(f_{n,\theta},f)+2\sup_{A\in\mathcal{A}_{\Theta}}|\mu_{n}(A)-\mu(A)|+\frac{3}{2n}.$$

$$2^{\lfloor k/2\rfloor}\leq|\mathcal{A}|\leq 2^{k},$$

$$\sup_{f\in\mathcal{F}}\mathbf{E}\Bigl\{\sup_{A\in\mathcal{A}}|\mu_{n}(A)-\mu(A)|\Bigr\}\leq c\sqrt{\frac{\operatorname{VC}(\mathcal{A})}{n}}.$$

$$\mathcal{R}_{n}(\mathcal{F})\leq 3\sup_{f\in\mathcal{F}}\inf_{\theta\in\Theta}\operatorname{TV}(f_{n,\theta},f)+c\sqrt{\frac{\operatorname{VC}(\mathcal{A}_{\Theta})}{n}}+\frac{3}{2n}.$$

$$I_{u}=\{a_{u},a_{u}+1,\dots,a_{u}+|I_{u}|-1\},$$

$$I_{v}=\{a_{u},\dots,a_{u}+|I_{u}|/2-1\}\quad\text{and}\quad I_{w}=\{a_{u}+|I_{u}|/2,\dots,a_{u}+|I_{u}|-1\}$$

$$|N_{v}-N_{w}|>\sqrt{N_{v}+N_{w}},\tag{3.1}$$

$$N_{z}=\sum_{i=1}^{n}\mathbf{1}\{X_{i}\in I_{z}\},\quad\text{for }z\in\{v,w\}.$$

$$\hat{f}_{n}(x)=\frac{N_{u}}{n|I_{u}|},\quad\text{if }x\in I_{u},\ u\in\widehat{L}.$$

$$\frac{y_{v}|I_{v}|+y_{w}|I_{w}|}{|I_{v}|+|I_{w}|},$$

$$\operatorname{TV}(\hat{f}_{n}^{\prime},f)\leq\operatorname{TV}(\hat{f}_{n},f).$$

$$f_{z}=\sum_{x\in I_{z}}f(x).$$

Full text

Luc Devroye and Tommy Reddad

School of Computer Science, McGill University, 3480 University Street, Montréal, Québec, Canada H3A 2A7

MSC subject classification: 60G07.

Keywords: density estimation, minimax theory, discrete probability distribution, Vapnik-Chervonenkis dimension, monotone density, convex density, histogram.

Supported by NSERC Grant A3456 and by NSERC PGS D scholarship 396164433.

1 Introduction

Density estimation or distribution learning refers to the problem of estimating the unknown probability density function of a common source of independent sample observations. In any interesting case, we know that the unknown source density may come from a known class. In the parametric case, each density in this class can be specified using a bounded number of real parameters, e.g., the class of all Gaussian densities with any mean and any variance. The remaining cases are called nonparametric. Examples of nonparametric classes include bounded monotone densities on $[0,1]$ , $L$ -Lipschitz densities for a given constant $L>0$ , and log-concave densities, to name a few. By minimax estimation, we mean density estimation in the minimax sense, i.e., we are interested in the existence of a density estimate which minimizes its approximation error, even in the worst case.

There is a long line of work in the statistics literature about density estimation, and a growing interest coming from the theoretical computer science and machine learning communities; for a selection of new and old books on this topic, see [13, 14, 16, 22, 32, 33]. The study of nonparametric density estimation began as early as in the 1950’s, when Grenander [20] described and studied properties of the maximum likelihood estimate of an unknown density taken from the class of bounded monotone densities on $[0,1]$ . Grenander’s estimator and this class received much further treatment over the years, in particular by Prakasa Rao [29], Groeneboom [21], and Birgé [7, 8, 9], who identified the optimal $L_{1}$ -error minimax rate up to a constant factor, and also gave an efficient adaptive estimator which worked even when the boundedness parameter was unknown. Since then, countless more nonparametric classes have been studied, and many different all-purpose methods have been developed to obtain minimax results about these classes: for the construction of density estimates, see e.g., the maximum likelihood estimate, skeleton estimates, kernel estimates, and wavelet estimates, to name a few; and for minimax rate lower bounds, see e.g., the methods of Assouad, Fano, and Le Cam [13, 14, 16, 35]. See [5, 10, 18, 24] for recent related works in nonparametric shape-constrained regression.

One very popular style of density estimate is the histogram, in which the support of the random data is partitioned into bins, where each bin receives a weight proportional to the number of data points contained within, and such that the estimate is constant with the given weight along each bin. Then, the selection of the bins themselves becomes critical in the construction of a good histogram estimate. Birgé [8] showed how histograms with carefully chosen exponentially increasing bin sizes will have $L_{1}$ -error within a constant factor of the optimal minimax rate for the class of bounded non-increasing densities on $[0,1]$ . In general, the right choice of an underlying partition for a histogram estimate is not obvious.

In this work, we devise a recursive data-based approach for determining the partition of the support for a histogram estimate of discrete non-increasing densities. We also use a similar approach to build a piecewise-linear estimator for discrete non-increasing convex densities; see Anevski [1], Jongbloed [26], and Groeneboom, Jongbloed, and Wellner [23] for works concerning the maximum likelihood and minimax estimation of continuous non-increasing convex densities. Both of our estimators are minimax-optimal, i.e., their minimax $L_{1}$-error is within a constant factor of the optimal rate. Recursive data-based partitioning schemes have been extremely popular in density estimation since the 1970’s with Gessaman [19], Chen and Zhao [11], Lugosi and Nobel [28], and countless others, with great interest coming from the machine learning and pattern recognition communities [15]. Still, it seems that most of the literature involving recursive data-based partitions is not especially concerned with the rate of convergence of density estimates, but rather with other properties such as consistency under different recursive schemes. Moreover, most of the density estimation literature is concerned with the estimation of continuous probability distributions. In discrete density estimation, not all of the constructions or methods used to develop arguments for analogous continuous classes will neatly apply, and in some cases, there are discrete phenomena that call for a different approach. See Jankowski and Wellner [25] for a recent treatment of the properties of a variety of estimators of discrete non-increasing densities.

2 Preliminaries and summary

Let $\mathcal{F}$ be a given class of probability densities with respect to a base measure $\mu$ on the measurable space $(\mathcal{X},\Sigma)$ , and let $f\in\mathcal{F}$ . If $X$ is a random variable taking values in $(\mathcal{X},\Sigma)$ , we write $X\sim f$ to mean that

$$\mathbf{P}\{X\in A\}=\int_{A}f\,\mathrm{d}\mu,\quad\text{for each }A\in\Sigma.$$

The notation $X_{1},\dots,X_{n}\overset{\text{i.i.d.}}{\sim}f$ means that $X_{i}\sim f$ for each $1\leq i\leq n$, and that $X_{1},\dots,X_{n}$ are mutually independent.

Typically in density estimation, either $\mathcal{X}=\mathbb{R}^{d}$ , $\Sigma$ is the Borel $\sigma$ -algebra, and $\mu$ is the Lebesgue measure, or $\mathcal{X}$ is countable, $\Sigma=\mathcal{P}(\mathcal{X})$ , and $\mu$ is the counting measure. The former case is referred to as the continuous setting, and the latter case as the discrete setting, where $f$ is more often called a probability mass function in the literature. Throughout this paper, we will only be concerned with the discrete setting, and even so, we still refer to $\mathcal{F}$ as a class of densities, and $f$ as a density itself. Plainly, in this case, $X\sim f$ signifies that

$$\mathbf{P}\{X\in A\}=\sum_{x\in A}f(x),\quad\text{for each }A\in\mathcal{P}(\mathcal{X}).$$

Let $f\in\mathcal{F}$ be unknown. Given the $n$ samples $X_{1},\dots,X_{n}\overset{\text{i.i.d.}}{\sim}f$, our goal is to create a density estimate

$$\hat{f}_{n}\colon\mathcal{X}^{n}\to\mathbb{R}^{\mathcal{X}},$$

such that the probability measures corresponding to $f$ and $\hat{f}_{n}(X_{1},\dots,X_{n})$ are close in total variation (TV) distance, where for any probability measures $\mu,\nu$ , their TV-distance is defined as

$$\operatorname{TV}(\mu,\nu)=\sup_{A\in\Sigma}|\mu(A)-\nu(A)|.\tag{2.1}$$

The TV-distance has several equivalent definitions; importantly, if $\mu$ and $\nu$ are probability measures with corresponding densities $f$ and $g$ , then

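$$\begin{aligned}\operatorname{TV}(\mu,\nu)&=\|f-g\|_{1}/2&&\text{(2.2)}\\ &=\inf_{(X,Y)\colon X\sim f,\,Y\sim g}\mathbf{P}\{X\neq Y\},&&\text{(2.3)}\end{aligned}$$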

where for any function $h\colon\mathcal{X}\to\mathbb{R}$ , we define the $L_{1}$ -norm of $h$ as

$$\|h\|_{1}=\sum_{x\in\mathcal{X}}|h(x)|.$$

(In the continuous case, this sum is simply replaced with an integral.) In view of the relation between TV-distance and $L_{1}$ -norm in (2.2), we will abuse notation and write

$$\operatorname{TV}(f,g)=\|f-g\|_{1}/2.$$
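The relation between the TV-distance and the $L_{1}$-norm in (2.2) can be checked numerically on a tiny support. The following is a minimal sketch, not part of the paper: the function names `tv_distance` and `tv_distance_sup` are illustrative, densities are assumed to be given as lists of probabilities on a common finite support, and the second function enumerates all subsets, so it is only usable for very small supports.

```python
from itertools import combinations


def l1_norm(h):
    """L1 norm of a function on a finite set, given as a list of its values."""
    return sum(abs(v) for v in h)


def tv_distance(f, g):
    """TV distance between two pmfs on the same support, via ||f - g||_1 / 2."""
    return l1_norm([a - b for a, b in zip(f, g)]) / 2


def tv_distance_sup(f, g):
    """TV distance from definition (2.1): sup over all subsets A of
    |mu(A) - nu(A)|. Exponential in the support size."""
    k = len(f)
    best = 0.0
    for size in range(k + 1):
        for subset in combinations(range(k), size):
            mu_a = sum(f[i] for i in subset)
            nu_a = sum(g[i] for i in subset)
            best = max(best, abs(mu_a - nu_a))
    return best


f = [0.5, 0.3, 0.2]
g = [0.2, 0.3, 0.5]
print(tv_distance(f, g), tv_distance_sup(f, g))
```

Both functions should agree up to floating-point error; the supremum in (2.1) is attained on the set where $f>g$.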

There are various possible measures of dissimilarity between probability distributions which can be considered in density estimation, e.g., the Hellinger distance, Wasserstein distance, $L_{p}$ -distance, $\chi^{2}$ -divergence, Kullback-Leibler divergence, or any number of other divergences; see Sason and Verdú [31] for a survey on many such functions and the relations between them. Here, we focus on the TV-distance due to its several appealing properties, such as being a metric, enjoying the natural probabilistic interpretation of (2.1), and having the coupling characterization (2.3).

If $\hat{f}_{n}$ is a density estimate, we define the risk of the estimator $\hat{f}_{n}$ with respect to the class $\mathcal{F}$ as

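$$\mathcal{R}_{n}(\hat{f}_{n},\mathcal{F})=\sup_{f\in\mathcal{F}}\mathbf{E}\bigl\{\operatorname{TV}(\hat{f}_{n}(X_{1},\dots,X_{n}),f)\bigr\},$$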

where the expectation is over the $n$ i.i.d. samples from $f$ , and possible randomization of the estimator. From now on we will omit the dependence of $\hat{f}_{n}$ on $X_{1},\dots,X_{n}$ unless it is not obvious. The minimax risk or minimax rate for $\mathcal{F}$ is the smallest risk over all possible density estimates,

$$\mathcal{R}_{n}(\mathcal{F})=\inf_{\hat{f}_{n}}\mathcal{R}_{n}(\hat{f}_{n},\mathcal{F}).$$

We can now state our results precisely. Let $k\in\mathbb{N}$ and let $\mathcal{F}_{k}$ be the class of non-increasing densities on $\{1,\dots,k\}$, i.e., the set of all probability vectors $f\colon\{1,\dots,k\}\to\mathbb{R}$ for which

$$f(x+1)\leq f(x),\quad\text{for all }x\in\{1,\dots,k-1\}.\tag{2.4}$$

Theorem 2.1.

Let $f\colon\mathbb{N}\times\mathbb{N}\to\mathbb{R}$ be

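$$f(n,k)=\begin{cases}\sqrt{k/n}&\text{if }2\leq k<2n^{1/3},\\ \left(\frac{\log_{2}(k/n^{1/3})}{n}\right)^{1/3}&\text{if }2n^{1/3}\leq k<n^{1/3}2^{n},\\ 1&\text{if }n^{1/3}2^{n}\leq k.\end{cases}$$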

There is a universal constant $C\geq 1$ such that, for sufficiently large $n$ not depending on $k$ ,

$$\frac{1}{C}\leq\frac{\mathcal{R}_{n}(\mathcal{F}_{k})}{f(n,k)}\leq C.$$

Let $\mathcal{G}_{k}$ be the class of all non-increasing convex densities on $\{1,\dots,k\}$ , so each $f\in\mathcal{G}_{k}$ satisfies (2.4) and

$$f(x)-2f(x+1)+f(x+2)\geq 0,\quad\text{for all }x\in\{1,\dots,k-2\}.$$

Theorem 2.2.

Let $g\colon\mathbb{N}\times\mathbb{N}\to\mathbb{R}$ be

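$$g(n,k)=\begin{cases}\sqrt{k/n}&\text{if }2\leq k<3n^{1/5},\\ \left(\frac{\log_{3}(k/n^{1/5})}{n}\right)^{2/5}&\text{if }3n^{1/5}\leq k<n^{1/5}3^{n},\\ 1&\text{if }n^{1/5}3^{n}\leq k.\end{cases}$$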

There is a universal constant $C\geq 1$ such that, for sufficiently large $n$ not depending on $k$ ,

$$\frac{1}{C}\leq\frac{\mathcal{R}_{n}(\mathcal{G}_{k})}{g(n,k)}\leq C.$$

We emphasize here that the above results give upper and lower bounds on the minimax rates $\mathcal{R}_{n}(\mathcal{F}_{k})$ and $\mathcal{R}_{n}(\mathcal{G}_{k})$ which are within universal constant factors of one another, for the entire range of $k$ .
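To get a feel for the three regimes in Theorems 2.1 and 2.2, the rate functions $f(n,k)$ and $g(n,k)$ can be evaluated directly. The sketch below is only an illustration: the names `rate_monotone` and `rate_convex` are not from the paper, and the third-regime test is done on a logarithmic scale (comparing $\log(k/n^{1/3})$ with $n$) to avoid forming $2^{n}$ or $3^{n}$. Values exactly at a regime boundary may be affected by floating-point rounding.

```python
import math


def rate_monotone(n, k):
    """f(n, k) from Theorem 2.1 (non-increasing densities), for k >= 2."""
    if k < 2:
        raise ValueError("k must be at least 2")
    if k < 2 * n ** (1 / 3):
        return math.sqrt(k / n)
    # log2(k / n^(1/3)); the third regime is exactly log_ratio >= n
    log_ratio = math.log2(k) - math.log2(n) / 3
    if log_ratio < n:
        return (log_ratio / n) ** (1 / 3)
    return 1.0


def rate_convex(n, k):
    """g(n, k) from Theorem 2.2 (non-increasing convex densities), for k >= 2."""
    if k < 2:
        raise ValueError("k must be at least 2")
    if k < 3 * n ** (1 / 5):
        return math.sqrt(k / n)
    # log3(k / n^(1/5)); the third regime is exactly log_ratio >= n
    log_ratio = (math.log(k) - math.log(n) / 5) / math.log(3)
    if log_ratio < n:
        return (log_ratio / n) ** (2 / 5)
    return 1.0


for k in (10, 100, 10**6):
    print(k, rate_monotone(1000, k), rate_convex(1000, k))
```

For fixed $n$, the rate grows like $\sqrt{k/n}$ only for small $k$; past the threshold it depends on $k$ only logarithmically, which is the discrete phenomenon the theorems capture.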

Our upper bounds will crucially rely on the next results, which allow us to relate the minimax rate of a class to an old and well-studied combinatorial quantity called the Vapnik-Chervonenkis (VC) dimension [34]: For $\mathcal{A}\subseteq\mathcal{P}(\mathcal{X})$ a family of subsets of $\mathcal{X}$ , the VC-dimension of $\mathcal{A}$ , denoted by $\operatorname{VC}(\mathcal{A})$ , is the size of the largest set $X\subseteq\mathcal{X}$ such that for every $Y\subseteq X$ , there exists $B\in\mathcal{A}$ such that $X\cap B=Y$ . See, e.g., the book of Devroye and Lugosi [16] for examples and applications of the VC-dimension in the study of density estimation.
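The definition of the VC-dimension can be made concrete with a brute-force computation on a small ground set. This is a sketch for illustration only, with hypothetical function names `is_shattered` and `vc_dimension`; it enumerates all subsets and is therefore exponential in the size of the ground set. The example family is the class of all intervals in $\{1,\dots,5\}$ (including the empty set), i.e., unions of at most one interval.

```python
from itertools import combinations


def is_shattered(points, family):
    """True if every subset of `points` arises as points ∩ B for some B in family."""
    point_set = frozenset(points)
    traces = {point_set & member for member in family}
    return len(traces) == 2 ** len(point_set)


def vc_dimension(ground, family):
    """Size of the largest subset of `ground` shattered by `family`."""
    family = [frozenset(member) for member in family]
    dimension = 0
    for size in range(1, len(ground) + 1):
        if any(is_shattered(s, family) for s in combinations(ground, size)):
            dimension = size
        else:
            # Subsets of shattered sets are shattered, so we can stop here.
            break
    return dimension


ground = list(range(1, 6))
# All intervals {a, ..., b - 1} with 1 <= a <= b <= 6; a == b gives the empty set.
intervals = [set(range(a, b)) for a in range(1, 7) for b in range(a, 7)]
print(vc_dimension(ground, intervals))
```

No three points $x<y<z$ can be shattered by intervals, since no interval contains $x$ and $z$ but not $y$, while any two points can be.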

Theorem 2.3 (Devroye, Lugosi [16]).

Let $\mathcal{F}$ be a class of densities supported on $\mathcal{X}$ , and let $\mathcal{F}_{\Theta}=\{f_{n,\theta}\colon\theta\in\Theta\}$ be a class of density estimates satisfying $\sum_{x\in\mathcal{X}}f_{n,\theta}(x)=1$ for every $\theta\in\Theta$ . Let $\mathcal{A}_{\Theta}$ be the Yatracos class of $\mathcal{F}_{\Theta}$ ,

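$$\mathcal{A}_{\Theta}=\Bigl\{\{x\in\mathcal{X}\colon f_{n,\theta}(x)>f_{n,\theta^{\prime}}(x)\}\colon\theta\neq\theta^{\prime}\in\Theta\Bigr\}.$$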

For $f\in\mathcal{F}$, let $\mu$ be the probability measure corresponding to $f$. Let also $\mu_{n}$ be the empirical measure based on $X_{1},\dots,X_{n}\overset{\text{i.i.d.}}{\sim}f$, where

$$\mu_{n}(A)=\frac{1}{n}\sum_{i=1}^{n}\mathbf{1}\{X_{i}\in A\},\quad\text{for }A\in\mathcal{A}_{\Theta}.$$

Then, there is an estimate $\psi_{n}$ for which

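$$\operatorname{TV}(\psi_{n},f)\leq 3\inf_{\theta\in\Theta}\operatorname{TV}(f_{n,\theta},f)+2\sup_{A\in\mathcal{A}_{\Theta}}|\mu_{n}(A)-\mu(A)|+\frac{3}{2n}.$$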

The estimate $\psi_{n}$ in Theorem 2.3 is called the minimum distance estimate in [16]—we omit the details of its construction, though we emphasize that if computing $\int_{A}f_{n,\theta}$ takes one unit of computation for any $\theta$ and $A$ , then selecting $\psi_{n}$ takes time polynomial in the size of $\mathcal{A}$ , which is often exponential in the quantities of interest; for instance, if $\mathcal{A}$ is the Yatracos class of $\mathcal{F}_{k}$ , then a simple construction shows that $\mathcal{A}$ contains all subsets of $\{1,\dots,2{\lfloor k/2\rfloor}\}$ containing only odd numbers, whence

$$2^{\lfloor k/2\rfloor}\leq|\mathcal{A}|\leq 2^{k},$$

where the upper bound is trivial.

Theorem 2.4 (Devroye, Lugosi [16]).

Let $\mathcal{F},\mathcal{X},f,\mu,\mu_{n}$ be as in Theorem 2.3, and let $\mathcal{A}\subseteq\mathcal{P}(\mathcal{X})$ . Then, there is a universal constant $c>0$ for which

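$$\sup_{f\in\mathcal{F}}\mathbf{E}\Bigl\{\sup_{A\in\mathcal{A}}|\mu_{n}(A)-\mu(A)|\Bigr\}\leq c\sqrt{\frac{\operatorname{VC}(\mathcal{A})}{n}}.$$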

Remark 2.5.

The quantity $\sup_{A\in\mathcal{A}}|\mu(A)-\nu(A)|$ in Theorem 2.4 is precisely equal to $\operatorname{TV}(\mu,\nu)$ if $\mathcal{A}$ is the Borel $\sigma$ -algebra on $\mathcal{X}$ .

Corollary 2.6.

Let $\mathcal{F},\mathcal{A}_{\Theta},f_{n,\theta},f,\mu,\mu_{n}$ be as in Theorem 2.3. Then, there is a universal constant $c>0$ for which

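$$\mathcal{R}_{n}(\mathcal{F})\leq 3\sup_{f\in\mathcal{F}}\inf_{\theta\in\Theta}\operatorname{TV}(f_{n,\theta},f)+c\sqrt{\frac{\operatorname{VC}(\mathcal{A}_{\Theta})}{n}}+\frac{3}{2n}.$$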

3 Non-increasing densities

This section is devoted to presenting a proof of the upper bound of Theorem 2.1. The lower bound is proved in Appendix A.1 using a careful but standard application of Assouad’s Lemma [2]. Part of our analysis in proving Theorem 2.1 will involve the development of an explicit efficient estimator for a density in $\mathcal{F}_{k}$ .

3.1 A greedy tree-based estimator

Suppose that $k$ is a power of two. This assumption changes the final minimax rate by at most a constant factor. Using the samples $X_{1},\dots,X_{n}\overset{\text{i.i.d.}}{\sim}f\in\mathcal{F}_{k}$, we recursively construct a rooted ordered binary tree $\widehat{T}$ which determines a partition of the interval $\{1,\dots,k\}$, from which we can build a histogram estimate $\hat{f}_{n}$ for $f$. Specifically, let $\rho$ be the root of $\widehat{T}$, where $I_{\rho}=\{1,\dots,k\}$. We say that $\rho$ covers the interval $I_{\rho}$. Then, for every node $u$ in $\widehat{T}$ covering the interval

$$I_{u}=\{a_{u},a_{u}+1,\dots,a_{u}+|I_{u}|-1\},$$

we first check if $|I_{u}|=1$ , and if so we make $u$ a leaf in $\widehat{T}$ . Otherwise, if

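$$I_{v}=\{a_{u},\dots,a_{u}+|I_{u}|/2-1\}\quad\text{and}\quad I_{w}=\{a_{u}+|I_{u}|/2,\dots,a_{u}+|I_{u}|-1\}$$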

are the first and second halves of $I_{u}$ , we verify the condition

$$|N_{v}-N_{w}|>\sqrt{N_{v}+N_{w}},\tag{3.1}$$

where $N_{v},N_{w}$ are the number of samples which fall into the intervals $I_{v},I_{w}$ , i.e.,

$$N_{z}=\sum_{i=1}^{n}\mathbf{1}\{X_{i}\in I_{z}\},\quad\text{for }z\in\{v,w\}.$$

The inequality (3.1) is referred to as the greedy splitting rule. If (3.1) is satisfied, then create nodes $v,w$ covering $I_{v}$ and $I_{w}$ respectively, and add them to $\widehat{T}$ as left and right children of $u$ . If not, make $u$ a leaf in $\widehat{T}$ .

After applying this procedure, one obtains a (random) tree $\widehat{T}$ with leaves $\widehat{L}$ , and the set $\{I_{u}\colon u\in\widehat{L}\}$ forms a partition of the support $\{1,\dots,k\}$ . Let $\hat{f}_{n}$ be the histogram estimate based on this partition, i.e.,

$$\hat{f}_{n}(x)=\frac{N_{u}}{n|I_{u}|},\quad\text{if }x\in I_{u},\ u\in\widehat{L}.$$

The density estimate $\hat{f}_{n}$ is called the greedy tree-based estimator. See Figure 1 for a typical plot of $\hat{f}_{n}$ , and a visualization of the tree $\widehat{T}$ .
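The construction of this section can be sketched in a few lines. The code below is one possible reading of the procedure, not the authors' implementation: it assumes the greedy splitting rule compares $|N_{v}-N_{w}|$ with $\sqrt{N_{v}+N_{w}}$, the threshold described in Remark 3.1, and the names `greedy_leaves` and `greedy_tree_estimate` are illustrative. Intervals are represented as half-open, 0-based index pairs `(a, b)` standing for $\{a+1,\dots,b\}$.

```python
import math


def greedy_leaves(counts):
    """Leaf intervals of the greedy tree, left to right, as pairs (a, b).

    counts[i] is the number of samples equal to i + 1. len(counts) must be a
    power of two, matching the assumption on k in the text.
    """
    k = len(counts)
    if k == 0 or k & (k - 1):
        raise ValueError("k must be a power of two")
    prefix = [0]
    for c in counts:
        prefix.append(prefix[-1] + c)
    leaves, stack = [], [(0, k)]
    while stack:
        a, b = stack.pop()
        mid = (a + b) // 2
        n_v = prefix[mid] - prefix[a]  # samples in the first half I_v
        n_w = prefix[b] - prefix[mid]  # samples in the second half I_w
        # Assumed greedy splitting rule: |N_v - N_w| > sqrt(N_v + N_w).
        if b - a > 1 and abs(n_v - n_w) > math.sqrt(n_v + n_w):
            stack.append((mid, b))  # right child, visited after the left one
            stack.append((a, mid))  # left child
        else:
            leaves.append((a, b))
    return leaves


def greedy_tree_estimate(samples, k):
    """Histogram estimate N_u / (n |I_u|) on each leaf interval I_u."""
    n = len(samples)
    counts = [0] * k
    for x in samples:
        counts[x - 1] += 1
    estimate = [0.0] * k
    for a, b in greedy_leaves(counts):
        height = sum(counts[a:b]) / (n * (b - a))
        estimate[a:b] = [height] * (b - a)
    return estimate


samples = [1] * 8 + [2] * 4 + [3] * 2 + [4] * 2
print(greedy_leaves([8, 4, 2, 2]))
print(greedy_tree_estimate(samples, 4))
```

In this toy example with $n=16$ and $k=4$, the root and its left child are split, while the right child, whose two halves each contain two samples, becomes a leaf. The estimate sums to one by construction, since each leaf contributes $N_{u}/n$.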

Remark 3.1.

Intuitively, we justify the rule (3.1) as follows: We expect that $N_{v}$ is at least as large as $N_{w}$ by monotonicity of the density $f$ , and the larger the difference $|N_{v}-N_{w}|$ , the finer a partition around $I_{v}$ and $I_{w}$ should be to minimize the error of a piecewise constant estimate of $f$ . However, even if $N_{v}$ and $N_{w}$ were equal in expectation, we expect with positive probability that $N_{v}$ may deviate from $N_{w}$ on the order of a standard deviation, i.e., on the order of $\sqrt{N_{v}+N_{w}}$ , and this determines the threshold for splitting.

Remark 3.2.

One could argue that any good estimate of a non-increasing density should itself be non-increasing, and the estimate $\hat{f}_{n}$ does not have this property. This can be rectified using a method of Birgé [8], who described a transformation of piecewise-constant density estimates which does not increase risk with respect to non-increasing densities. Specifically, suppose that the estimate $\hat{f}_{n}$ is not non-increasing. Then, there are consecutive intervals $I_{v}$ , $I_{w}$ such that $\hat{f}_{n}$ has constant value $y_{v}$ on $I_{v}$ and $y_{w}$ on $I_{w}$ , and $y_{v}<y_{w}$ . Let the transformed estimate be constant on $I_{v}\cup I_{w}$ , with value

$$\frac{y_{v}|I_{v}|+y_{w}|I_{w}|}{|I_{v}|+|I_{w}|},$$

i.e., the average value of $\hat{f}_{n}$ on $I_{v}\cup I_{w}$ . Iterate the above transformation until a non-increasing estimate is obtained. It can be proven that this results in a unique estimate $\hat{f}_{n}^{\prime}$ , regardless of the order of merged intervals, and that

$$\operatorname{TV}(\hat{f}_{n}^{\prime},f)\leq\operatorname{TV}(\hat{f}_{n},f).$$
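The merging transformation of Remark 3.2 can be sketched as follows. This is an illustrative implementation under stated assumptions, not code from the paper or from [8]: a piecewise-constant estimate is represented as a list of `(length, value)` blocks from left to right, the function name `make_non_increasing` is hypothetical, and violations are merged in a left-to-right, stack-based order. By the uniqueness claim of the remark, the order of merging should not affect the result.

```python
def make_non_increasing(blocks):
    """Merge consecutive blocks until block values are non-increasing.

    blocks: list of (length, value) pairs describing a piecewise-constant
    function from left to right. Whenever a block has a smaller value than
    the block to its right, the two are replaced by a single block carrying
    their length-weighted average, as in Remark 3.2.
    """
    merged = []
    for length, value in blocks:
        merged.append((length, value))
        # Merge while the last two blocks violate monotonicity (y_v < y_w).
        while len(merged) >= 2 and merged[-2][1] < merged[-1][1]:
            (len_v, y_v), (len_w, y_w) = merged[-2], merged[-1]
            average = (y_v * len_v + y_w * len_w) / (len_v + len_w)
            merged[-2:] = [(len_v + len_w, average)]
    return merged


# A histogram on {1, 2, 3, 4} with values 0.2, 0.4, 0.2, 0.2.
blocks = [(1, 0.2), (1, 0.4), (2, 0.2)]
result = make_non_increasing(blocks)
print(result)
```

Each merge preserves the total mass $\sum(\text{length}\times\text{value})$, so the transformed estimate still sums to one.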

3.2 An idealized tree-based estimator

Instead of analyzing the greedy tree-based estimator $\hat{f}_{n}$ of the preceding section, we fully analyze an idealized version. Indeed, in (3.1), the quantities $N_{z}$ are distributed as $\operatorname{Binomial}(n,f_{z})$ for $z\in\{v,w\}$ , where we define

$$f_{z}=\sum_{x\in I_{z}}f(x).$$

If we replace the quantities in (3.1) with their expectations, we obtain the idealized splitting rule

$$f_{v}-f_{w}>\sqrt{\frac{f_{v}+f_{w}}{n}},\tag{3.2}$$

where we note that $f_{v}\geq f_{w}$ , since $f$ is non-increasing. Using the same procedure as in the preceding section, replacing the splitting rule with (3.2), we obtain a deterministic tree $T^{*}=T^{*}(f)$ with leaves $L^{*}$ , and we set

$$f^{*}_{n}(x)=\bar{f}_{u}=\frac{f_{u}}{|I_{u}|},\quad\text{if }x\in I_{u},\ u\in L^{*},$$

i.e., $f^{*}_{n}$ is constant and equal to the average value of $f$ on each interval $I_{u}$ for $u\in L^{*}$ . We call $f^{*}_{n}$ the idealized tree-based estimate. See Figure 2 for a visualization of $f^{*}_{n}$ and $T^{*}$ . It may be instructive to compare Figure 2 to Figure 1.

Of course, $T^{*}$ and $f^{*}_{n}$ both depend intimately upon knowledge of the density $f$; in practice, we only have access to the samples $X_{1},\dots,X_{n}\overset{\text{i.i.d.}}{\sim}f$, and the density $f$ itself is unknown. In particular, we cannot practically use $f^{*}_{n}$ as an estimate for unknown $f$. Importantly, as we will soon show, we can still use $f^{*}_{n}$ along with Corollary 2.6 to get a minimax rate upper bound for $\mathcal{F}_{k}$.
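Although $f^{*}_{n}$ is not a usable estimate, it is easy to compute for a known $f$, which helps when experimenting with the size $|L^{*}|$ of the idealized partition. The sketch below rests on an assumption: it takes the idealized rule (3.2) to be the greedy rule with each $N_{z}$ replaced by its expectation $nf_{z}$, i.e., $n(f_{v}-f_{w})>\sqrt{n(f_{v}+f_{w})}$, as the text describes. The name `idealized_tree_estimate` is illustrative, and intervals are half-open, 0-based pairs `(a, b)`.

```python
import math


def idealized_tree_estimate(f, n):
    """Leaves of the idealized tree and the estimate f*_n for a known pmf f.

    f: list of probabilities on {1, ..., k}, non-increasing, k a power of two.
    n: the sample size entering the (assumed) idealized splitting rule.
    Returns (leaves, estimate), with leaves listed from left to right.
    """
    k = len(f)
    if k == 0 or k & (k - 1):
        raise ValueError("k must be a power of two")
    prefix = [0.0]
    for p in f:
        prefix.append(prefix[-1] + p)
    leaves, estimate, stack = [], [0.0] * k, [(0, k)]
    while stack:
        a, b = stack.pop()
        mid = (a + b) // 2
        f_v = prefix[mid] - prefix[a]  # mass of the first half
        f_w = prefix[b] - prefix[mid]  # mass of the second half
        # Assumed rule: expectations n * f_z in place of the counts N_z.
        if b - a > 1 and n * (f_v - f_w) > math.sqrt(n * (f_v + f_w)):
            stack.append((mid, b))
            stack.append((a, mid))
        else:
            leaves.append((a, b))
            average = (prefix[b] - prefix[a]) / (b - a)
            estimate[a:b] = [average] * (b - a)
    return leaves, estimate


f = [0.5, 0.25, 0.125, 0.125]
for n in (1, 16):
    print(n, idealized_tree_estimate(f, n))
```

With $n=1$ the root is not split and $f^{*}_{n}$ is uniform; with $n=16$ the partition is fine enough that $f^{*}_{n}=f$ in this example. The partition refines as $n$ grows, which is the trade-off between the approximation term and the $\operatorname{VC}$ term in Corollary 2.6.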

Proposition 3.3.

[TABLE]

Proof.

Writing out the TV-distance explicitly, we have

[TABLE]

Let $u\in L^{*}$ , and define $A_{u}=\sum_{x\in I_{u}}|\bar{f}_{u}-f(x)|$ . If $|I_{u}|=1$ , then $A_{u}=0$ , so assume that $|I_{u}|>1$ . In this case, let $I_{v}$ and $I_{w}$ be the left and right halves of the interval $I_{u}$ , and let $\bar{f}_{v}$ and $\bar{f}_{w}$ be the average value of $f$ on $I_{v}$ and $I_{w}$ respectively. Write also

[TABLE]

Refer to Figure 3. We view $A_{u}$ as the positive area between the curve $f$ and the line $\bar{f}_{u}$ ; in the figure, this is the patterned area. Then, $B_{v}$ is the positive area between $f$ and $\bar{f}_{v}$ on $I_{v}$ , which is represented as the gray area on $I_{v}$ in Figure 3, and $B_{w}$ is the positive area between $f$ and $\bar{f}_{w}$ on $I_{w}$ , the gray area on $I_{w}$ in Figure 3. For $z\in\{v,u,w\}$ , let $x_{z}$ be the largest point in $I_{z}$ for which $f(x_{z})\geq\bar{f}_{z}$ . By the triangle inequality,

[TABLE]

Furthermore,

[TABLE]

where the second equality follows by the choice of $x_{v}$ . A similar relation holds for $B_{w}$ , whence

[TABLE]

where this last inequality follows from the splitting rule (3.2), since $u\in L^{*}$ and $|I_{u}|>1$ . So,

[TABLE]

by the Cauchy-Schwarz inequality. ∎

Proposition 3.4.

If $n\geq 64$ and $2n^{1/3}\leq k<n^{1/3}2^{n}$ , then

[TABLE]

Proof.

Note that $T^{*}$ has height at most $\log_{2}k$. Let $U_{j}$ be the set of nodes at depth $j-1$ in $T^{*}$ which have at least one leaf as a child, for $1\leq j\leq\log_{2}k$, and label the children of the nodes in $U_{j}$ in order of appearance from right to left in $T^{*}$ as $u_{1},u_{2},\dots,u_{2|U_{j}|}$. Since none of the nodes in $U_{j}$ are themselves leaves, then by (3.2),

[TABLE]

and in particular since $f_{u_{1}}\geq 0$ , then $f_{u_{2}}>\sqrt{f_{u_{2}}/n},$ so that $f_{u_{2}}>1/n$ . In general,

[TABLE]

and this recurrence relation can be solved to obtain that

[TABLE]

Let $L_{j}$ be the set of leaves at level $j$ in $T^{*}$ . The leaves at level $j$ in order from right to left form a subsequence $v_{1},v_{2},\dots,v_{|L_{j}|}$ of $u_{1},u_{2},\dots,u_{2|U_{j}|}$ . Write $q_{j}$ for the total probability mass of $f$ held in the leaves $L_{j}$ , i.e.,

[TABLE]

By (3.3) and since $f_{v_{i}}\geq f_{u_{i}}$ for each $i$ ,

[TABLE]

so that

[TABLE]

Summing over all leaves and using the facts that $n\geq 64$ and $2n^{1/3}\leq k<n^{1/3}2^{n}$ ,

[TABLE]

By Hölder’s inequality,

[TABLE]

so finally

[TABLE]

Proof of the upper bound in Theorem 2.1.

The case $k\geq n^{1/3}2^{n}$ is trivial, and follows simply because the TV-distance is always upper bounded by $1$ .

Suppose next that $2n^{1/3}>k$. In this regime, we can use a histogram estimator for $f$ with bins of size $1$ for each element of $\{1,\dots,k\}$. It is well known that the risk of this estimator is on the order of $\sqrt{k/n}$ [16].
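The $\sqrt{k/n}$ behaviour of the unit-bin histogram is easy to observe in simulation. The following sketch is an illustration only, with hypothetical names `empirical_pmf` and `average_tv_error`; it estimates the expected TV error for one fixed density (the uniform one), not the supremum over $\mathcal{F}_{k}$. The comparison value $\tfrac{1}{2}\sqrt{k/n}$ comes from a standard argument: $\mathbf{E}|N_{x}/n-f(x)|\leq\sqrt{f(x)/n}$ for each $x$, and $\sum_{x}\sqrt{f(x)}\leq\sqrt{k}$ by the Cauchy-Schwarz inequality.

```python
import math
import random


def empirical_pmf(samples, k):
    """Histogram with one bin per element of {1, ..., k}."""
    n = len(samples)
    counts = [0] * k
    for x in samples:
        counts[x - 1] += 1
    return [c / n for c in counts]


def tv(f, g):
    """TV distance as half the L1 distance."""
    return sum(abs(a - b) for a, b in zip(f, g)) / 2


def average_tv_error(f, n, trials, rng):
    """Monte Carlo estimate of E[TV(empirical pmf, f)] for a fixed pmf f."""
    k = len(f)
    support = range(1, k + 1)
    total = 0.0
    for _ in range(trials):
        samples = rng.choices(support, weights=f, k=n)
        total += tv(empirical_pmf(samples, k), f)
    return total / trials


rng = random.Random(0)
k, n = 8, 200
f = [1 / k] * k
avg = average_tv_error(f, n, trials=200, rng=rng)
print(avg, 0.5 * math.sqrt(k / n))
```

The simulated average should fall below $\tfrac{1}{2}\sqrt{k/n}$ while remaining of the same order.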

Finally, suppose that $2n^{1/3}\leq k<n^{1/3}2^{n}$ . Let $\mathcal{F}_{\Theta}$ be the class of all piecewise-constant probability densities on $\{1,\dots,k\}$ which have $\ell=|L^{*}|$ parts; in particular, $f^{*}_{n}\in\mathcal{F}_{\Theta}$ . Let $\mathcal{A}_{\Theta}$ be the Yatracos class of $\mathcal{F}_{\Theta}$ ,

[TABLE]

Then, $\mathcal{A}_{\Theta}\subseteq\mathcal{A}$ , where $\mathcal{A}$ is the class of all unions of at most $\ell$ intervals in $\mathbb{N}$ . It is well known that $\operatorname{VC}(\mathcal{A})=2\ell$ , so $\operatorname{VC}(\mathcal{A}_{\Theta})\leq 2\ell$ . By Corollary 2.6 and Proposition 3.3, there are universal constants $c_{1},c_{2}>0$ for which

[TABLE]

By Proposition 3.4, we see that for sufficiently large $n$ , there is a universal constant $c_{3}>0$ such that

[TABLE]

Remark 3.5.

Fix $B>0$ and let $\mathcal{F}_{B}^{\prime}$ be the class of all non-increasing densities supported on $[0,1]$ and bounded from above by $B$. Our method can be applied to prove a minimax rate upper bound for $\mathcal{F}_{B}^{\prime}$. Now, the tree $T^{*}$ underlying the idealized tree-based estimator is truncated at some given level, say $m$ to be specified, and the idealized estimator should take on the average value of the true density $f$ on the truncated leaves. Write $d(u)$ for the depth of the node $u$ in $T^{*}$. As in Proposition 3.3,

[TABLE]

The argument of Proposition 3.4 allows us to control the first sum, so that for some universal constant $c_{1}>0$ ,

[TABLE]

On the other hand, since $A_{u}\leq 5(f_{v}-f_{w})$ for $v$ the left child and $w$ the right child of $u$ , then

[TABLE]

An optimal choice of $m$ has that for a universal constant $c_{2}>0$ ,

[TABLE]

From here, using the same method as in the proof of Theorem 2.1, it follows that for some universal $c_{3}>0$ ,

[TABLE]

4 Non-increasing convex densities

Recall that $\mathcal{G}_{k}$ is the class of non-increasing convex densities supported on $\{1,\dots,k\}$ . Then, $\mathcal{G}_{k}$ forms a subclass of $\mathcal{F}_{k}$ , which we considered in Section 3. This section is devoted to extending the techniques of Section 3 in order to obtain a minimax rate upper bound on $\mathcal{G}_{k}$ . Again, the lower bound is proved using standard techniques in Appendix A.2.

In this section, we assume that $k$ is a power of three. In order to prove the upper bound of Theorem 2.2, we construct a ternary tree just as in Section 3, now with a ternary splitting rule, where if a node $u$ has children $v,w,r$ in order from left to right, we split and recurse if

[TABLE]

Here we obtain a tree $T^{\dagger}=T^{\dagger}(f)$ with leaves $L^{\dagger}$ . If $u\in L^{\dagger}$ has children $v,w,r$ from left to right, let $m_{z}$ be the midpoint of $I_{z}$ for $z\in\{v,w,r\}$ . Let the estimate $f^{\dagger}_{n}$ on $I_{u}$ be the line passing through the points $(m_{v},\bar{f}_{v})$ and $(m_{r},\bar{f}_{r})$ . Again, if $|I_{u}|=1$ , then $f^{\dagger}_{n}(x)=f(x)$ . We refer to $f_{n}^{\dagger}$ as the idealized tree-based estimate for $f$ .

Remark 4.1.

Since $f$ is non-increasing, the operation of Remark 3.2 can again be applied to $f_{n}^{\dagger}$ to obtain a non-increasing estimate ${f_{n}^{\dagger}}^{\prime}$ for which

[TABLE]

Proposition 4.2.

[TABLE]

Before proving this, we first note that by convexity of $f$ , the slope of the line passing through $(m_{w},\bar{f}_{w})$ and $(m_{r},\bar{f}_{r})$ is at least the slope of the line passing through $(m_{v},\bar{f}_{v})$ and $(m_{w},\bar{f}_{w})$ . Equivalently,

[TABLE]

Proof of Proposition 4.2.

As in the proof of Proposition 3.3, we have

[TABLE]

for $A_{u}=\sum_{x\in I_{u}}|f_{n}^{\dagger}(x)-f(x)|$ . We refer to Figure 4 for a visualization of the quantity $A_{u}$ , which is depicted as the patterned area.

For $z\in\{v,w,r\}$ , write

[TABLE]

By convexity of $f$ ,

[TABLE]

Observe also by convexity that $f(m_{w})\geq\bar{f}_{w}$ and $f(m_{r})\geq\bar{f}_{r}$ . So, the line segment between $(m_{w},\bar{f}_{w})$ and $(m_{r},\bar{f}_{r})$ lies above $f$ . Let $g_{wr}\colon\mathbb{R}\to\mathbb{R}$ be the line passing through the points $(m_{w},\bar{f}_{w})$ and $(m_{r},\bar{f}_{r})$ . Then,

[TABLE]

Let $m_{wr}$ be the midpoint of $m_{w}$ and $m_{r}$ , and $m_{vw}$ be the midpoint of $m_{v}$ and $m_{w}$ . Let $x_{w}\in[-\infty,m_{wr}]$ be the leftmost point where $g_{wr}$ intersects $f$ , if at all. Then,

[TABLE]

Since the right-hand side is non-negative, then indeed $x_{w}\in I_{w}$ and it must be that $f$ lies above $g_{wr}$ to the left of $x_{w}$ . Similarly, if $x_{w}^{\prime}\in[m_{vw},\infty]$ denotes the rightmost point where the line $g_{vw}\colon\mathbb{R}\to\mathbb{R}$ passing through $(m_{v},\bar{f}_{v})$ and $(m_{w},\bar{f}_{w})$ intersects $f$ , then $x_{w}^{\prime}\in I_{w}$ and $f$ lies above $g_{vw}$ to the right of $x_{w}^{\prime}$ . Therefore,

[TABLE]

It remains to bound $B_{v}$ and $B_{r}$. Let $x_{v}\in I_{v}$ be the point where the line passing through $(m_{v},\bar{f}_{v})$ and $(m_{r},\bar{f}_{r})$ intersects $f$. As before, this point exists, and since $f$ is non-increasing, $x_{v}\leq m_{v}$. Furthermore,

[TABLE]

where the inequality follows from convexity and earlier remarks. A similar argument follows for $B_{r}$ .

In total,

[TABLE]

The result then follows from the splitting rule (4.1) and the Cauchy-Schwarz inequality. ∎

Proposition 4.3.

If $n\geq 3^{10}$ and $3n^{1/5}\leq k<n^{1/5}3^{n}$ , then

[TABLE]

Proof.

The tree $T^{\dagger}$ has height at most $\log_{3}k$ . Let $U_{j}$ be the set of nodes at depth $j-1$ in $T^{\dagger}$ with at least one leaf as a child, for $1\leq j\leq\log_{3}k$ , labelled in order of appearance from right to left in $T^{\dagger}$ as $u_{1},u_{2},\dots,u_{3|U_{j}|}$ . By the convex splitting rule (4.1), and since $f$ is non-increasing,

[TABLE]

so in particular, $f_{u_{3}}-f_{u_{2}}>1/n$ , and $f_{u_{3}}>1/n$ . In general,

[TABLE]

We claim now that $f_{u_{3i}}-f_{u_{3i-1}}>\frac{i^{3}}{27n}$ , which we prove by induction; the base case is shown above, and by the induction hypothesis,

[TABLE]

for all $i\geq 4$ , while the cases $i=2,3$ can be manually verified. Then, by monotonicity of $f$ ,

[TABLE]

Let now $L_{j}$ be the set of leaves at level $j$ in $T^{\dagger}$ . The leaves at level $j$ in order from right to left form a subsequence $v_{1},\dots,v_{|L_{j}|}$ of $u_{1},\dots,u_{3|U_{j}|}$ . Let $q_{j}$ be the total probability mass of $f$ held in the leaves $L_{j}$ . By (4.3) and since $f_{v_{i}}\geq f_{u_{i}}$ for each $i$ ,

[TABLE]

so that

[TABLE]

Summing over all leaves,

[TABLE]

By Hölder’s inequality,

[TABLE]

so finally

[TABLE]

Proof of the upper bound in Theorem 2.2.

The proof is similar to that of Theorem 2.1. ∎

Remark 4.4.

As in Remark 3.5, the argument can be replicated in the continuous case, for bounded non-increasing convex densities supported on $[0,1]$.

5 Discussion

It seems likely, given our results on the idealized tree-based estimators from Section 3.2 and Section 4, that the greedy tree-based estimators also behave well. In particular, we suspect that our greedy tree-based estimators are minimax-optimal within logarithmic factors. We leave this open to future work.

It is also often desirable for nonparametric estimators to be adaptive, in the sense that they attain the optimal minimax rate without depending on some of the important features of the nonparametric class in question. In some cases, an adaptive density estimate can be constructed by first estimating these features, and then building a density estimate assuming the estimated features. For example, in [8], an adaptive estimate for non-increasing densities is developed by first estimating the size of the support, and plugging this estimated support size into a non-adaptive estimate. We expect that in this manner, our method can be made adaptive.

The techniques of this paper seem to naturally extend to higher dimensions. Take, for instance, the class of block-decreasing densities, whose minimax rate was identified by Biau and Devroye [6]. This is the class of densities supported on $[0,1]^{d}$ bounded by some constant $B\geq 0$ , such that each density is non-increasing in each coordinate if all other coordinates are held fixed. The discrete version of this class has each density supported on $\{1,\dots,k\}^{d}$ , with the monotonicity constraint. In order to estimate such a density, one could devise an oriented binary splitting rule analogous to (3.2) and carry out a similar analysis as performed in Section 3.2.
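As a small illustration of the discrete block-decreasing constraint described above, the following sketch checks whether a function on $\{1,\dots,k\}^{d}$ is non-increasing in each coordinate when the others are held fixed. The dictionary encoding and the name `is_block_decreasing` are assumptions made for this example only; they do not come from the paper.

```python
from itertools import product


def is_block_decreasing(f, k, d, tol=1e-12):
    """Return True if f: {1..k}^d -> R is non-increasing in every coordinate.

    f is a dict keyed by d-tuples of integers in 1..k (a hypothetical encoding).
    """
    for x in product(range(1, k + 1), repeat=d):
        for axis in range(d):
            if x[axis] == k:
                continue  # no neighbour further along this axis
            # y is x with coordinate `axis` increased by one.
            y = x[:axis] + (x[axis] + 1,) + x[axis + 1:]
            if f[y] > f[x] + tol:
                return False
    return True


# Product of two non-increasing sequences on {1,2,3}, normalised to a density.
weights = [3.0, 2.0, 1.0]
total = sum(a * b for a in weights for b in weights)
density = {(i, j): weights[i - 1] * weights[j - 1] / total
           for i, j in product(range(1, 4), repeat=2)}
```

The check only compares each point with its immediate neighbour along each axis, which is enough because monotonicity along a coordinate follows from monotonicity between consecutive atoms.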

Furthermore, we expect that there are many other classes of one-dimensional densities whose optimal minimax rate could be identified using our approach, like the class of $\ell$ -monotone densities on $\{1,\dots,k\}$ , where a function $f$ is called $\ell$ -monotone if it is non-negative and if $(-1)^{j}f^{(j)}$ is non-increasing and convex for all $j\in\{0,\dots,\ell-2\}$ if $\ell\geq 2$ , and where $f$ is non-negative and non-increasing if $\ell=1$ . This paper tackles the cases of $\ell=1$ and $\ell=2$ . Write $\mathcal{F}_{k,\ell}$ for the class of $\ell$ -monotone densities on $\{1,\dots,k\}$ . See Balabdaoui and Wellner [3, 4] for texts concerning the density estimation of $\ell$ -monotone densities. It seems likely that our method could be applied to prove the following conjecture.

Conjecture 5.1.

Let $f\colon\mathbb{N}\times\mathbb{N}\times\mathbb{N}\times\mathbb{R}\to\mathbb{R}$ be

[TABLE]

Let $\ell\geq 1$ be fixed. There are constants $\alpha,C,n_{0}\geq 1$ depending only on $\ell$ such that, for $n\geq n_{0}$ ,

[TABLE]

The main obstacle in proving the above would be the development of good local estimates for $\ell$ -monotone densities, in the same flavor as Proposition 3.3 and Proposition 4.2.
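For concreteness, the two cases of $\ell$-monotonicity treated in this paper ($\ell=1$: non-negative and non-increasing; $\ell=2$: additionally convex) can be checked on a finite sequence as sketched below. The function names, the list encoding `f[i-1] = f(i)`, and the tolerance are assumptions for illustration. Convexity is tested through second differences inside $\{1,\dots,k\}$ only, which may differ from the paper's convention at the boundary of the support.

```python
def is_non_increasing(f, tol=1e-12):
    """True if f(1) >= f(2) >= ... >= f(k), up to a small tolerance."""
    return all(f[i + 1] <= f[i] + tol for i in range(len(f) - 1))


def is_convex(f, tol=1e-12):
    """Discrete convexity: all second differences are non-negative."""
    return all(f[i] - 2 * f[i + 1] + f[i + 2] >= -tol
               for i in range(len(f) - 2))


def is_ell_monotone(f, ell, tol=1e-12):
    """Check ell-monotonicity on {1,...,k} for ell in {1, 2} only."""
    if ell not in (1, 2):
        raise ValueError("this sketch only covers ell = 1 and ell = 2")
    if any(v < -tol for v in f):
        return False
    if not is_non_increasing(f, tol):
        return False
    return ell == 1 or is_convex(f, tol)


geometric = [8 / 15, 4 / 15, 2 / 15, 1 / 15]  # non-increasing and convex
step = [0.4, 0.4, 0.1, 0.1]                   # non-increasing, not convex
```

Here `geometric` is both 1-monotone and 2-monotone, while `step` is 1-monotone only, since its second difference at the jump is negative.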

Our approach can likely also be applied to the class of all log-concave discrete distributions, where we recall that $f:\mathbb{N}\to[0,1]$ is called log-concave if

[TABLE]

See [17, 27, 30] for a small selection of works on the density estimation of $d$ -dimensional log-concave continuous densities. The optimal Hellinger distance minimax rate (within logarithmic factors) for this class was recently obtained by Dagan and Kur [12], who showed that it is attained by the maximum-likelihood estimate. There remains a small gap between the best known upper and lower bounds in the TV-distance minimax rate as of the time of writing.

Acknowledgments

We would like to thank the three reviewers and an associate editor for their helpful comments and suggestions.

Appendix A Lower bounds

Lemma A.1 (Assouad’s Lemma [2, 16]).

Let $\mathcal{F}$ be a class of densities supported on the set $\mathcal{X}$ . Let $A_{0},A_{1},\dots,A_{r}$ be a partition of $\mathcal{X}$ , and $g_{ij}\colon A_{i}\to\mathbb{R}$ for $0\leq i\leq r$ and $j\in\{0,1\}$ be some collection of functions. For $\theta=(\theta_{1},\dots,\theta_{r})\in\{0,1\}^{r}$ , define the function $f_{\theta}\colon\mathcal{X}\to\mathbb{R}$ by

[TABLE]

such that each $f_{\theta}$ is a density on $\mathcal{X}$. Let $\zeta_{i}\in\{0,1\}^{r}$ agree with $\theta$ on all bits except for the $i$-th bit. Then, suppose that

[TABLE]

and

[TABLE]

Let $\mathcal{H}$ be the hypercube of densities

[TABLE]

If $\mathcal{H}\subseteq\mathcal{F}$ , then

[TABLE]

A.1 Proof of the lower bound in Theorem 2.1.

Suppose first that $e^{8}n^{1/3}\leq k\leq n^{1/3}e^{n}$ . Let $A_{1},\dots,A_{r}$ be consecutive intervals of even cardinality, starting from the leftmost atom $1$ . Split each $A_{i}$ in two equal parts, $A_{i}^{\prime}$ and $A_{i}^{\prime\prime}$ . Let $\varepsilon\in(0,1/\sqrt{2})$ , and set

[TABLE]

It is clear that each $f_{\theta}$ is a density. In order for each $f_{\theta}$ to be monotone, we require that

[TABLE]

and in particular

[TABLE]

Pick $|A_{1}|=2$ . Since $\log(1+\varepsilon)-\log(1-\varepsilon)\leq 4\varepsilon$ for $\varepsilon\in(0,1/\sqrt{2})$ , it suffices to take

[TABLE]

Let $|A_{i}|$ be the smallest even integer at least equal to $a_{i}$ , so that $a_{i}\leq|A_{i}|\leq a_{i}+2$ , and thus

[TABLE]

Since the support of our densities is $\{1,\dots,k\}$ , then we ask that this last upper bound not exceed $k$ . We can guarantee this in particular with a choice of $r$ and $\varepsilon$ for which

[TABLE]
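The elementary inequality $\log(1+\varepsilon)-\log(1-\varepsilon)\leq 4\varepsilon$ on $(0,1/\sqrt{2})$, used above to choose the interval sizes, can be sanity-checked numerically. The sketch below evaluates the gap between the two sides on a grid; the helper name and grid size are arbitrary choices for illustration, and a grid check is not a proof.

```python
import math


def log_ratio_gap(eps):
    """Return 4*eps - (log(1 + eps) - log(1 - eps)).

    The value is non-negative exactly when the inequality holds at eps.
    """
    return 4 * eps - (math.log1p(eps) - math.log1p(-eps))


upper = 1 / math.sqrt(2)
grid = [upper * i / 10_000 for i in range(1, 10_000)]
worst_gap = min(log_ratio_gap(eps) for eps in grid)
```

The derivative of the gap is $4-2/(1-\varepsilon^{2})$, which is non-negative precisely for $\varepsilon\leq 1/\sqrt{2}$, so the gap is increasing on the whole range and vanishes only as $\varepsilon\to 0$.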

Fix $1\leq i\leq r$ . Then,

[TABLE]

so

[TABLE]

On the other hand,

[TABLE]

Now pick

[TABLE]

and $r$ for which

[TABLE]

or equivalently,

[TABLE]

Note that $k\leq n^{1/3}e^{n}$ now implies that $\varepsilon\in(0,1/\sqrt{2})$ . With this choice, Lemma A.1 implies that

[TABLE]

So we need only verify that these choices of $\varepsilon$ and $r$ are compatible with (A.1). Since $k\geq e^{8}n^{1/3}$ , then there is an integer choice of $r$ in the range

[TABLE]

In particular, we can verify that

[TABLE]

since $2\log^{2/3}x\leq x$ for all $x\geq 0$ . Moreover, since $k\geq e^{8}n^{1/3}$ , then $\varepsilon\geq\frac{1}{2n^{1/3}}$ , so that

[TABLE]

where this last inequality holds since $\sqrt{x}\leq x/2$ for all $x\geq e^{8}$ , so that (A.2) is proved.

When $k\geq n^{1/3}e^{n}$ , we argue by inclusion that

[TABLE]

The only remaining case is $k\leq e^{8}n^{1/3}$ . In this case, we offer a different construction. Now, each $A_{i}$ will have size $2$ for $1\leq i\leq r$ , where $r={\lfloor k/2\rfloor}$ . Fix $a,b\in\mathbb{R}$ to be specified later, and set

[TABLE]

We insist that

[TABLE]

for some $0\leq\varepsilon\leq 1$ . Since each $f_{\theta}$ must be a density, we need that

[TABLE]

Both of these conditions will be satisfied if we pick

[TABLE]

Furthermore, the largest probability of an atom here is

[TABLE]

for $k\geq 2$ . Then, for $1\leq i\leq r$ , we can compute

[TABLE]

so

[TABLE]

and

[TABLE]

Pick $\varepsilon=e^{-12}r\sqrt{k/n}$ . Then, since $2\leq k\leq e^{8}n^{1/3}$ and $r={\lfloor k/2\rfloor}\geq k/3$ , then $\varepsilon\leq 1/2$ , and

[TABLE]

so that

[TABLE]
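The claim that $\varepsilon=e^{-12}r\sqrt{k/n}$ stays below $1/2$ in the regime $2\leq k\leq e^{8}n^{1/3}$ follows from $r\leq k/2$, which gives $\varepsilon\leq e^{-12}k^{3/2}/(2\sqrt{n})\leq 1/2$. The sketch below spot-checks this and the bound $r\geq k/3$ for a few values of $n$; the function names and the sample values are assumptions made for illustration.

```python
import math


def small_k_epsilon(n, k):
    """Return (r, eps) with r = floor(k/2) and eps = e^{-12} * r * sqrt(k/n)."""
    r = k // 2
    eps = math.exp(-12) * r * math.sqrt(k / n)
    return r, eps


def largest_small_k(n):
    """Largest integer k satisfying k <= e^8 * n^(1/3)."""
    return math.floor(math.exp(8) * n ** (1 / 3))


checks = []
for n in (1, 10, 1000, 10**6):
    for k in (2, 3, largest_small_k(n)):
        r, eps = small_k_epsilon(n, k)
        # Record whether r >= k/3 and eps <= 1/2 both hold.
        checks.append(3 * r >= k and eps <= 0.5)
```

The largest admissible $k$ is the interesting case: there $\varepsilon$ comes close to $1/2$, so the constant $e^{-12}$ is essentially what the regime boundary $e^{8}n^{1/3}$ requires.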

A.2 Proof of the lower bound in Theorem 2.2.

Let $A_{1},\dots,A_{r}$ be the partition in Lemma A.1, for an integer $r\geq 1$ to be specified. Let $j_{i}$ be the smallest element of $A_{i}$ , and suppose that each $|A_{i}|$ is chosen to be a positive multiple of $3$ . We will define the functions $f_{\theta}$ based on parameters $\beta_{i},\Delta_{i}\in\mathbb{R}$ , to be specified. Let $g_{i0}$ linearly interpolate between the points

[TABLE]

on $\{j_{i},j_{i}+1,\dots,j_{i}+|A_{i}|/3-1\}$ , and between the points

[TABLE]

on $\{j_{i}+|A_{i}|/3,\dots,j_{i}+|A_{i}|-1\}$ . Let $g_{i1}$ linearly interpolate between the points

[TABLE]

on $\{j_{i},j_{i}+1,\dots,j_{i}+2|A_{i}|/3-1\}$ , and between the points

[TABLE]

on $\{j_{i}+2|A_{i}|/3,\dots,j_{i}+|A_{i}|-1\}$. Then, each $f_{\theta}$ will be non-increasing as long as $\beta_{i}\geq\beta_{i+1}$ for each $1\leq i\leq r$, and

[TABLE]

Each $f_{\theta}$ will be convex as long as the largest slope on $A_{i}$ is at most the smallest slope on $A_{i+1}$ for each $1\leq i\leq r$ . Equivalently,

[TABLE]

Now, pick $\beta_{i}=(1-\varepsilon)^{i-1}\beta$ for some $\beta\in\mathbb{R}$ , $\varepsilon\in(0,1)$ to be specified, and

[TABLE]

for which (A.3) is immediately satisfied. The condition (A.4) is then equivalent to

[TABLE]

Pick $|A_{1}|=3$ . It is sufficient to make the choice

[TABLE]

Let $|A_{i}|$ be the smallest integer multiple of $3$ at least as large as $a_{i}$ , so that $a_{i}\leq|A_{i}|\leq a_{i}+3$ . If $\varepsilon\leq 1/2$ , then

[TABLE]

Since the support of our densities is $\{1,\dots,k\}$ , this upper bound must not exceed $k$ , so we impose that

[TABLE]
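The rounding step used above, taking $|A_{i}|$ to be the smallest multiple of $3$ at least as large as $a_{i}$, can be written as a one-line helper. The name `smallest_multiple_at_least` is hypothetical, and the bound $a\leq |A|\leq a+m$ is only claimed for $a>0$.

```python
import math


def smallest_multiple_at_least(a, m):
    """Smallest positive integer multiple of m that is >= a (for real a > 0)."""
    return m * max(1, math.ceil(a / m))


# Sizes for a few hypothetical target values a_i, rounded up to multiples of 3.
targets = [0.5, 3, 3.2, 7.9, 9]
sizes = [smallest_multiple_at_least(a, 3) for a in targets]
```

With `m = 2` the same helper gives the "smallest even integer at least equal to $a_{i}$" used in Appendix A.1.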

We must tune $\beta$ in order for each $f_{\theta}$ to be a density. By monotonicity, we must have

[TABLE]

and

[TABLE]

so there is a choice of $\beta$ where

[TABLE]

as long as $\varepsilon\leq 1/2$ . Now, fix $1\leq i\leq r$ . Then,

[TABLE]

and

[TABLE]

as long as $\varepsilon\leq 1/2$ , whence

[TABLE]

Now, pick

[TABLE]

and $r$ for which

[TABLE]

or equivalently,

[TABLE]

Note that $k\leq n^{1/5}e^{n}$ now implies that $\varepsilon\leq 1/2$. With this choice, Lemma A.1 implies that

[TABLE]

so it remains to verify that our choices of $\varepsilon$ and $r$ are compatible with (A.5). Since $k\geq e^{40}n^{1/5}$ , then there is an integer choice of $r$ in the range

[TABLE]

In particular, we can verify that

[TABLE]

since $(2/3)\log^{4/5}x\leq x$ for all $x\geq 0$ . Moreover, since $k\geq e^{40}n^{1/5}$ , then $\varepsilon\geq 1/n^{1/5}$ , so that

[TABLE]

since $x^{1/9}\leq x/3$ for all $x\geq e^{40}$ , so that (A.6) is proved.

When $k\geq n^{1/5}e^{n}$ , we argue by inclusion that

[TABLE]

It remains to prove the case $k\leq e^{40}n^{1/5}$. Observe that $\mathcal{G}_{2}=\mathcal{F}_{2}$, so the lower bound for $k=2$ follows from Appendix A.1, and we may assume that $k\geq 3$. Now, each $A_{i}$ will have size $3$ for $1\leq i\leq r$, where $r={\lfloor k/3\rfloor}$. Fix $a,b\in\mathbb{R}$ to be specified later, and set

[TABLE]

and

[TABLE]

Each $f_{\theta}$ will be non-increasing as long as $\beta_{i}\geq\beta_{i+1}$ , and

[TABLE]

for each $1\leq i\leq r$ . Convexity will follow if

[TABLE]

or equivalently,

[TABLE]

We need also that $\beta_{1}\leq 1$ , $\beta_{r+1}\geq 0$ , and

[TABLE]

Take $\beta_{i+1}=\beta_{i}-3\Delta_{i}-\alpha(r-i)$ for $\alpha\geq 0$ to be specified. Monotonicity follows, and convexity will follow if

[TABLE]

So take each $\Delta_{i}=\alpha/6$ . Then,

[TABLE]

and in particular,

[TABLE]

Take $\alpha=\varepsilon/r^{3}$ for some $0\leq\varepsilon\leq 1$ , whence

[TABLE]
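To make the recursion concrete, the sketch below iterates $\beta_{i+1}=\beta_{i}-3\Delta_{i}-\alpha(r-i)$ with $\Delta_{i}=\alpha/6$ and $\alpha=\varepsilon/r^{3}$ in exact rational arithmetic. Summing the decrements gives a total drop $\beta_{1}-\beta_{r+1}=\alpha r^{2}/2=\varepsilon/(2r)$. This closed form is derived here from the stated choices and is not quoted from the paper's displays. The function name and the sample values of $\beta_{1}$, $r$, and $\varepsilon$ are assumptions for illustration.

```python
from fractions import Fraction


def beta_sequence(beta_1, r, eps):
    """Return [beta_1, ..., beta_{r+1}] for the recursion
    beta_{i+1} = beta_i - 3*Delta_i - alpha*(r - i),
    with Delta_i = alpha/6 and alpha = eps / r**3."""
    alpha = Fraction(eps) / r**3
    delta = alpha / 6
    betas = [Fraction(beta_1)]
    for i in range(1, r + 1):
        betas.append(betas[-1] - 3 * delta - alpha * (r - i))
    return betas


# Hypothetical sample values: beta_1 = 1/2, r = 5, eps = 1/4.
betas = beta_sequence(Fraction(1, 2), 5, Fraction(1, 4))
total_drop = betas[0] - betas[-1]
```

Each step decreases $\beta_{i}$ by $\alpha/2+\alpha(r-i)$, so the decrements shrink as $i$ grows, which is the behaviour convexity needs.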

By monotonicity,

[TABLE]

and

[TABLE]

so that the right choice of $\beta_{1}$ satisfies

[TABLE]

Fix $1\leq i\leq r$ . Then,

[TABLE]

and if $\varepsilon\leq 1/2$ ,

[TABLE]

Finally, pick $\varepsilon=e^{-100}r^{2}\sqrt{k/n}$ . Since $k\leq e^{40}n^{1/5}$ and $r={\lfloor k/3\rfloor}\geq k/6$ , then $\varepsilon\leq 1/2$ , and

[TABLE]

so by Lemma A.1,

[TABLE]

Bibliography

[1] Anevski, D. (2003). Estimating the derivative of a convex density. Statist. Neerlandica 57, 245–257. doi:10.1111/1467-9574.00229. MR2028914.
[2] Assouad, P. (1983). Deux remarques sur l’estimation. C. R. Acad. Sci. Paris Sér. I Math. 296, 1021–1024. MR777600.
[3] Balabdaoui, F. and Wellner, J. A. (2007). Estimation of a $k$-monotone density: limit distribution theory and the spline connection. Ann. Statist. 35, 2536–2564. doi:10.1214/009053607000000262. MR2382657.
[4] Balabdaoui, F. and Wellner, J. A. (2010). Estimation of a $k$-monotone density: characterizations, consistency and minimax lower bounds. Stat. Neerl. 64, 45–70. doi:10.1111/j.1467-9574.2009.00438.x. MR2830965.
[5] Bellec, P. C. (2018). Sharp oracle inequalities for least squares estimators in shape restricted regression. Ann. Statist. 46, 745–780. doi:10.1214/17-AOS1566. MR3782383.
[6] Biau, G. and Devroye, L. (2003). On the risk of estimates for block decreasing densities. J. Multivariate Anal. 86, 143–165. doi:10.1016/S0047-259X(02)00028-3. MR1994726.
[7] Birgé, L. (1987). Estimating a density under order restrictions: nonasymptotic minimax risk. Ann. Statist. 15, 995–1012. doi:10.1214/aos/1176350488. MR902241.
[8] Birgé, L. (1987). On the risk of histograms for estimating decreasing densities. Ann. Statist. 15, 1013–1022. doi:10.1214/aos/1176350489. MR902242.
