Optimal Linear Discriminators For The Discrete Choice Model In Growing Dimensions

Debarghya Mukherjee; Moulinath Banerjee; Ya'acov Ritov

arXiv:1903.10063·math.ST·August 11, 2020

TL;DR

This paper investigates the behavior of Manski's maximum score estimator for discrete choice models in high-dimensional settings, deriving convergence rates, bounds, and optimal estimators for different growth regimes of the dimension.

Contribution

It extends the analysis of the maximum score estimator to scenarios where the number of predictors grows with the sample size, providing convergence rates, bounds, and computational methods.

Findings

01

Derived $\ell_{2}$ convergence rates under different growth regimes.

02

Established minimax bounds for estimation error in high dimensions.

03

Proposed algorithms for computing the maximum score estimator in large dimensions.

Equations

$$Y_{i}^{*}=X_{i}^{\prime}\beta^{0}+\epsilon_{i}$$

$$Y_{i}=\text{sgn}(Y_{i}^{*})=\text{sgn}(X_{i}^{\top}\beta^{0}+\epsilon_{i}).$$

$$S(\beta)=\mathbb{E}(Y\,\text{sgn}(X^{\prime}\beta))=\mathbb{E}(\text{sgn}(Y^{*})\,\text{sgn}(X^{\prime}\beta))$$

$$S_{n}(\beta)=\frac{1}{n}\sum_{i=1}^{n}Y_{i}\,\text{sgn}(X_{i}^{\mathsf{T}}\beta)=\frac{1}{n}\sum_{i=1}^{n}\text{sgn}(Y_{i}^{*})\,\text{sgn}(X_{i}^{\mathsf{T}}\beta).$$

$$\hat{\beta}_{n}=\arg\max_{\beta:\|\beta\|=1}S_{n}(\beta).$$

$$\eta(x)=\mathbb{P}(Y=1\mid X=x)=1-F_{\epsilon\mid X=x}(-x^{\mathsf{T}}\beta^{0}).$$

$$\mathcal{G}=\{g_{\beta}:g_{\beta}(x)=\text{sgn}(x^{\mathsf{T}}\beta),\ \|\beta\|=1\}.$$

$$L(\beta)=L(g_{\beta})=\mathbb{P}(Y\neq\text{sgn}(X^{\mathsf{T}}\beta)),$$

$$\text{sgn}(\eta(x)-0.5)=\text{sgn}(1-F_{\epsilon\mid X=x}(-x^{\mathsf{T}}\beta^{0})-0.5)=\text{sgn}(0.5-F_{\epsilon\mid X=x}(-x^{\mathsf{T}}\beta^{0}))=\text{sgn}(x^{\mathsf{T}}\beta^{0})=g_{\beta^{0}}(x)\quad[\because\ \text{med}(\epsilon\mid X)=0].$$

$$u_{i,k}\geq u_{i,j}\ \text{ for all }j\neq k,\ j\in\{1,2,\dots,m\}.$$

$$y_{i,k}=\begin{cases}1,&\text{if }u_{i,k}\geq u_{i,j}\text{ for all }j\neq k,\ j\in\{1,2,\dots,m\}\\0,&\text{otherwise.}\end{cases}$$

$$p(j\mid\mathbf{X},\beta)=\mathbb{P}(\mathbf{x}_{j}^{\top}\beta+\epsilon_{j}\geq\mathbf{x}_{k}^{\top}\beta+\epsilon_{k}\ \ \forall k\in\{1,\dots,m\},\ j\neq k\mid\mathbf{X}).$$

$$S_{n}^{(mult)}(\beta)=\frac{1}{nm(m-1)}\sum_{i=1}^{n}\sum_{j=1}^{m}y_{i,j}\sum_{k\neq j}\mathbb{1}(\mathbf{x}_{i,j}^{\top}\beta>\mathbf{x}_{i,k}^{\top}\beta).$$

$$u_{i,1}-u_{i,2}>0\iff(\mathbf{x}_{i,1}-\mathbf{x}_{i,2})^{\top}\beta_{0}+(\epsilon_{i,1}-\epsilon_{i,2})>0,$$

$$\mathbb{P}\left(\left|\eta(X)-\frac{1}{2}\right|\leq t\right)\leq Ct^{\alpha}\quad\forall\ 0\leq t\leq t^{\star},$$

$$\mathbb{P}\left(\left|\eta(X)-\frac{1}{2}\right|\leq t\right)\leq C_{n}t\quad\forall\ 0\leq t\leq t^{\star},$$

$$\begin{aligned}\mathbb{P}_{X}(|\eta(X)-0.5|\leq t)&=\mathbb{P}_{X}(F_{\epsilon}^{-1}(0.5-t)\leq-X^{\top}\beta^{0}\leq F_{\epsilon}^{-1}(0.5+t))\\&\leq\mathbb{P}_{X}(|X^{\top}\beta^{0}|\leq(F_{\epsilon}^{-1}(0.5+t)\vee-F_{\epsilon}^{-1}(0.5-t)))\\&\leq 2k\,(F_{\epsilon}^{-1}(0.5+t)\vee-F_{\epsilon}^{-1}(0.5-t))\\&\leq\frac{2kt}{c_{\delta}}\end{aligned}$$

$$|F_{\epsilon\mid X=x}(x^{\mathsf{T}}\beta^{0})-0.5|=|F_{\epsilon\mid X=x}(x^{\mathsf{T}}\beta^{0})-F_{\epsilon\mid X=x}(0)|\geq C(|x^{\mathsf{T}}\beta^{0}|\wedge\xi)\quad\text{a.e. }X$$

$$\mathbb{P}_{X}(\text{sgn}(X^{\mathsf{T}}\beta)\neq\text{sgn}(X^{\mathsf{T}}\beta^{0}))\geq c_{1}\|\beta-\beta^{0}\|_{2}$$

$$S(\beta^{0})-S(\beta)\geq\left[\frac{\|\beta-\beta^{0}\|_{2}^{2}}{C_{n}}\mathbb{1}_{(d_{\Delta}(\beta,\beta^{0})\leq 2t^{*}C_{n})}+2t^{*}\|\beta-\beta^{0}\|_{2}\mathbb{1}_{(d_{\Delta}(\beta,\beta^{0})>2t^{*}C_{n})}\right]$$

$$S(\beta^{0})-S(\beta)\geq\frac{1}{C}\|\beta-\beta^{0}\|_{2}^{2}$$

$$\mathbb{P}\left(\left(\frac{r_{n}}{C_{n}}\wedge r_{n}^{2}\right)\|\hat{\beta}_{n}-\beta^{0}\|_{2}\geq Ky\right)\leq 2e^{-y}$$

$$r_{n}=\left(\frac{n}{pC_{n}\log(n/pC_{n}^{2})}\right)^{1/3}\wedge\left(\frac{n}{p\log(n/p)}\right)^{1/2}.$$

$$\sup_{\beta\equiv\beta(P)}\mathbb{E}_{\beta}\left(\left(\frac{r_{n}}{C_{n}}\wedge r_{n}^{2}\right)\|\hat{\beta}_{n}-\beta\|_{2}\right)\leq K_{1},$$

$$r_{n}=\left(\frac{n}{p\log(n/p)}\right)^{1/3}\wedge\left(\frac{n}{p\log(n/p)}\right)^{1/2}=\left(\frac{n}{p\log(n/p)}\right)^{1/3}$$

$$\frac{r_{n}}{C_{n}}\wedge r_{n}^{2}=r_{n}\wedge r_{n}^{2}=r_{n}=\left(\frac{n}{p\log(n/p)}\right)^{1/3}.$$

$$r_{n}\approx\left(\frac{n}{pC_{n}}\right)^{1/3}\wedge\left(\frac{n}{p}\right)^{1/2}$$

$$\frac{r_{n}}{C_{n}}\wedge r_{n}^{2}$$

Taxonomy

Topics: Economic theories and models · Economic Growth and Productivity · Economic Policies and Impacts

Full text

Optimal Linear Discriminators For The Discrete Choice Model In Growing Dimensions

Debarghya Mukherjee

Moulinath Banerjee

Ya’acov Ritov

University of Michigan

437, West Hall,

1085 South University

Ann Arbor, MI 48109

University of Michigan

275, West Hall,

1085 South University

Ann Arbor, MI 48109

University of Michigan

462, West Hall,

1085 South University

Ann Arbor, MI 48109

Abstract

Manski’s celebrated maximum score estimator for the discrete choice model, which is an optimal linear discriminator, has been the focus of much investigation in both the econometrics and statistics literatures, but its behavior under growing dimension scenarios largely remains unknown. This paper addresses that gap. Two different cases are considered: $p$ grows with $n$ but at a slow rate, i.e. $p/n\rightarrow 0$ ; and $p\gg n$ (fast growth). In the binary response model, we recast Manski’s score estimation as empirical risk minimization for a classification problem, and derive the $\ell_{2}$ rate of convergence of the score estimator under a transition condition in terms of our margin parameter that calibrates the level of difficulty of the estimation problem. We also establish upper and lower bounds for the minimax $\ell_{2}$ error in the binary choice model that differ by a logarithmic factor, and construct a minimax-optimal estimator in the slow growth regime. Some extensions to the general case – the multinomial response model – are also considered. Last but not least, we use a variety of learning algorithms to compute the maximum score estimator in growing dimensions.

Supported by NSF Grant DMS-1712962.

1 Introduction

The maximum score estimator for the discrete choice model was introduced by Charles Manski in his seminal paper Manski [1975] in connection with the stochastic utility model of choice, and has been extensively studied in both the econometrics and the statistics literatures. The binary choice model can be considered as a linear regression model with missing data. More specifically, let

$$Y_{i}^{*}=X_{i}^{\prime}\beta^{0}+\epsilon_{i}$$

where $\{X_{i},\epsilon_{i}\}$ are $n$ i.i.d. pairs, the distribution of $\epsilon_{i}$ is allowed to depend on $X_{i}$ and ${\sf med}(Y_{i}^{*}|X_{i})=X_{i}^{\prime}\beta^{0}$ (i.e. ${\sf med}(\epsilon_{i}|X_{i})=0$ ), but instead of observing the full data, we only get to see $\{Y_{i},X_{i}\}$ where

$$Y_{i}=\text{sgn}(Y_{i}^{*})=\text{sgn}(X_{i}^{\top}\beta^{0}+\epsilon_{i}).$$

The regression parameter $\beta^{0}$ is of interest. The population score function is defined as:

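$$S(\beta)=\mathbb{E}(Y\,\text{sgn}(X^{\prime}\beta))=\mathbb{E}(\text{sgn}(Y^{*})\,\text{sgn}(X^{\prime}\beta))$$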

and the corresponding sample score function is:

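$$S_{n}(\beta)=\frac{1}{n}\sum_{i=1}^{n}Y_{i}\,\text{sgn}(X_{i}^{\mathsf{T}}\beta)=\frac{1}{n}\sum_{i=1}^{n}\text{sgn}(Y_{i}^{*})\,\text{sgn}(X_{i}^{\mathsf{T}}\beta).$$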

The maximum score estimator is defined as any value of $\beta$ that maximizes the sample/empirical score function:

$$\hat{\beta}_{n}=\arg\max_{\beta:\|\beta\|=1}S_{n}(\beta).$$
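As a rough illustration of this definition, the sketch below evaluates the sample score and maximizes it by brute force over a grid of unit vectors. It is not the authors' procedure: the grid search is only feasible for $p=2$, and the names `sample_score` and `max_score_2d`, the tie convention $\text{sgn}(0)=+1$, and the Gaussian toy data are all assumptions made for this example.

```python
import math
import random

def sgn(v):
    # Assumption: ties at zero are mapped to +1.
    return 1 if v >= 0 else -1

def sample_score(beta, X, Y):
    """S_n(beta) = (1/n) * sum_i Y_i * sgn(X_i' beta)."""
    return sum(y * sgn(sum(b * x for b, x in zip(beta, row)))
               for row, y in zip(X, Y)) / len(Y)

def max_score_2d(X, Y, n_angles=720):
    """Grid search over unit vectors (cos a, sin a); only sensible for p = 2."""
    best_beta, best_val = None, -math.inf
    for k in range(n_angles):
        a = 2 * math.pi * k / n_angles
        beta = (math.cos(a), math.sin(a))
        val = sample_score(beta, X, Y)
        if val > best_val:
            best_beta, best_val = beta, val
    return best_beta, best_val

# Toy data (illustrative): beta0 = (1, 0), standard Gaussian covariates and errors.
rng = random.Random(0)
beta0 = (1.0, 0.0)
X = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(500)]
Y = [sgn(x[0] * beta0[0] + x[1] * beta0[1] + rng.gauss(0, 1)) for x in X]
beta_hat, s_hat = max_score_2d(X, Y)
```

Because $S_{n}$ is piecewise constant in $\beta$, several grid points typically attain the same maximal value; the function simply returns the first one it meets.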

Note that some norm restriction on $\beta$ is important both for identifiability of $\beta$ in this model and for meaningful optimization. As $\beta^{0}$ is only identifiable and estimable up to direction, in what follows we take $\|\beta^{0}\|=1$ . We also note that the choice of the maximizer is not important; in fact there is no unique maximizer. In follow-up work (Manski [1985]), Manski proved the consistency of $\hat{\beta}_{n}$ for the true $\beta^{0}$ and some large deviation results under mild assumptions. The asymptotic distributional properties of the maximum score estimator were established by Kim and Pollard (Kim et al. [1990]), who proved that under additional assumptions $(\beta^{0}-\hat{\beta}_{n})=O_{p}(n^{-\frac{1}{3}})$ and that the normalized difference converges in distribution to a non-Gaussian random variable that is characterized as the maximizer of a quadratically drifted Gaussian process. Shortly thereafter, Horowitz (Horowitz [1992]) established that, under smoothness conditions beyond those in Kim et al. [1990], the estimator obtained by maximizing a kernel smoothed version of the score function improves on the convergence rate of the original maximum score estimator. One advantage of Horowitz’s estimator over the original maximum score estimator, from a practical viewpoint, is that the limit distribution in his setting is Gaussian and therefore more amenable to inference, while the quantiles of the non-Gaussian limit are hard to determine. Also, around the same time Klein and Spady (see Klein and Spady [1993]) proved that under the additional assumption ${\mathbb{E}}(Y|X)={\mathbb{E}}\left(Y|X^{\top}\beta^{0}\right)$ , one can obtain a consistent and asymptotically normal estimator of $\beta^{0}$ , which is also semi-parametrically efficient. More recently, Seo and Otsu (Seo et al. [2018], Seo and Otsu [2015]) have extended the asymptotic results on the score estimator to dependent data scenarios. Alternatively, resampling techniques can also be used for inference. 
Manski and Thompson (Manski and Thompson [1986]) suggested that the usual bootstrap yields a good approximation of the distribution of the maximum score estimator, but it turns out that the bootstrap is actually inconsistent, as shown by Abrevaya and Huang (Abrevaya and Huang [2005]) (but see also Sen et al. [2010]). More recently, a model-based smoothed bootstrap approach was proposed by Patra et al. (Patra et al. [2018]). Generic ( $m$ out of $n$ ) subsampling techniques (Politis et al. [1999]) can, of course, be used in principle, but typically suffer from imprecise coverage unless the subsample size $m$ is well-chosen, which is typically a difficult problem. For applications of maximum score estimators and their variants, see Briesch et al. [2002], Fox and Bajari [2013], Bajari et al. [2008] and references therein.

Connections to empirical risk minimization: The maximum score estimator is naturally connected to a classification problem with two classes. In Manski’s problem, we have observations $\{X_{1},Y_{1}\},\cdots,\{X_{n},Y_{n}\}$ , where $X_{i}\in\mathbb{R}^{p}$ and $Y_{i}\in\{-1,1\}$ , these being the labels of the two classes. The conditional class probabilities are specified by

$$\eta(x)=\mathbb{P}(Y=1\mid X=x)=1-F_{\epsilon\mid X=x}(-x^{\mathsf{T}}\beta^{0}).$$

For classifying the $Y_{i}$ ’s using an arbitrary classifier $h$ under 0-1 loss, the population risk is given by $L(h)=\mathbb{P}(Y\neq h(X))$ . Consider the set of classifiers corresponding to all possible hyperplanes, i.e.

$$\mathcal{G}=\{g_{\beta}:g_{\beta}(x)=\text{sgn}(x^{\mathsf{T}}\beta),\ \|\beta\|=1\}.$$

The population risk under 0-1 loss for this family is then given by:

$$L(\beta)=L(g_{\beta})=\mathbb{P}(Y\neq\text{sgn}(X^{\mathsf{T}}\beta)),$$

and is consistently estimated by the empirical risk $L_{n}(\beta)=\mathbb{P}_{n}(Y\neq\text{sgn}(X^{\mathsf{T}}\beta))$ . From the structure of the model, it is easy to see that the Bayes’ classifier, i.e. the classifier which minimizes the population risk in this model (over all possible classifiers) is precisely $g_{\beta_{0}}$ :

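$$\text{sgn}(\eta(x)-0.5)=\text{sgn}(1-F_{\epsilon\mid X=x}(-x^{\mathsf{T}}\beta^{0})-0.5)=\text{sgn}(0.5-F_{\epsilon\mid X=x}(-x^{\mathsf{T}}\beta^{0}))=\text{sgn}(x^{\mathsf{T}}\beta^{0})=g_{\beta^{0}}(x)\quad[\because\ \text{med}(\epsilon\mid X)=0].$$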

Thus $\tilde{\beta}_{n}:=\arg\min_{\beta}\,L_{n}(\beta)$ empirically estimates the Bayes classifier. By simple algebra $S(\beta)=1-2L(\beta)$ and $S_{n}(\beta)=1-2L_{n}(\beta)$ . Since the former is maximized at $\beta_{0}$ and the latter at $\hat{\beta}_{n}$ , it follows that $\hat{\beta}_{n}$ is one particular choice for $\tilde{\beta}_{n}$ . Thus, the maximum score estimator is the minimizer of the empirical risk in this classification problem. The rate of estimation of $\beta^{0}$ depends on two crucial factors: (1) The manner in which $P(Y=1|X)$ changes across the hyperplane and (2) The distribution of $X_{i}$ ’s near the hyperplane. If the conditional probability shifts from $1/2$ rather slowly as we move away from the hyperplane, we have a ‘fuzzier’ classification problem and estimation becomes more challenging. On the other hand, the distribution of the $X_{i}$ ’s governs the density of observed points around the hyperplane, with higher concentration of points being conducive to improved inference. As far as our knowledge goes, there is no work on the high-dimensional aspect of this model, so this paper bridges a gap in the literature.
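The identity $S_{n}(\beta)=1-2L_{n}(\beta)$ follows because $Y_{i}\,\text{sgn}(X_{i}^{\mathsf{T}}\beta)$ equals $+1$ on a correct classification and $-1$ on an error. The short check below is a sketch under the assumption that predictions always lie in $\{-1,+1\}$ (we set $\text{sgn}(0)=+1$); the helper names and the random labels are hypothetical and serve only to show that the identity is purely algebraic.

```python
import random

def sgn(v):
    return 1 if v >= 0 else -1  # assumption: sgn(0) := +1, so predictions lie in {-1, +1}

def predict(beta, x):
    return sgn(sum(b * xi for b, xi in zip(beta, x)))

def empirical_risk(beta, X, Y):
    """L_n(beta): fraction of observations with Y_i != sgn(X_i' beta)."""
    return sum(y != predict(beta, x) for x, y in zip(X, Y)) / len(Y)

def empirical_score(beta, X, Y):
    """S_n(beta): average of Y_i * sgn(X_i' beta)."""
    return sum(y * predict(beta, x) for x, y in zip(X, Y)) / len(Y)

rng = random.Random(1)
X = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(200)]
Y = [rng.choice([-1, 1]) for _ in X]   # arbitrary labels: no model is needed for the identity
beta = [0.6, 0.0, 0.8]                 # a unit vector
gap = abs(empirical_score(beta, X, Y) - (1 - 2 * empirical_risk(beta, X, Y)))
```

Up to floating-point rounding, `gap` is zero for any labels and any $\beta$, which is why maximizing the score and minimizing the empirical 0-1 risk select the same set of vectors.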

The multinomial response discrete choice model: This model, which is a natural extension of its binary counterpart, arises in practice when an individual has to choose among finitely many elements, e.g. picking out a movie among several choices proposed by Netflix. In Manski [1975], Manski also proposed an extension of the maximum score estimator for multinomial responses. We first describe the model. Assume that each individual has to choose from $m$ many alternatives, for each of which they have a utility value. Denote by $u_{i,j}$ , the utility value of the $j^{th}$ alternative for the $i^{th}$ individual. Hence, $i$ will choose the $k^{th}$ alternative only if it provides them maximum utility, i.e.

$$u_{i,k}\geq u_{i,j}\ \text{ for all }j\neq k,\ j\in\{1,2,\dots,m\}.$$

The utility values are modeled as $u_{i,j}=\mathbf{x}_{i,j}^{\top}\beta^{0}+\epsilon_{i,j}$ where $\mathbf{x}_{i,j}\in\mathbb{R}^{p}$ is a vector of observable covariates and $\epsilon_{i,j}$ is an unobservable error. For notational simplicity, define an $m\times p$ matrix $\mathbf{X}_{i}$ for individual $i$ whose $j^{th}$ row is $\mathbf{x}_{i,j}^{\top}$ , the co-variate corresponding to their $j^{th}$ utility. The $u_{i,j}$ ’s are not observed, but we do observe a multinomial vector $\mathbf{y}_{i}\in\{0,1\}^{m}$ for each $i$ , where

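$$y_{i,k}=\begin{cases}1,&\text{if }u_{i,k}\geq u_{i,j}\text{ for all }j\neq k,\ j\in\{1,2,\dots,m\}\\0,&\text{otherwise.}\end{cases}$$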

In words, this vector indicates which alternative has been chosen by individual $i$ . The available data on $n$ individuals are therefore the $n$ pairs $\{\mathbf{y}_{i},\mathbf{X}_{i}\}_{i=1}^{n}$ which can be viewed as i.i.d replicates of a random object $(\mathbf{y}_{1\times m},\mathbf{X}_{m\times p})$ , with the $j$ ’th row of $\mathbf{X}$ written as $\mathbf{x}_{j}^{\top}$ . The response vector $\mathbf{y}$ is related to the unobserved utility vector $(u_{1},u_{2},\ldots,u_{m})$ through the linear model: $u_{j}=\mathbf{x}_{j}^{\top}\beta_{0}+\epsilon_{j}$ .

Under certain assumptions on the distribution of $(\mathbf{y}_{1\times m},\mathbf{X}_{m\times p})$ (see e.g. Assumption 2 of Manski [1975] or the more relaxed version, Assumption 1 of Fox [2007], which stipulates that the joint density of $(\epsilon_{1},\epsilon_{2},\ldots,\epsilon_{m})$ conditional on $\mathbf{X}$ is exchangeable), it can be shown that the probability of choosing the $i^{th}$ utility is driven by the ordering of the deterministic part of the utility function. This is formalized in the rank ordering property described below.

Assumption 1.1 (Rank ordering property).

Define $p(j|\mathbf{X},\beta)$ as the probability of the $j^{th}$ product having maximum utility under a generic regression parameter $\beta$ and conditional on $\mathbf{X}$ being the covariate matrix:

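$$p(j\mid\mathbf{X},\beta)=\mathbb{P}(\mathbf{x}_{j}^{\top}\beta+\epsilon_{j}\geq\mathbf{x}_{k}^{\top}\beta+\epsilon_{k}\ \ \forall k\in\{1,\dots,m\},\ j\neq k\mid\mathbf{X}).$$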

The rank ordering property says: $p(j|\mathbf{X},\beta_{0})\geq p(k|\mathbf{X},\beta_{0})$ if and only if $\mathbf{x}_{j}^{\top}\beta_{0}\geq\mathbf{x}_{k}^{\top}\beta_{0}$ . Note that the probability is taken over the joint distribution of $\{\epsilon_{k}\}_{k=1}^{m}$ given $\mathbf{X}$ .

This motivates the estimation of the true parameter $\beta^{0}$ by maximizing the following score function:

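$$S_{n}^{(mult)}(\beta)=\frac{1}{nm(m-1)}\sum_{i=1}^{n}\sum_{j=1}^{m}y_{i,j}\sum_{k\neq j}\mathbb{1}(\mathbf{x}_{i,j}^{\top}\beta>\mathbf{x}_{i,k}^{\top}\beta).$$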

This is a natural generalization of the maximum score to multinomial responses. The idea is to find a $\beta$ that is most commensurate with the observed data. If $j(i)$ is the observed utility for the $i$ ’th individual, only the $j(i)$ ’th term in the inner sum is relevant, and given this information, we look for $\beta$ that makes the deterministic part of the $j(i)$ ’th utility larger than those of most other utilities across all $n$ observations. Hence, with enough data, any maximizer of $S^{(mult)}_{n}(\beta)$ can be expected to be close to $\beta^{0}$ with high probability under Assumption 1.1.
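A direct transcription of the multinomial score into code may help make the counting explicit. This is a minimal sketch: the function name `multinomial_score`, the nested-list data layout, and the tiny data set are assumptions for illustration and do not come from the paper.

```python
def multinomial_score(beta, X, y):
    """S_n^(mult)(beta), with X[i][j] the covariate vector x_{i,j} and y[i][j] in {0, 1}."""
    n, m = len(X), len(X[0])
    total = 0
    for Xi, yi in zip(X, y):
        utils = [sum(b * x for b, x in zip(beta, row)) for row in Xi]  # x_{i,j}' beta
        for j in range(m):
            if yi[j]:  # only the chosen alternative contributes to the inner sum
                total += sum(utils[j] > utils[k] for k in range(m) if k != j)
    return total / (n * m * (m - 1))

# Two individuals, m = 3 alternatives, p = 2 covariates (made-up numbers).
X = [
    [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]],
    [[0.5, 0.5], [2.0, 0.0], [0.0, -1.0]],
]
y = [[1, 0, 0], [0, 0, 1]]
score = multinomial_score([1.0, 0.0], X, y)
```

With $\beta=(1,0)$ the first individual's chosen alternative beats both competitors (contributing 2), while the second individual's chosen alternative beats none (contributing 0), so the score is $2/(2\cdot 3\cdot 2)=1/6$.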

We also note that this directly reduces to the binary response model presented at the beginning, when $m=2$ . In this case, there are only two utility values for the $i^{th}$ individual who chooses the first option only if $u_{i,1}>u_{i,2}$ . Now,

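$$u_{i,1}-u_{i,2}>0\iff(\mathbf{x}_{i,1}-\mathbf{x}_{i,2})^{\top}\beta_{0}+(\epsilon_{i,1}-\epsilon_{i,2})>0,$$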

and hence, taking $X_{i}=(\mathbf{x}_{i,1}-\mathbf{x}_{i,2})$ , $\epsilon_{i}=(\epsilon_{i,1}-\epsilon_{i,2})$ and $Y_{i}$ to be a binary response which takes value $1$ when item $1$ is chosen and $-1$ otherwise, we recover the binary response model as mentioned in equation (1.1) via a simple linear transformation.
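The same reduction can be traced at the level of the sample criteria. The following derivation is our own sketch, under the assumptions that $Y_{i}\in\{-1,1\}$ codes the choice as above and that there are no ties ( $X_{i}^{\top}\beta\neq 0$ for all $i$ ); it shows that for $m=2$ the multinomial score is an increasing affine function of the binary score, so the two have the same maximizers.

```latex
% Assumptions: m = 2, X_i = x_{i,1} - x_{i,2}, Y_i = 1 if item 1 is chosen and -1 otherwise,
% and no ties, i.e. X_i^T beta != 0 for every i.
\begin{align*}
S_n^{(mult)}(\beta)
  &= \frac{1}{2n}\sum_{i=1}^{n}\Big[\,y_{i,1}\,\mathbb{1}(X_i^{\top}\beta>0)
     + y_{i,2}\,\mathbb{1}(X_i^{\top}\beta<0)\Big] \\
  &= \frac{1}{2n}\sum_{i=1}^{n}\frac{1+Y_i\,\mathrm{sgn}(X_i^{\top}\beta)}{2}
   = \frac{1+S_n(\beta)}{4}.
\end{align*}
```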

There is a vast literature, especially in economics, which deals with the discrete choice model, although most of it is confined to the binary response model. Lee (Lee [1995]) extended the analysis of Klein and Spady [1993] for the binary response model to the multinomial case under an appropriate version of the assumptions in the latter paper to obtain a consistent and asymptotically normal semi-parametric efficient estimator. Fox (Fox [2007]) proved the consistency of the maximum score estimator for the multinomial response model under a partially missing data assumption, where the chosen utility and a subset of alternative utilities are observed, without Manski’s assumption of conditionally independent errors (Assumption 2 of Manski [1975]). Recently, Yan and Yoo (Yan and Yoo [2019]) extended the analysis of Horowitz (Horowitz [1992]) to establish asymptotic normality of a kernel smoothed estimator in the multinomial model.

To the best of our knowledge, all previous work on the binary as well as the multinomial discrete choice model has been done under the setting of fixed dimensional covariates and, in the latter model, also under a fixed number of utilities. Our motivation for studying the maximum score estimator in these models is two-fold. First, the score estimator works under very mild conditions on the underlying data generating mechanisms (particularly, through the flexible dependence of the error given the covariate), and is therefore robust to model misspecification, as a consequence of which it has attracted the attention of multiple researchers in both economics and statistics. Through a study of this model in growing dimensions, and results on the concentration properties of the estimator as well as minimax estimation rates in this problem, we provide a novel and interesting direction to the literature on this topic, which we hope will be carried forward by others interested in this genre of problems. Second, from a purely statistical point of view, the score estimator is one of the classic examples of non-regular estimators which arise either through the optimization of criterion functions that are discontinuous in the parameter (note the indicator functions involved in $S_{n}$ and $S_{n}^{(mult)}$ ), or through optimization problems where the estimator falls on the boundary of the parameter space (e.g., in modern statistical problems involving convex optimization where the estimator lies on a face of a convex cone or more generally a convex set). Such estimators have been known in the literature from as early as Chernoff’s work in the 1960s (e.g. see Chernoff [1964]), and were investigated through an integrated approach by Kim and Pollard (Kim et al. [1990]) in the specific setting of ‘cube-root asymptotics’: the estimators treated in that paper demonstrated an $n^{1/3}$ convergence rate and non-Gaussian limits, and an important example in that paper was the maximum score estimator. There have been a variety of related developments but all work in this arena has also been in the fixed dimension paradigm. Our current study of the score estimator, to the best of our knowledge, is the first example of a systematic study of a non-regular estimator in growing dimensions. While concentration and minimaxity properties have been dealt with quite thoroughly, inferential questions remain open; we view our contributions as an important foray into hitherto uncharted territory, but we have only scratched the surface.

**Major findings:** Here we articulate our findings and give a brief description of the organization of the rest of the paper. We note at the outset that the $\ell_{2}$ metric is a natural measure of distance in this problem since the angle between two unit-norm vectors, which measures their directional divergence, is a function of the $\ell_{2}$ norm of their difference.
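The relation between the angle and the $\ell_{2}$ distance is elementary; we spell it out here for unit vectors $\beta,\beta'$ with angle $\theta\in[0,\pi]$ between them (the symbol $\beta'$ is introduced only for this remark).

```latex
% For unit vectors \beta, \beta' with angle \theta between them:
\[
\|\beta-\beta'\|_2^2
  = \|\beta\|_2^2+\|\beta'\|_2^2-2\,\beta^{\top}\beta'
  = 2-2\cos\theta
  = 4\sin^2(\theta/2),
\qquad\text{so}\qquad
\|\beta-\beta'\|_2 = 2\sin(\theta/2).
\]
```

Since $\theta\mapsto 2\sin(\theta/2)$ is strictly increasing on $[0,\pi]$ , the $\ell_{2}$ distance and the angle determine each other.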

Section 2.1 deals with the moderate growth setting i.e. $p=o(n)$ , while Section 2.2 investigates the fast growth regime: $p\gg n$ . In the moderate growth setting, we establish the rate of convergence of the maximum score estimator in the $\ell_{2}$ norm in terms of $(n,p,C_{n})$ along with an exponential concentration bound, where $C_{n}$ is a sequence of constants appearing in Assumption 2.2 assumed non-increasing in terms of $n$ . The magnitude of $C_{n}$ calibrates the difficulty of the estimation problem: sequences with $C_{n}$ bounded away from 0 present the hardest problems while $C_{n}$ decreasing to 0 makes the estimation problem easier, which reflects in the convergence rate derived in Theorem 2.6. An elaborate discussion on Assumption 2.2 and comparisons to a standard low noise Assumption (Assumption 2.1) is provided in Section 2. We also establish both minimax lower and upper bounds for estimating $\beta^{0}$ and show that the maximum score estimator is minimax optimal up to a log factor. Furthermore, when $C_{n}\equiv C$ , which is later argued to be statistically the most interesting regime, we are able to construct an alternative estimator with minimax optimal rate of convergence.

In the $p\gg n$ regime, we demonstrate that under a sparsity constraint, an appropriate penalized risk minimization method provides a super-set of the active covariates with exponentially high probability. As before, we derive an exponential concentration bound for the penalized maximum score estimator in the $\ell_{2}$ norm, which now depends on $s_{0}$ , the sparsity of $\beta_{0}$ , in addition to $n,p,C_{n}$ . Here also, smaller values of $C_{n}$ translate to improved convergence rates. We derive minimax lower and upper bounds which are again discrepant up to a log factor.

In Section 3 we deal with the multinomial response model. Assumption 1.1 guarantees the uniqueness of the population maximizer, while Assumptions 3.2 and 3.3 are modified versions of Assumption 2.2 and 2.3 tailored for the multinomial response model. Under these modified assumptions, we establish finite sample concentration bounds for the score estimator both in the slowly growing regime and the fast growing regimes. When $m=2$ , our obtained rates of convergence reduce to those obtained for the binary response model in Section 2.

In Section 4, we present some simulation results for the binary choice model. As mentioned earlier, the maximum score estimator cannot be computed in polynomial time in the dimension, owing to the discontinuity of the loss function $L_{n}$ defined previously in this section. A standard approach is to compute an approximate solution by minimizing a convex surrogate of the $0\mbox{-}1$ loss, as is evident from the copious amount of work in both the statistics and machine learning literatures on this topic (see e.g. Friedman et al. [2001]): e.g., logistic regression replaces $0\mbox{-}1$ loss by the logit loss, SVM uses the hinge loss, while AdaBoost relies on the exponential loss. Another direction involves smoothing the $0\mbox{-}1$ loss via some distribution kernel (which makes the loss function differentiable) and computing the minimizer by some variant of gradient descent. Recently, a homotopic path following approach to this problem has been proposed in Feng et al. [2019]. We present a comparative study of three methods: SVM, logistic regression and the homotopic path following algorithm mentioned above. The main takeaway from this simulation study is that SVM performs better than logistic regression when $p/n\rightarrow 0$ under heterogeneity of errors, while the performance of the method proposed in Feng et al. [2019] is comparable to SVM for $p\gg n$ . As a matter of fact, the method based on homotopic path following performs somewhat better than SVM, but its run-time is also higher.
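To see in what sense these losses are surrogates, the sketch below writes each one as a function of the margin $y\,x^{\top}\beta$ and tabulates it next to the $0\mbox{-}1$ loss. The function names are ours, counting a zero margin as an error is a convention we assume, and the base-2 logarithm in the logit loss is a normalization chosen here so that the loss equals 1 at margin 0; none of this reproduces the simulation setup of Section 4.

```python
import math

def zero_one(margin):
    return 1.0 if margin <= 0 else 0.0      # margin = y * x'beta; ties counted as errors (assumption)

def hinge(margin):
    return max(0.0, 1.0 - margin)            # loss minimized by the SVM

def logistic(margin):
    return math.log2(1.0 + math.exp(-margin))  # logit loss, rescaled to base 2 so that loss(0) = 1

def exponential(margin):
    return math.exp(-margin)                  # loss underlying AdaBoost

margins = [-2.0, -0.5, 0.0, 0.5, 2.0]
table = [(m, zero_one(m), hinge(m), logistic(m), exponential(m)) for m in margins]
```

Each surrogate is convex in the margin and lies on or above the $0\mbox{-}1$ loss at every margin in the table, which is what makes minimizing it a tractable proxy for minimizing $L_{n}$ .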

Section 5 presents a brief discussion of certain aspects of our work including certain natural extensions, some of which are elaborated on in the supplement, as well as future challenges of this direction of research. Section B presents the proofs of two key results while the remaining proofs are relegated to the supplement in the interests of space.

2 Asymptotic properties and minimax bounds

We now present concentration and rate of convergence results for the maximum score estimator in the binary response model in growing dimensions. To that end, we start with some assumptions on the distribution on $X$ and the behavior of $P(Y=1|X)$ near the Bayes hyperplane $X^{\top}\beta^{0}=0$ , which play a central role in the subsequent development. To control the behavior of $P(Y=1|X)$ , we introduce a version of Tsybakov’s low noise assumption (Mammen et al. [1999], Tsybakov et al. [2004]) which has been used extensively in the classification literature. For the sake of convenience of the reader we first state the regular low noise condition below.

Assumption 2.1 (Soft margin Assumption).

Let ${\mathbb{P}}$ denote the joint distribution of $(X,\epsilon)$ in dimension $p\equiv p_{n}$ . Then, with $\eta(X):={\mathbb{P}}(Y=1|X)$ ,

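$$\mathbb{P}\left(\left|\eta(X)-\frac{1}{2}\right|\leq t\right)\leq Ct^{\alpha}\quad\forall\ 0\leq t\leq t^{\star},$$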

for some constant $C$ and $0<t^{\star}<1/2$ and $\alpha>0$ .

The soft margin condition quantifies how the conditional class probability deviates from $1/2$ near the Bayes’ hyperplane in terms of a smoothness parameter $\alpha>0$ . Larger values of $\alpha$ translate to sharper changes of $\eta(X)$ around the Bayes’ hyperplane and correspond to easier classification problems. For reasons to be explained below, we do not work with the above condition but a slightly tuned version of it:

Assumption 2.2 (Transition condition).

Let ${\mathbb{P}}$ denote the joint distribution of $(X,\epsilon)$ in dimension $p\equiv p_{n}$ . Then, with $\eta(x):={\mathbb{P}}(Y=1|X=x)$ ,

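$$\mathbb{P}\left(\left|\eta(X)-\frac{1}{2}\right|\leq t\right)\leq C_{n}t\quad\forall\ 0\leq t\leq t^{\star},$$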

where $\{C_{n}\}$ is a bounded sequence of constants, and $t^{\star}$ lies strictly between 0 and $1/2$ .

**Discussion of Assumption 2.2:** To understand the effect of $C_{n}$ , consider the special case when $C_{n}=C$ , a fixed constant. Then, the modified condition is just the low noise condition with smoothness parameter $\alpha=1$ . Next, consider a situation where $C_{n}$ decreases to 0 with $n$ (we view $p$ as a function of $n$ ). In this case, the transition of $\eta(x)$ from below $1/2-t$ on one side of the hyperplane to above $1/2+t$ on the other side is sharper compared to the fixed $C$ case, since the probability mass assigned by the covariate distribution to the region where $\eta(x)$ is close to $1/2$ is of a smaller order than with fixed $C$ . This translates to an easier estimation problem as $n$ grows, and a correspondingly improved rate of estimation: the smaller the order of $C_{n}$ , the faster the rate. In fact, $C_{n}=0$ corresponds to a jump around the Bayes’ hyperplane and a best possible rate of order $1/n$ in fixed dimension. On the other hand, when $C_{n}$ is large, ${\mathbb{P}}\left(\left|\eta(X)-\frac{1}{2}\right|\leq t\right)$ is substantially larger, which implies the presence of a fair amount of fuzziness near the Bayes’ hyperplane (there is now a substantial mass of points around the hyperplane with $\eta(x)$ values very close to $1/2$ , which are hard to classify), resulting in a slower rate of estimation.

The transition condition captures the intrinsic difficulty of the estimation problem in terms of the sequence of constants $\{C_{n}\}$ whereas the low-noise condition describes it in terms of the exponent $\alpha$ of $t$ , with larger values of $\alpha$ corresponding to easier estimation problems (enhanced convergence rates for larger $\alpha$ ). Both formulations therefore capture the same phenomenon, albeit in somewhat different ways. Note that the low noise assumption was originally formulated (Mammen et al. [1999]) to deal with irregular boundaries, whereas our condition is more naturally tuned to smooth hyperplane boundaries in the discrete choice model. Our reason for favoring the modified low noise condition is that it is much more intuitive and allows a clean and integrated presentation of the minimax rates of convergence in this problem in terms of $\{C_{n}\}$ , which does not appear to be the case with the low noise assumption. For a slightly different treatment of this problem under Assumption 2.1, see a previous draft of this manuscript Mukherjee et al. [2019].

We now show that the case $C_{n}=C$ in our transition condition arises naturally for a rich family of distributions under some natural assumptions. Observe that the family of distributions with margin condition involving $C_{n}\downarrow$ is a sub-class of the family of distributions with $C_{n}\equiv C$ . Assume, for example, that (a) $X\perp\perp\epsilon$ and the density of $\epsilon$ , say $f$ , does not depend on $p$ ; (b) $f(x)\geq c_{\delta}>0$ on $(-\delta,\delta)$ for some $\delta>0$ ; (c) the density of $X^{\mathsf{T}}\beta^{0}$ is bounded by a positive number $k$ on $(-\delta^{\prime},\delta^{\prime})$ for some $\delta^{\prime}>0$ , with $k,\delta^{\prime}$ not depending on $p$ . Then, for $0\leq t\leq t^{*}$ , where $\delta\wedge\delta^{\prime}>(F_{\epsilon}^{-1}(0.5+t^{*})\vee-F_{\epsilon}^{-1}(0.5-t^{*}))$ :

[TABLE]

which is the condition corresponding to $C_{n}=C=2k/c_{\delta}$ .
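A short sketch of why conditions (a) to (c) lead to a constant of this form may be helpful. It rests on two assumptions about the model set-up that are not restated in this section: that $\epsilon$ has median zero, so $F_{\epsilon}(0)=1/2$ , and that $\eta(x)={\mathbb{P}}(X^{\mathsf{T}}\beta^{0}+\epsilon\geq 0\mid X=x)=1-F_{\epsilon}(-x^{\mathsf{T}}\beta^{0})$ . Under these assumptions, for $0\leq t\leq t^{*}$ :

```latex
\begin{align*}
\left|\eta(x)-\tfrac{1}{2}\right|\le t
  \ &\Longleftrightarrow\
  F_{\epsilon}^{-1}\!\left(\tfrac{1}{2}-t\right)\ \le\ -x^{\mathsf{T}}\beta^{0}\ \le\ F_{\epsilon}^{-1}\!\left(\tfrac{1}{2}+t\right), \\
t \;=\; \int_{0}^{F_{\epsilon}^{-1}(1/2+t)} f(u)\,du \;\ge\; c_{\delta}\,F_{\epsilon}^{-1}\!\left(\tfrac{1}{2}+t\right)
  \ &\Longrightarrow\
  F_{\epsilon}^{-1}\!\left(\tfrac{1}{2}+t\right)\le \frac{t}{c_{\delta}},
  \qquad -F_{\epsilon}^{-1}\!\left(\tfrac{1}{2}-t\right)\le \frac{t}{c_{\delta}} \ \text{ (by the same argument)}, \\
{\mathbb{P}}\left(\left|\eta(X)-\tfrac{1}{2}\right|\le t\right)
  \ &\le\ k\left[F_{\epsilon}^{-1}\!\left(\tfrac{1}{2}+t\right)-F_{\epsilon}^{-1}\!\left(\tfrac{1}{2}-t\right)\right]
  \ \le\ \frac{2k}{c_{\delta}}\,t .
\end{align*}
```

The second line uses (b), since both quantiles lie in $(-\delta,\delta)$ by the choice of $t^{*}$ . The third line uses (c), since the interval of values of $-X^{\mathsf{T}}\beta^{0}$ is contained in $(-\delta^{\prime},\delta^{\prime})$ , where the density is at most $k$ .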

Using an inverse-Lipschitz type condition, one can also let $\epsilon$ depend on $X$ . Suppose that the conditional distribution of $\epsilon$ given $X$ satisfies:

[TABLE]

for some $C,\xi>0$ independent of $p$ , for ${\mathbb{P}}_{X}$ -almost every $X$ . This holds, for example, if for $P_{X}$ almost all $x$ , the conditional density satisfies $f_{\epsilon|X=x}(\zeta)\geq c>0$ on a fixed neighborhood $(-\delta^{\prime},\delta^{\prime})$ around $0$ , with $(c,\delta^{\prime})$ not depending on $p$ . The transition condition is now satisfied for fixed $C$ under the same condition on the density of $X^{\mathsf{T}}\beta^{0}$ as before. An example of such dependence of $\epsilon$ on $X$ is $\epsilon|X=x\sim N(0,1+(\|x\|_{2}\wedge 1))$ .

Our next assumption regarding the marginal distribution of $X$ is that the probability of the wedge shaped region between the true hyperplane and any other hyperplane under the distribution of $X$ is related to the angle between the corresponding normal vectors.

Assumption 2.3 (Distribution assumption on covariates).

The distribution of $X$ satisfies the following condition:

[TABLE]

for all $\beta\in S^{p-1}$ , where the constant $c_{1}>0$ , does not depend on $n,p$ .

**Discussion of Assumption 2.3:** The above assumption plays a critical role in this paper, relating the underlying geometry in the problem to the probability distribution of the covariates. It is used, for example, in the proposition below, to relate the curvature of the population score function around its maximizer $\beta_{0}$ to the angle between $\beta_{0}$ and a generic unit vector $\beta$ . The magnitude of the curvature plays a pivotal role in deriving the rate of convergence of Manski’s estimator in both the slow and fast growth regimes (Theorems 2.6 and 2.14 respectively), where upper tail probabilities for $\|\hat{\beta}-\beta_{0}\|$ are related to upper tail probabilities for $S(\beta_{0})-S(\hat{\beta})$ (which is also the difference in the population risks at these two vectors) via Assumption 2.2. In that respect, this assumption can be viewed as an analogue of the compatibility or restricted eigenvalue condition in the classical high-dimensional linear regression problem, which helps convert bounds on the prediction error of the Lasso estimator to its estimation error. In this context, it is interesting to consider a specific violation of the assumption: namely, when ${\mathbb{P}}_{X}(\text{sgn}(X^{\mathsf{T}}\beta)\neq\text{sgn}(X^{\mathsf{T}}\beta^{0}))=0$ for all $\beta$ sufficiently close to $\beta_{0}$ in angular distance. In this case, if $X$ for example is supported on a compact domain, one can perturb the Bayes hyperplane by small rotations, but since the corresponding wedges have no mass under $P_{X}$ , no data points fall in such regions, and the Bayes hyperplane cannot even be uniquely identified. Examples of families of distributions (e.g. elliptically symmetric $X$ ) that satisfy Assumption 2.3 are available in Section 5.

Proposition 2.4.

Under Assumptions 2.2 and 2.3, the curvature of the population score function around the truth satisfies:

[TABLE]

for all $\beta\in S^{p-1}$ , where $d_{\Delta}(\beta,\beta^{0})={\mathbb{P}}_{X}(\text{sgn}(X^{\top}\beta)\neq\text{sgn}(X^{\top}\beta^{0}))$ and $t^{*},C_{n}$ are the same constants as in Assumption 2.2.

The proof of this proposition relies on relating $S(\beta)-S(\beta^{0})$ to ${\mathbb{P}}_{X}(\text{sgn}(X^{\top}\beta)\neq\text{sgn}(X^{\top}\beta^{0}))$ via Assumption 2.2, and the latter to $\|\beta-\beta^{0}\|_{2}$ , via Assumption 2.3. One takeaway from the proposition is that the excess risk is lower bounded by a dichotomous distance in terms of $\|\beta-\beta_{0}\|_{2}$ and $d_{\Delta}(\beta,\beta_{0})$ . For $\beta$ close to $\beta_{0}$ in the sense that $d_{\Delta}(\beta,\beta_{0})$ is small relative to $C_{n}$ , we have a quadratic curvature whose sharpness is determined by the magnitude of $C_{n}$ , while for $\beta$ away from $\beta_{0}$ , the curvature is linear. As we will see below, the dichotomous nature of the distance imposes a natural lower bound on the estimation error of the maximum score estimator, irrespective of how small $C_{n}$ is.

Remark 2.5.

Note that when $C_{n}=C$ fixed and assuming without loss of generality $2t^{*}C>1$ , we conclude:

[TABLE]

for all $\beta\in S^{p-1}$ . This same condition can be achieved using Assumption 2.1 with $\alpha=1$ .

2.1 Rate of convergence when $p/n\rightarrow 0$

We first establish a rate of convergence for $\hat{\beta}$ .

Theorem 2.6.

Let $\hat{\beta}_{n}$ and $\beta_{0}$ be the maximizers of $S_{n}(\beta)$ and $S(\beta)$ respectively. Then under Assumptions 2.2 and 2.3, for some constant $K>0$ (not depending on $n,p$ ):

[TABLE]

for all $y\geq 1$ , where:

[TABLE]

This implies that,

[TABLE]

where $K_{1}>0$ is some constant which depends on the model constants $t^{*},c_{1}$ introduced in the assumptions and some other universal constants. Note that the supremum in the above display is taken over all distributions $P$ corresponding to binary response models satisfying Assumptions 2.2 and 2.3 for some regression parameter $\beta\in\mathcal{S}^{p-1}$ (viewed as a functional of $P$ ) but with $t^{*},c_{1}$ held fixed.

Remark 2.7.

Note that our rate of convergence depends on three parameters $(n,p,C_{n})$ . To understand the implications of the obtained expression for the rate, assume initially that $C_{n}$ is a constant (statistically of primary interest as discussed in Section 1), assumed without loss of generality to be 1. Then the value of $r_{n}$ reduces to:

[TABLE]

and

[TABLE]

Hence, up to a log factor, we recover the analogue of the cube-root rate for growing dimension.

One may wonder what is the best possible rate that can be obtained from the above expression. An inspection of the rate expression immediately implies that we cannot improve upon $n/(p\log{(n/p)})$ , the high dimensional rate analogue of change-point estimation. Some more insight can be gleaned by ignoring the log-factor in the rate expression. In that case,

[TABLE]

and the rate of convergence based on this approximation is given by

[TABLE]

which is shown to be minimax optimal in Theorem 2.11. The above equality follows from the observation that, if $C_{n}\geq p/n$ , then $(n/pC_{n}^{2})^{1/3}$ is the minimum among the four terms, while $(n/p)$ is the minimum otherwise. This indicates that the rate of convergence improves with decreasing $C_{n}$ , but only up to $n/p$ modulo a log factor.

Alternatively, one can study the exact expression for the rate by taking special but natural choices for $p,C_{n}$ in terms of $n$ . Concretely, let $p\sim n^{\tau}$ and $C_{n}\sim n^{-\lambda}$ for $0<\tau<1$ and $\lambda>0$ , so that $p/n\sim n^{-(1-\tau)}$ . Then $C_{n}$ is of larger order than $p/n$ when $\lambda<1-\tau$ , in which case some simple algebra shows the rate of convergence $(r_{n}/\sqrt{C_{n}})\wedge r_{n}^{2}$ to be $((n/pC_{n}^{2})/\log(n/pC_{n}^{2}))^{1/3}$ . On the other hand, when $\lambda\geq 1-\tau$ , i.e. $C_{n}$ is of the same or lower order than $p/n$ , the rate of convergence becomes $(n/p)/\log(n/p)$ .
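As a quick numerical companion to this polynomial parametrization, the sketch below computes the exponent $a$ such that the log-free rate $(n/pC_{n}^{2})^{1/3}\wedge(n/p)$ equals $n^{a}$ . The function name `rate_exponent` is hypothetical and the log factors are deliberately ignored, so this is only an illustration of the two regimes, not a statement of the exact rate.

```python
def rate_exponent(tau: float, lam: float) -> float:
    """Exponent a with (n/(p C_n^2))^(1/3) ^ (n/p) = n**a, for p ~ n**tau, C_n ~ n**(-lam).

    Log factors are ignored; this only illustrates which of the two terms is binding.
    """
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie in (0, 1)")
    if lam <= 0.0:
        raise ValueError("lam must be positive")
    cube_root_term = (1.0 - tau + 2.0 * lam) / 3.0  # exponent of (n / (p C_n^2))^(1/3)
    change_point_term = 1.0 - tau                   # exponent of n / p
    return min(cube_root_term, change_point_term)


# C_n of larger order than p/n: the cube-root type term is binding.
slow_decay = rate_exponent(tau=0.5, lam=0.25)
# C_n of smaller order than p/n: the rate saturates at n / p.
fast_decay = rate_exponent(tau=0.5, lam=1.0)
```

The two terms coincide exactly when $\lambda=1-\tau$ , which is the boundary between the two regimes.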

The proof of Theorem 2.6 relies on a concentration inequality (Theorem 2 from Massart et al. [2006]) to obtain a bound on the excess risk $S(\hat{\beta})-S(\beta^{0})$ , which, along with Assumption 2.3, yields a concentration bound on $\|\hat{\beta}-\beta^{0}\|_{2}$ . A natural question that arises here is whether the logarithm in the above rate, which arises from the effect of growing dimension on the shattering numbers of the linear classifiers involved, can be dispensed with. While it is unclear whether the exact $(pC^{2}_{n}/n)^{1/3}\vee(p/n)$ rate is achievable, we demonstrate, in what follows, that for $C_{n}=C$ , it is possible to construct an estimator whose rate of convergence is $(p/n)^{1/3}$ under the following additional assumption.

Assumption 2.8.

We impose some further constraints on the distribution of $X$ and the population score function:

1. The distribution of $X$ satisfies

[TABLE]

for all $\{\beta:\|\beta-\beta_{0}\|_{2}\leq 1\}$ , where the constant $C^{\prime}>0$ does not depend on $n$ and $p$ .

2. For some small $u_{0}>0$ ,

[TABLE]

for all $\{\beta:\|\beta-\beta_{0}\|_{2}\leq u_{0}\}$ , where the constant $u_{+}>0$ does not depend on $n$ and $p$ .

The construction of the estimator can be briefly described as follows: Generate (enough) points randomly on the surface of the unit sphere, such that with high probability some of the generated points are in a sufficiently small neighborhood of $\beta^{0}$ . Then, maximize the empirical score function on the generated points. We show in the following theorem that this empirical maximizer converges to the truth at rate $(p/n)^{1/3}$ :

Theorem 2.9.

Suppose the margin condition (Assumption 2.2) is satisfied for $C_{n}=C$ fixed, and that Assumptions 2.3 and 2.8 hold. Then, there exists an estimator $\tilde{\beta}$ , which can be constructed by the above recipe [with technical details of the construction available in the proof], such that

[TABLE]
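The recipe described before Theorem 2.9 (generate random points on the unit sphere, then maximize the empirical score over them) can be sketched as follows. This is a minimal illustration and not the construction used in the proof: the number of candidates is left as a free parameter, and the score is assumed to take the form $S_{n}(\beta)=n^{-1}\sum_{i}Y_{i}\,\text{sgn}(X_{i}^{\top}\beta)$ with $Y_{i}\in\{-1,+1\}$ , which may differ from the paper's normalization. All function names are hypothetical.

```python
import math
import random


def random_unit_vector(p, rng):
    """Draw a point uniformly on the unit sphere S^{p-1} by normalizing a Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(p)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def empirical_score(beta, X, y):
    """ASSUMED score: S_n(beta) = (1/n) * sum_i y_i * sgn(x_i' beta), with y_i in {-1, +1}."""
    total = 0
    for x, yi in zip(X, y):
        index = sum(b * c for b, c in zip(beta, x))
        total += yi if index >= 0 else -yi
    return total / len(y)


def random_search_estimator(X, y, n_candidates, seed=0):
    """Maximize the empirical score over randomly generated points on the sphere."""
    rng = random.Random(seed)
    p = len(X[0])
    best_beta, best_score = None, -math.inf
    for _ in range(n_candidates):
        beta = random_unit_vector(p, rng)
        s = empirical_score(beta, X, y)
        if s > best_score:
            best_beta, best_score = beta, s
    return best_beta, best_score


# Toy usage: noise-free labels from a hypothetical beta0 in dimension 3.
data_rng = random.Random(1)
beta0 = [1.0, 0.0, 0.0]
X = [[data_rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(100)]
y = [1 if sum(b * c for b, c in zip(beta0, x)) >= 0 else -1 for x in X]
beta_tilde, score_tilde = random_search_estimator(X, y, n_candidates=500)
```

The number of candidates needed to land near $\beta^{0}$ grows quickly with $p$ , so a plain random search like this is only practical in small dimensions.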

Remark 2.10.

Assumption 2.8 (2) as well as the construction of the grid estimator take into account the fact that $C_{n}=C$ is fixed. In that sense, the new estimator is not adaptive, whereas the maximum score estimator is agnostic to the value of $C_{n}$ . We believe that the log factor in the convergence rate is the price paid for adaptivity. For more insight into Assumption 2.8, see Section 5.

Finally, we show that the generic minimax lower bound for this estimation problem (i.e. the $C_{n}$ ’s are not restricted to be constant) is $\left(pC^{2}_{n}/n\right)^{1/3}\vee\left(p/n\right)$ , i.e. we cannot estimate the linear discriminator at a better rate without more assumptions:

Theorem 2.11 (Minimax Lower bound).

We have:

[TABLE]

for some constant $K_{L}$ that does not depend on $(n,p)$ . For $C_{n}=C$ fixed, the lower bound is of the order $(p/n)^{2/3}$ . The supremum is taken over the same class of distributions as in Theorem 2.6.

Remark 2.12.

The proof of the above result relies on constructing competing models from the collection of distributions that approach each other at the optimal rate, $(pC_{n}^{2}/n)^{1/3}$ . The core challenge lies in constructing these alternative models with sufficient care, and then invoking Assouad’s lemma (e.g. see chapter 2 of Tsybakov [2009]) to establish the rate. The same minimax rate is true for the smaller class of distributions formed by intersecting $\mathcal{P}$ with the class of distributions satisfying Assumption 2.8 for some positive constants $C^{\prime},u_{0},u_{+}$ , since the local alternatives constructed in the proof satisfy this assumption as well. Therefore, the grid estimator is minimax rate optimal for this smaller class of distributions.

2.2 Rate of convergence when $p\gg n$

We now turn to the case where $p$ , the dimension of the covariate vector, is larger than $n$ . In this case, meaningful estimation and inference is only possible under structural assumptions on $\beta^{0}$ that regulate its complexity relative to the size of the data, and any meaningful estimation procedure needs to incorporate this constraint. Usually, such structural assumptions are handled by imposing a penalty on the underlying loss function. The most natural structural constraint on a regression type parameter is one of sparsity, i.e. only a small subset of the co-ordinates of $\beta_{0}$ influences the response (i.e. is different from zero). In the high-dimensional linear regression or GLM framework, the natural loss function is convex and the standard approach is to penalize the (convex) $\ell_{1}$ norm of the parameter, which gives rise to a clean convex optimization problem with a well-characterized solution (see Tibshirani [1996], Greenshtein et al. [2004], Van de Geer et al. [2014], Bühlmann and Van De Geer [2011], Bickel et al. [2009], Miolane and Montanari [2018] and references therein). The corresponding optimizers are seen to have desirable statistical properties, e.g. consistency in various norms, minimax convergence rates and so forth. Furthermore, $\ell_{1}$ penalization is a natural convex relaxation of $\ell_{0}$ penalization, which is the most direct approach to the sparsity constraint.

Another key feature of high dimensional inference is model selection. Under the sparsity constraint, most variables are inactive and a good model selection algorithm needs to include the active set with high probability but relatively few inactive variables. Though model selection/feature selection in the high-dimensional linear regression model has been studied extensively over the past two decades (e.g. see Zhao and Yu [2006], Huang et al. [2008], Wei and Huang [2010], Yuan and Lin [2006], Zhang et al. [2008] and references therein), the problem remains relatively unaddressed in the classification set-up.

Be that as it may, the optimization problem that produces the maximum score estimator is not only non-differentiable and non-convex, it is actually discontinuous, and therefore adding a convex penalty like $\ell_{1}$ affords no computational advantage. While one possible route is to use an $\ell_{1}$ penalized version of a kernel smoothed loss function $L_{n}$ (following the line of work of Horowitz [1992]), an approach that has recently been adapted by Feng et al. [2019] in related problems, our goal in this section is to understand the behavior of the primal non-regular score estimator in high dimensions under minimal assumptions. In what follows, we therefore penalize the score function in a way that is amenable to a proper analysis and produces a sparse estimator with near-optimal rate and a desirable screening property. A smoothed estimator can possibly yield a better convergence rate along with computational benefits but will require substantially stronger assumptions on the model. Recall, for example, that the smoothed score estimator in the fixed $p$ setting as studied in Horowitz [1992] does converge at a faster than $n^{1/3}$ rate to a Gaussian limit, but the model assumptions required to make this work are significantly stronger than Manski’s original assumptions as well as those of Kim and Pollard (Kim et al. [1990]).

In what follows, we use the structural risk minimization (SRM) approach introduced in Vapnik and Chervonenkis [1974] for variable selection and estimation in this regime, which is closely related to $\ell_{0}$ -penalized risk minimization or the best subset selection problem. Briefly speaking, the SRM approach consists of the following steps:

1. Start with a large class of functions over which the loss function will be minimized.

2. Divide this class into nested subsets of increasing complexity, and find the empirical risk minimizer for each of these subsets.

3. Add a penalty (here denoted by pen) based on the complexity of the subclass to the minimum empirical risk for that subclass and return the classifier (and its corresponding subclass) with minimum penalized empirical risk.
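The three steps above can be sketched as a small loop for sparse linear classifiers. This is only an illustration under loudly stated assumptions: the empirical score is assumed to be $n^{-1}\sum_{i}Y_{i}\,\text{sgn}(X_{i}^{\top}\beta)$ with $Y_{i}\in\{-1,+1\}$ , the within-support maximization is a crude random search rather than an exact maximizer, and the penalty `K * m * log(e * p / m) / n` is a hypothetical stand-in chosen only because it grows with the complexity of the class. The function names are hypothetical as well.

```python
import itertools
import math
import random


def empirical_score(beta, X, y):
    """ASSUMED score: (1/n) * sum_i y_i * sgn(x_i' beta), with y_i in {-1, +1}."""
    total = 0
    for x, yi in zip(X, y):
        index = sum(b * c for b, c in zip(beta, x))
        total += yi if index >= 0 else -yi
    return total / len(y)


def erm_on_support(support, X, y, n_draws, rng):
    """Approximate score maximizer among unit vectors supported on `support` (random search)."""
    p = len(X[0])
    best_score, best_beta = -math.inf, None
    for _ in range(n_draws):
        coords = [rng.gauss(0.0, 1.0) for _ in support]
        norm = math.sqrt(sum(c * c for c in coords))
        beta = [0.0] * p
        for j, c in zip(support, coords):
            beta[j] = c / norm
        s = empirical_score(beta, X, y)
        if s > best_score:
            best_score, best_beta = s, beta
    return best_score, best_beta


def srm_select(X, y, max_size, K=0.5, n_draws=60, seed=0):
    """Return (criterion, m, beta) minimizing the penalized empirical risk over nested classes."""
    rng = random.Random(seed)
    n, p = len(y), len(X[0])
    class_best = (-math.inf, None)  # best (score, beta) over the nested class {||beta||_0 <= m}
    selected = None
    for m in range(1, max_size + 1):
        # Step 2: ERM over the m-th nested class (exhaustive over supports; tiny p only).
        for support in itertools.combinations(range(p), m):
            cand = erm_on_support(support, X, y, n_draws, rng)
            if cand[0] > class_best[0]:
                class_best = cand
        # Step 3: HYPOTHETICAL complexity penalty, not the penalty used in the paper.
        penalty = K * m * math.log(math.e * p / m) / n
        criterion = -class_best[0] + penalty
        if selected is None or criterion < selected[0]:
            selected = (criterion, m, class_best[1])
    return selected


# Toy usage: 5 covariates, 2 of them active, noise-free labels.
data_rng = random.Random(2)
beta0 = [0.6, 0.8, 0.0, 0.0, 0.0]
X = [[data_rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(100)]
y = [1 if sum(b * c for b, c in zip(beta0, x)) >= 0 else -1 for x in X]
criterion_hat, m_hat, beta_hat = srm_select(X, y, max_size=2)
```

Enumerating all supports is exponential in $p$ , so this loop is meant to make the structure of the procedure concrete rather than to be used in the $p\gg n$ regime.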

The first step generally ensures that there is no bias (or very low bias) in the estimation problem. If one starts with a large function class, it is more likely that the population minimizer will be close (if not identical) to the minimizer within the selected class. But though bias can be largely eliminated in this manner, the process of searching over a large function class incurs high variability and can lead to pessimistic convergence rates. Therefore, one needs to optimize the bias-variance trade-off, which happens over steps two and three. In step two, nested subsets are considered, hence the minimum value of the empirical risk keeps decreasing as the nesting (complexity) increases. The role of the penalty function is to stabilize the bias-variance trade-off and strike a balance between risk minimization and complexity. The nature of the penalty is typically related to the complexity of the class of functions (complex classes are penalized at higher levels) as well as to the structure of the problem. For parametrically specified classes, one may use the $\ell_{0},\ell_{1}$ or a more general $\ell_{p}$ norm of the parameter, or variants (e.g. Mallows’ $C_{p}$ , AIC, BIC) as a notion of complexity, or may resort to other notions like VC dimension (see e.g. Chapter 8 of Massart and the references therein).

We now describe the details of the implementation of our SRM based method. We start by articulating our assumption on the sparsity of the Bayes’ hyperplane:

Assumption 2.13 (Sparsity Assumption).

There exists $s_{0}$ with $\|\beta^{0}\|_{0}\leq s_{0}$ , where $s_{0}$ depends on $n,p$ in such a way that $\frac{s_{0}\log{p}}{n}\rightarrow 0$ as $n\rightarrow\infty$ .

Under the above assumption, it is reasonable to search among all models with sparsity (by which we mean the number of active coefficients) bounded by $C_{1}\lfloor n/\log{p}\rfloor$ for some universal constant $C_{1}$ . For mathematical simplicity, we take $C_{1}=1/4$ . Let $\mathscr{M}_{i}$ be the collection of all models with sparsity bounded by $i$ for $1\leq i\leq\lfloor n/\log{p}\rfloor$ , i.e.:

[TABLE]

Define $\mathscr{M}$ to be the collection of all admissible models, i.e.

[TABLE]

Also define:

1. $\hat{\beta}_{m}=\mathop{\rm argmin}_{\beta:\|\beta\|_{0}\leq m}\left[-S_{n}(\beta)\right]$

2. $\hat{m}=\mathop{\rm argmin}_{1\leq m\leq\lfloor n/\log{p}\rfloor}\left[-S_{n}(\hat{\beta}_{m})+\textbf{pen}(\mathscr{M}_{m})\right]$

3. $\beta^{0}_{m}=\mathop{\rm argmax}_{\beta:\|\beta\|_{0}\leq m}S(\beta)$

4. $V_{m}=$ VC dimension of the collection $\mathscr{M}_{m}$ . This is of the order $m\log(ep/m)$ .

By the SRM principle, the best possible estimate is given by $\hat{\beta}_{\hat{m}}$ . For the model collection $\mathscr{M}_{i}$ , we use the penalty

[TABLE]

where $K$ is some absolute constant. Up to a (negligible) logarithmic term, the penalty function is proportional to $V_{i}$ , the VC dimension of the model $\mathscr{M}_{i}$ , which captures the richness of this collection. The following theorem provides a finite sample concentration bound of our estimator:

Theorem 2.14.

Let $\hat{\beta}_{\hat{m}}$ and $\beta_{0}$ denote the penalized empirical minimizer and population minimizer of the binary choice model respectively. Then under Assumptions 2.2, 2.3 and 2.13, there exist constants $\Sigma,K_{1}>1,K_{2}$ (which are independent of $n,p,s_{0}$ ) such that for all $t\geq 1$ :

[TABLE]

where,

[TABLE]

and $s_{n}$ is a specific sequence of constants going down to 0 (with details available in the proof).

As a consequence of the exponential tail bound, one can establish the following upper bound on the minimax risk:

[TABLE]

where the supremum in the above display is taken over all distributions $P$ corresponding to binary response models satisfying Assumptions 2.2, 2.3 and 2.13 with some regression parameter $\beta\in\mathcal{S}^{p-1}$ (viewed as a functional of $P$ ) with $\ell_{0}$ norm bounded above by $s_{0}$ .

Remark 2.15.

A discussion similar to Remark 2.7 is in order. Here, the rate of convergence depends on four parameters $(n,p,C_{n},s_{0})$ . Assuming $C_{n}=1$ , the value of $r_{n}$ becomes:

[TABLE]

and recalling that $V_{s_{0}}\asymp s_{0}\log(ep/s_{0})$ ,

[TABLE]

where the last asymptotic equivalence, while not always true, nonetheless holds in most common scenarios. As an example, if we take $s_{0}=n^{\gamma},p=e^{n^{\delta}}$ for some $0<\gamma,\delta<1$ with $\gamma+\delta<1$ , the equivalence is valid. The condition $\gamma+\delta<1$ is forced by Assumption 2.13.
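For the example $s_{0}=n^{\gamma}$ , $p=e^{n^{\delta}}$ , the following short calculation sketches why $V_{s_{0}}\log(n/V_{s_{0}})$ may be replaced by $s_{0}\log{p}\log{n}$ , which is presumably the simplification in question (it matches the form of the rate quoted in Remark 2.17). It uses only $V_{s_{0}}\asymp s_{0}\log(ep/s_{0})$ and $\gamma+\delta<1$ .

```latex
\begin{align*}
V_{s_{0}} &\asymp s_{0}\log(ep/s_{0})
   = n^{\gamma}\left(1+n^{\delta}-\gamma\log n\right)
   \asymp n^{\gamma+\delta} = s_{0}\log p, \\
\log\left(n/V_{s_{0}}\right) &= (1-\gamma-\delta)\log n + O(1) \asymp \log n, \\
V_{s_{0}}\log\left(n/V_{s_{0}}\right) &\asymp s_{0}\,\log p\,\log n .
\end{align*}
```

The same parametrization gives $s_{0}\log{p}/n=n^{\gamma+\delta-1}$ , which tends to 0 exactly when $\gamma+\delta<1$ ; this is the sense in which the condition is forced by Assumption 2.13.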

It is immediate that the rate of convergence of our estimator cannot be faster than $n/\left(V_{s_{0}}\log{(n/V_{s_{0}})}\right)$ . As before, one can gain useful insights by ignoring the log-factors in the rate expression. Thus,

[TABLE]

and hence

[TABLE]

As in the slowly growing regime, this rate is also shown to be minimax optimal in Theorem 2.18. The last equality follows from the fact that, if $C_{n}\geq V_{s_{0}}/n$ , then $(n/V_{s_{0}}C_{n}^{2})^{1/3}$ is the minimum among the four terms, otherwise $(n/V_{s_{0}})$ is the minimum. This implies that the rate can be made faster by decreasing the value of $C_{n}$ , but cannot be improved beyond $(n/V_{s_{0}})$ (up to log factors).

As an immediate corollary of the above theorem, we establish that a superset of the true model will be selected with high probability under an appropriate beta-min condition:

Corollary 2.16.

Suppose the minimum non-zero absolute value of $\beta^{0}$ satisfies the following bound:

[TABLE]

with $r_{n}$ as in Theorem 2.14. Then under Assumptions 2.2, 2.3 and 2.13, we have:

[TABLE]

where $m_{0}$ is the true active set. This probability goes to 1 exponentially fast in $n$ .

Remark 2.17.

Note that for $C_{n}=C$ fixed, the lower bound on $\beta^{0}_{\min}$ is proportional to $\left((s_{0}\log{p}\log{n})/n\right)^{1/3}$ , which is the same as the rate of convergence of $\hat{\beta}_{\hat{m}}$ (see Theorem 2.14). This should be compared to the $\beta^{0}_{\min}$ condition derived from the $\ell_{2}$ convergence analysis in high dimensional linear regression, which is $(s_{0}\log{p}/n)^{1/2}$ . The slower convergence rate in this problem requires a more pronounced separation of the active coefficients of $\beta^{0}$ from the inactive ones in comparison to standard linear regression, to guarantee the screening property.

Our next result provides a lower bound on the minimax error rate.

Theorem 2.18.

We present our minimax lower bound result for the fast growth regime $p\gg n$ :

[TABLE]

for some constant $\tilde{K}_{L}>0$ not depending on $(n,p,s_{0})$ . For the case $C_{n}=C$ fixed, the lower bound is of the order of $\left(\frac{s_{0}\log{(p/s_{0})}}{n}\right)^{2/3}$ . The supremum is taken over the same class of distributions as in Theorem 2.14.

Remark 2.19.

As with the minimax lower bound proof in the moderate growth case, the proof of this theorem also relies on the construction of a sequence of competing models that approach one another, along with Fano’s inequality (Chapter 2 of Tsybakov [2009]). Similar to the moderate growth regime, by comparing the lower bound above to the rate of the score estimator in Remark 2.15, we find that the former is better only by a logarithmic factor, which suggests that the penalized maximum score estimator is almost minimax optimal.

3 Multinomial discrete choice model

The score function for the multinomial discrete choice model, as explained in Section 1, is given by:

[TABLE]

and the corresponding population version ${\mathbb{E}}S^{(mult)}_{n}(\beta)$ is given by:

[TABLE]

where $p_{j}(\mathbf{X})=p(j|\mathbf{X},\beta_{0})={\mathbb{P}}(Y_{i,j}=1|\mathbf{X}_{i}=\mathbf{X})$ and $m$ denotes the number of choices.

In what follows, $m$ and $p$ should be viewed as growing as functions of $n$ . The following proposition establishes that $\beta^{0}$ is indeed the unique maximizer of the population score function:

Proposition 3.1.

Under Assumption 1.1 we have, $\beta^{0}=\mathop{\rm argmax}_{\beta:\|\beta\|_{2}=1}S^{(mult)}(\beta)$ , and that the maximizer is unique.

Let $\hat{\beta}_{n}$ denote a maximizer of $S^{(mult)}_{n}(\beta)$ . This is a slight abuse of notation as $\hat{\beta}_{n}$ also is used to indicate the ERM estimator for the binary counterpart of this model. In this section, $\hat{\beta}_{n}$ will unambiguously denote the ERM estimator for the multinomial choice model. We state a set of further assumptions (which should be viewed as natural extensions of our assumptions for the binary case) to facilitate the asymptotic analysis of $\hat{\beta}_{n}$ .

Assumption 3.2 (Transition condition).

The multinomial choice model satisfies the modified transition condition uniformly for all pairs $(j,k)$ , i.e. there exists a constant $C>0$ (not depending on $n$ ) such that for every $(j,k)$ :

[TABLE]

for all $0\leq t\leq t^{*}$ . We assume $2t^{*}C>1$ for mathematical simplicity.

Assumption 3.3 (Restricted wedge assumption).

There exist constants $c_{1},c_{2},c_{3}>0$ and $R>0$ such that:

1. $p_{j}(\mathbf{X})\geq c_{1}>0$ for all $j\in\{1,2,\dots,m\}$ and all $\|\mathbf{X}\|_{F}\leq R$ , where $\|\cdot\|_{F}$ denotes the Frobenius norm. Here the constant $c_{1}$ depends on $m$ , while the radius of choice $R$ does not depend on the specific utility, but may or may not depend on $p$ .

2. For all pairs $(j,k)$ , ${\mathbb{P}}\left(\left\{\text{sgn}((\mathbf{x}_{j}-\mathbf{x}_{k})^{\prime}\beta)\neq\text{sgn}((\mathbf{x}_{j}-\mathbf{x}_{k})^{\prime}\beta_{0})\right\}\cap\{\|\mathbf{X}\|_{F}\leq R\}\right)\geq c_{2}\|\beta-\beta^{0}\|_{2}$ where $c_{2}>0$ does not depend on $n$ .

3. The effect of radius $R$ is asymptotically non-vanishing, i.e. for all pairs $(j,k)$ :

[TABLE]

and the constant $c_{3}$ does not depend on $n$ .

Remark 3.4.

Assumption 3.2 should be viewed as the multinomial version of Assumption 2.2. It quantifies the probability mass of the covariate space where the magnitude of the difference between $p_{j}(\mathbf{X})$ and $p_{k}(\mathbf{X})$ is small relative to their sum, in terms of a generic threshold $t$ . This is easily seen by noting that $|p_{j}(\mathbf{X})/(p_{j}(\mathbf{X})+p_{k}(\mathbf{X}))-1/2|=\left|p_{j}(\mathbf{X})-p_{k}(\mathbf{X})\right|/\left(2(p_{j}(\mathbf{X})+p_{k}(\mathbf{X}))\right)$ . The smaller this quantity, the harder it is to differentiate between utilities $k$ and $j$ . We note that for the multinomial problem we confine ourselves to a fixed $C$ (as opposed to a general sequence $C_{n}$ ) in our low-noise assumption, which allows a cleaner and less cumbersome presentation of our results. As in the binary case, the fixed $C$ assumption is statistically the most interesting version. The proof for a general $C_{n}$ that goes to 0 would work similarly as for fixed $C$ , except for the fact that we would now need to keep explicit track of the $C_{n}$ throughout the steps of the proof.

Remark 3.5.

It is clear that Assumption 3.3 is in similar vein to Assumption 2.3 for the binary response model, albeit somewhat more involved owing to the multinomial structure. Part (1) of Assumption 3.3 postulates a ball of radius $R$ around the origin in $\mathbb{R}^{m\times p}$ where the probability of choosing any specific utility given $\mathbf{X}$ is bounded away from 0: i.e., every alternative can be chosen with non-negligible probability, or in other words, all utilities are competitive. Part (2) of the assumption resembles Assumption 2.3 exactly, modulo the fact that we are now interested in the wedge-shaped region within the ball of radius $R$ . This is because part (1) of the Assumption restricts the main action to the ball of radius $R$ where the probability of choosing any item is non-negligible. Part (3) of the Assumption ensures that, the probability of the wedge-shaped region intersected with a ball of radius $R$ is not negligible with respect to the probability of the entire wedge-shaped region. In other words, the region of primary action is non-ignorable with respect to the entire region. This assumption helps us establish an upper bound on the variability of the empirical process relevant to our analysis of the concentration bounds for the estimator.

Remark 3.6.

The results in this section presented below can also be derived by taking $R=\infty$ in Part (1) of Assumption 3.3. In this case, Part (1) becomes stronger as we now assume the lower bound on the conditional probabilities of choosing utilities for all $\mathbf{X}$ . On the other hand, Part (2) of the assumption is weakened: if the lower bound in Part (2) holds for finite $R$ , it holds for $R=\infty$ . Part (3) of the assumption is trivially satisfied for $R=\infty$ with $c_{3}=1$ .

Remark 3.7.

The assumption that the conditional probability $p_{j}(\mathbf{X})$ of each utility is bounded away from 0 (Part (1) of Assumption 3.3) can be easily relaxed. For example one may assume that $p_{j}(\mathbf{X})\vee p_{k}(\mathbf{X})\geq c_{1}$ for all $\|X\|_{F}\leq R$ for all $1\leq j\neq k\leq m$ without disturbing any of our calculations. Indeed, an inspection of the proof of Proposition 3.8 shows that what we crucially require to establish the curvature of $S^{(mult)}(\beta)-S^{(mult)}(\beta_{0})$ is a lower bound on $p_{j}(\mathbf{X})+p_{k}(\mathbf{X})$ for $\|X\|_{F}\leq R$ , and this is obviously true under the relaxed assumption. For the binary choice model $m=2$ this weaker assumption is automatic: $p_{1}(X)\vee p_{2}(X)\geq 1/2$ for all $X$ , so that we can clearly take $R=\infty$ and Assumption 3.3 boils down to Assumption 2.3.

The following Proposition (similar to Proposition 2.4) establishes a lower bound on the excess population risk:

Proposition 3.8.

Under Assumptions 1.1, 3.2 and 3.3, we have the following curvature condition for multinomial choice model:

[TABLE]

for all $\beta\in S^{p-1}$ , where $c_{1},c_{2},C,t^{*}$ are the same constants as in Assumptions 3.2 and 3.3.

The proof of the above proposition is conceptually similar to that of Proposition 2.4, as it relies on relating $S^{(mult)}(\beta_{0})-S^{(mult)}(\beta)$ to the average of the probabilities of truncated wedge-shaped regions for all possible pairs of $m$ utilities. Note that for $m=2$ , this corresponds to a single wedge-shaped region. The average probability is then bounded using Part (2) of Assumption 3.3 to conclude the proof.

Theorem 3.9 (When $p/n\rightarrow 0$ ).

If $nc_{1}^{2}/(m^{2}p)\to\infty$ , then under Assumptions 1.1, 3.2 and 3.3, we have:

[TABLE]

for all $y\geq 1$ , where

[TABLE]

and $c_{1}$ is the same constant defined in Assumption 3.3, $K$ is some constant which does not depend on $n$ .

Remark 3.10.

It is instructive to relate this theorem with its counterpart for the binary choice model, Theorem 2.6. For the case $m=2$ , it is clear from Remark 3.7 that we can always take $c_{1}=1/2$ , and the rate of convergence becomes $\left(n/(p\log{(n/p)})\right)^{1/3}$ , which is identical to that from Theorem 2.6 when $C_{n}$ is fixed (see Remark 2.7). The additional term in the current rate, viz. $(c_{1}/m)^{2}$ , can be viewed as an adjustment for the number of utilities. Notice that $c_{1}$ itself depends non-trivially on $m$ : as $\sum_{j=1}^{m}p_{j}(\mathbf{X})=1$ , we need $c_{1}\leq 1/m$ for Part 1 of Assumption 3.3 to make sense.

Next, we present our result for the fast growth regime, i.e. when $p\gg n$ . As before, we require a sparsity assumption for the identification of the model. Our following assumption encodes the rate at which we can allow the sparsity to grow for our asymptotic analysis:

Assumption 3.11 (Sparsity condition for Multinomial model).

Under the fast growth regime, i.e. when $p\gg n$ , we assume that there exists $s_{0}$ with $\|\beta^{0}\|_{0}=s_{0}$ which satisfies:

[TABLE]

as $n\rightarrow\infty$ .

This assumption is identical in spirit to Assumption 2.13. The only difference is that now both $m$ and $c_{1}$ play a role in determining the permissible rate of sparsity of the true vector $\beta^{0}$ . As mentioned before, the factor $m^{2}$ appears due to pairwise comparison of utilities and the factor $c_{1}$ relates to the curvature condition established in Proposition 3.8.

Theorem 3.12 (When $p\gg n$ ).

Under Assumptions 1.1, 3.2 and 3.3, there exist constants $\Sigma>0,K_{1}>1,K_{2}>0$ (not depending on $n$ ) such that for all $y\geq 0$ :

[TABLE]

where,

[TABLE]

and $s_{n}\to 0$ as $n\to\infty$ .

Remark 3.13.

Recall from Remark 3.10 that, when $m=2$ , one can always take $c_{1}=1/2$ . The rate of convergence then becomes $\left(n/(s_{0}\log{(ep/s_{0})}\log{(n/s_{0}\log{(ep/s_{0})})})\right)^{1/3}$ (similar to the rate obtained in Theorem 2.14 when $C_{n}$ is fixed), which can be further simplified to $\left(n/\left(s_{0}\log{p}\log{n}\right)\right)^{1/3}$ under the specific choices of $s_{0},p$ taken in Remark 2.15. As in the moderate growth regime, the additional term $(c_{1}/m)^{2}$ in $r_{n}$ above is the adjustment for the growing number of utilities.

4 Computational Aspects

In this section we investigate a number of procedures employed for estimating $\beta^{0}$ in the binary choice model and compare their performance. Specifically, we consider the following three methods:

1. Logistic regression.

2. Support Vector Machine.

3. Homotopy path-following framework adapted in Feng et al. [2019].

We divide our simulation studies into two sections: the slowly growing regime, i.e. $p/n\rightarrow 0$ , and the fast growing regime, i.e. $p\gg n$ . The algorithm based on the homotopy path-following framework described in Feng et al. [2019] is tailored to the scenario $p\gg n$ , hence we only compare SVM and logistic regression under the slowly growing regime, and all three methods (with the $\ell_{1}$ penalized versions of SVM (see Zhu et al. [2004]) and logistic regression) when $p\gg n$ . Our primary data generation mechanism is common to both regimes (with slight changes in the $p\gg n$ case to accommodate sparsity considerations) and is described below:

1. Generate the true $\beta^{0}$ : For the regime $p/n\rightarrow 0$ , each entry of $\beta^{0}$ is generated from the $\text{Unif}(1,2)$ distribution and then normalized to make its $\ell_{2}$ norm 1. For the regime $p\gg n$ , each of the $s_{0}$ active entries is generated randomly from $\text{Unif}(2,3)$ and then normalized to keep $\|\beta^{0}\|_{2}=1$ . This $\beta^{0}$ remains fixed over all Monte Carlo iterations.

2. Generate $X_{1},\dots,X_{n}$ from $\mathcal{N}(0,\Sigma)$ where the dispersion matrix $\Sigma$ has the following form:

[TABLE]

where we take $\rho=0.5$ for our simulations.

3. Generate the covariate-dependent errors $\epsilon_{i}$ as follows: for $1\leq i\leq n$ ,

[TABLE]

where $\sigma^{2}_{i}=1\vee|X_{i}^{\top}\beta^{0}|$ .

4. Finally, set $Y_{i}=\text{sgn}(X_{i}^{\top}\beta^{0}+\epsilon_{i})$ for all $1\leq i\leq n$ (or one can set $Y_{i}=\mathds{1}(X_{i}^{\top}\beta^{0}+\epsilon_{i}\geq 0)$ , depending on how one wants to encode the binary variable).

The idea behind this model is that the data close to the boundary are more informative than the data far away. Note also that when the variability of the error near the boundary is low, estimation is a relatively easy task. Hence, to challenge the existing methods, we assume non-negligible error variance near the boundary. For the simulation setting above, for a point $X$ near the boundary, i.e. $X^{\top}\beta^{0}\approx 0$ , the (conditional) variance of the error is $\approx 1$ , and as one moves to points away from the boundary, the (conditional) variability of the response increases depending on their distance from the true hyperplane.
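The data-generating steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the code used for the reported simulations: the covariate sampler is passed in by the caller (the form of $\Sigma$ is not reproduced here, and the example uses independent standard normal covariates as a placeholder), the errors are taken to be Gaussian with the stated variance $\sigma_{i}^{2}$ , and the $s_{0}$ active coordinates are placed first. All function names are hypothetical.

```python
import math
import random


def draw_beta0(p, s0=None, rng=random):
    """Unit-norm beta0: dense Unif(1,2) entries, or s0 active Unif(2,3) entries.

    ASSUMPTION: the active coordinates are the first s0 ones.
    """
    if s0 is None:
        beta = [rng.uniform(1.0, 2.0) for _ in range(p)]
    else:
        beta = [rng.uniform(2.0, 3.0) if j < s0 else 0.0 for j in range(p)]
    norm = math.sqrt(sum(b * b for b in beta))
    return [b / norm for b in beta]


def simulate(n, beta0, draw_x, rng=random):
    """Return n pairs (x, y) with y = sgn(x'beta0 + eps) coded in {-1, +1}."""
    data = []
    for _ in range(n):
        x = draw_x()  # caller supplies the covariate distribution
        lin = sum(a * b for a, b in zip(x, beta0))
        sigma = math.sqrt(max(1.0, abs(lin)))  # sigma_i^2 = 1 v |x'beta0|
        eps = rng.gauss(0.0, sigma)  # ASSUMPTION: Gaussian errors
        data.append((x, 1 if lin + eps >= 0 else -1))
    return data


# Placeholder covariates: independent N(0, 1) entries (not the paper's Sigma).
rng = random.Random(0)
beta0 = draw_beta0(5, rng=rng)
sample = simulate(100, beta0, lambda: [rng.gauss(0.0, 1.0) for _ in range(5)], rng=rng)
```

Passing the covariate sampler as an argument keeps the error mechanism, which is the part described in the text, separate from the choice of $\Sigma$ .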

4.1 Estimation error when $p=o(n)$

We explore three different growth patterns of $p$ relative to $n$ :

[TABLE]

where $\lfloor x\rfloor$ is the floor function. The sample size $n$ ranges as: $n=12000,14000,16000,18000,20000$ .

Consider first the performance of SVM on the generated data. Below are three density plots of the scaled estimation error $(n/p)^{1/3}\|\hat{\beta}-\beta^{0}\|_{2}$ based on 500 Monte Carlo iterations:

As is evident from the density plots above, the distribution of the normalized errors is quite stable across the different values of $n$ , suggesting that the SVM method gives a decent approximation to the actual score estimator. We next apply simple logistic regression for estimating $\beta^{0}$ . As we expect, this does not perform as well as SVM owing to model mis-specification. We study logistic regression because it is typically the bread-and-butter option for binary response regression but, as we see below, it is quite suspect in this situation. Below are the plots from logistic regression:

It is quite clear from the plots that the scaled error is not converging with $n$ and behaves in a rather erratic manner, and the SVM algorithm is markedly superior. Further investigation into SVM based algorithms in this and similar models can constitute a potentially interesting topic for future research.
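The scaled error plotted in this subsection can be computed as in the following sketch. Since $\beta^{0}$ has unit norm and a linear classifier is only identified up to scale, the fitted coefficient vector is rescaled to the unit sphere before comparison; this rescaling step and the function name are our assumptions about the implementation.

```python
import math


def scaled_l2_error(beta_hat, beta0, n):
    """(n/p)^(1/3) * || beta_hat / ||beta_hat|| - beta0 ||_2 for a unit-norm beta0."""
    p = len(beta0)
    norm = math.sqrt(sum(b * b for b in beta_hat))
    unit = [b / norm for b in beta_hat]  # ASSUMPTION: project onto the unit sphere
    err = math.sqrt(sum((u - b) ** 2 for u, b in zip(unit, beta0)))
    return (n / p) ** (1.0 / 3.0) * err
```

If the scaled error has a stable distribution as $n$ grows, the unscaled error is of order $(p/n)^{1/3}$ , which is how the density plots are read in the text.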

4.2 Model selection and estimation when $p\gg n$

As mentioned in the Introduction, our problem can be viewed as a binary classification problem with a linear Bayes’ classifier. Under the sparsity assumption, only a few covariates contribute to the classification. To identify these covariates, we resort to a penalized classification approach. We here employ three methods:

1. $\ell_{1}$ penalized SVM.

2. $\ell_{1}$ penalized logistic regression.

3. Homotopy path-following framework adapted in Feng et al. [2019].

Recall that the data generating mechanism has already been described. We take $n=2000,p=10000$ for our simulations and use five different values of $s_{0}=10,20,30,40,50$ . Our goal is to investigate how the performance of the classifier changes as we increase $s_{0}$ keeping $n$ and $p$ fixed. The penalty parameters for logistic regression and SVM are selected using grid search and two-fold cross-validation. We also implement the algorithm based on homotopy path-following framework adapted in Feng et al. [2019]. We assess the performances of these three approaches based on the following discrepancy measures:

1. Misclassification error.

2. Norm difference between $\hat{\beta}$ and $\beta^{0}$ .

3. No. of true active variables not selected (denoted by Type 2 error).

4. No. of true null variables selected (denoted by Type 1 error).
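The selection and classification measures listed above can be computed as follows. This is a small sketch with hypothetical function names; it reports the two selection errors as counts, a coordinate is treated as selected when its fitted coefficient exceeds `tol` in absolute value, and a score of exactly zero is classified as $+1$ (both conventions are our assumptions).

```python
def selection_errors(beta_hat, beta0, tol=0.0):
    """Return (type1, type2): null variables selected, active variables missed."""
    selected = {j for j, b in enumerate(beta_hat) if abs(b) > tol}
    active = {j for j, b in enumerate(beta0) if b != 0}
    type1 = len(selected - active)  # true nulls declared active
    type2 = len(active - selected)  # true actives not selected
    return type1, type2


def misclassification_rate(beta_hat, xs, ys):
    """Fraction of labels in {-1, +1} not matched by sgn(x'beta_hat)."""
    wrong = 0
    for x, y in zip(xs, ys):
        score = sum(a * b for a, b in zip(x, beta_hat))
        pred = 1 if score >= 0 else -1  # ASSUMPTION: sgn(0) = +1
        wrong += pred != y
    return wrong / len(ys)
```

Dividing `type1` and `type2` by the number of null and active variables, respectively, gives the corresponding proportions.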

The following plots provide a visual representation of the comparisons for different values of sparsity and across the three methods.

It is clear that logistic regression is generally outperformed by at least one of the other methods in this model. The performance of SVM and the algorithm proposed in Feng et al. [2019] are generally on par, though, by visual inspection, the latter seems superior. For example, in terms of misclassification error and norm difference, their performance is similar; in some cases the algorithm of Feng et al. [2019] performs better than SVM, while in other cases SVM wins marginally. Type 2 error (proportion of true active variables missed by the method) is generally lower for the algorithm in Feng et al. [2019] when compared to SVM, whilst Type 1 error (proportion of null variables declared active) is generally higher: SVM is more conservative in terms of selecting variables.

Depending on one’s priorities, one may weigh Type 1 and Type 2 errors differently to generate a weighted misclassification error. In the absence of any such information, it is natural to assign equal weights, which leads to the sum of these two errors, as shown in the following plot:

We find that the algorithm proposed in Feng et al. [2019] performs better under this metric, especially for large $s_{0}$ , which is explained by the tendency of SVM to select too few active variables.

Based on our study, it appears that one is better off with the algorithm proposed in Feng et al. [2019] in the $p\gg n$ scenario, though, of course, much larger scale simulations would be necessary to make any general recommendations. As the focus of our paper is largely theoretical, we do not develop these studies any further, but note that a thorough investigation of computationally feasible methods in this and related problems involving optimization of discontinuous functions, along with analytical assessments of their performance, constitutes an open direction of research.

5 Concluding Discussion

We close with a discussion of various aspects of the high-dimensional binary choice model and our approach to the problem.

5.1 Exploring and relaxing our assumptions:

It is of interest to investigate sufficient conditions under which Assumption 2.3 and Assumption 2.8 hold. We show in Lemma B.1 in the supplement that Assumption 2.3 and part (i) of Assumption 2.8 hold simultaneously when $X$ arises from an elliptically symmetric distribution centered at 0, under some restrictions on the minimum and maximum eigenvalues of its orientation matrix. Part (ii) of Assumption 2.8 also holds for elliptically symmetric distributions centered at 0, but under some further mild conditions, as demonstrated in Lemma B.2.

5.2 Model with intercept:

Our treatment thus far has considered a model of the form $Y^{\star}=X^{\top}\beta^{0}+\epsilon$ for a random $X$ . However, many practical scenarios necessitate the inclusion of an intercept term, where the term $X^{\top}\beta^{0}$ is replaced by $(1,X^{\top})(\tau^{0},(\beta^{0})^{\top})^{\top}$ . Assumption 2.3 then naturally generalizes to

[TABLE]

However, we cannot expect this to be satisfied for all $(\tau,\beta)$ with $\|\beta\|=1$ when $\tau$ varies in an unconstrained manner. Consider for example the case that $X\sim N(0,I_{p})$ so that $X^{\top}\beta$ and $X^{\top}\beta^{0}$ are both standard normal. In this case, when $\tau$ and $\tau^{0}$ are very large, the signs of $\tau^{0}+X^{\top}\beta^{0}$ and $\tau+X^{\top}\beta$ are primarily driven by the magnitudes of $\tau$ and $\tau^{0}$ , so that if these two parameters have the same sign, the probability of the signs being different can be made as small as one pleases depending on the magnitudes of the $\tau$ ’s. This entails controlling the magnitudes of the $\tau$ ’s relative to the $\beta$ ’s; in particular, if the absolute magnitudes of the $\tau$ ’s are kept bounded, a restricted version of Assumption 2.3, in the sense that the inequality in Assumption 2.3 is fulfilled for all $\beta$ sufficiently close to $\beta^{0}$ , is indeed verifiable for certain families of distributions, including elliptically symmetric $X$ centered at the origin, as well as $X$ ’s with independent components where each component has a symmetric log-concave density with mode at 0. The $\ell_{2}$ convergence and minimax lower bound results established in this paper continue to hold, but to accommodate the restricted version of this assumption, the proofs presented in the supplement need to be slightly modified. A detailed and rigorous discussion of such models with intercept is available in Section C of the supplement.

5.3 Asymptotic distribution

In their seminal paper, Kim and Pollard (Kim et al. [1990]) proved that for fixed $p$ , $n^{1/3}(\hat{\beta}-\beta^{0})$ converges in distribution to the maximizer of a Gaussian process with quadratic drift. Our treatment of the binary choice model should be contrasted with their approach: while they assumed the continuous differentiability of both the density of $X$ and $\eta(x)=P(Y=1|X=x)$ and a compact support for $X$ , we have made no such assumptions. We have tackled those aspects of this problem from the classification point of view, with assumptions on the growth of $P(Y=1|X=x)$ near the Bayes hyperplane and, in addition, conditions on the distribution of $X$ to ensure that sufficiently many observations are available around the Bayes hyperplane. As far as the asymptotic distribution of the score estimator in growing dimensions (or functionals thereof) is concerned, this is, in itself, a mathematically formidable problem, well outside the scope of this paper. Based on what we know in the fixed $p$ setting, the forms of such distributions are likely to be extremely complicated. The question remains whether tractable asymptotic distributions for making inference on components of $\beta^{0}$ in the growing $p$ setting could be obtained for smoothed versions of the score estimator, in the spirit of Horowitz [1992]. This is likely to be an interesting but challenging avenue for future research on this subject.

6 Selected Proofs

6.1 Proof of Theorem 2.9

We generate $a_{n}(8n/p)^{\frac{p-1}{3}}$ points uniformly from the surface of the sphere (where $a_{n}\uparrow\infty$ will be chosen later), maximize the empirical score function $S_{n}(\beta)$ over these selected points and show that the maximizer achieves the desired rate. Define $T_{n}=a_{n}(8n/p)^{(p-1)/3}$ and $E_{n}$ to be the collection of $T_{n}$ points generated uniformly.
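The randomized construction just described can be sketched as follows. This is an illustration of the idea, not an implementation used in the paper: the number of candidate points is left as a parameter (the value $T_{n}$ in the proof is far too large to use in practice), $S_{n}(\beta)$ is taken to be $\frac{1}{n}\sum_{i}Y_{i}\,\text{sgn}(X_{i}^{\top}\beta)$ with labels in $\{-1,+1\}$ , and $\text{sgn}(0)$ is set to $+1$ . Function names are hypothetical.

```python
import math
import random


def empirical_score(beta, xs, ys):
    """S_n(beta) = (1/n) * sum_i y_i * sgn(x_i'beta), labels in {-1, +1}."""
    total = 0
    for x, y in zip(xs, ys):
        s = sum(a * b for a, b in zip(x, beta))
        total += y if s >= 0 else -y  # ASSUMPTION: sgn(0) = +1
    return total / len(ys)


def uniform_on_sphere(p, rng):
    """A normalized standard Gaussian vector is uniform on the unit sphere."""
    z = [rng.gauss(0.0, 1.0) for _ in range(p)]
    norm = math.sqrt(sum(v * v for v in z))
    return [v / norm for v in z]


def random_search_max_score(xs, ys, n_candidates, rng):
    """Maximize S_n over n_candidates points drawn uniformly from the sphere."""
    p = len(xs[0])
    best, best_score = None, -math.inf
    for _ in range(n_candidates):
        beta = uniform_on_sphere(p, rng)
        score = empirical_score(beta, xs, ys)
        if score > best_score:
            best, best_score = beta, score
    return best, best_score
```

The search is only practical for small $p$ , since the number of points needed to cover the sphere at resolution $(p/n)^{1/3}$ grows exponentially in $p$ .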

We start with the following technical lemma that plays a key role in the proof.

Lemma 6.1.

Suppose $D(x,r)$ denotes a spherical cap around $x$ of radius $r$ , i.e.

[TABLE]

Then we have

[TABLE]

for $0\leq r\leq 1$ and $p\geq 8$ , where $\sigma$ is the uniform measure on the sphere, i.e. the proportion of the surface of the spherical cap to the surface area of the sphere.

For a brief discussion on this Lemma, see section B.11. The next lemma shows that we can find at least one point in our collection which is within a distance of $(p/n)^{1/3}$ of $\beta_{0}$ with probability $\uparrow 1$ .

Lemma 6.2.

Let $\Omega_{-1,n}$ denote the event that there exists at least one $\beta^{\prime}\in E_{n}$ such that $\|\beta^{\prime}-\beta^{0}\|_{2}\leq(p/n)^{1/3}$ . Then $P(\Omega_{-1,n})\rightarrow 1$ .

Proof.

Using Lemma 6.1 we have the following bound:

[TABLE]

∎

Let $\tilde{\beta}$ denote the point of $E_{n}$ closest to $\beta^{0}$ . On $\Omega_{-1,n}$ , $\|\tilde{\beta}-\beta^{0}\|\leq(p/n)^{1/3}$ . To establish the convergence rate, we will use a specific version of the shelling argument. Fix $T>0$ , sufficiently large. (In fact, as we work our way through the proof we will keep increasing the value of $T$ as and when necessary, but as this will be done finitely many times, it won’t have a bearing on the rate of convergence.) Consider shells $C_{i}$ around the true parameter $\beta^{0}$ , where

[TABLE]

with $r_{i}=T(p/n)^{1/3}2^{i}$ , for $i=0,1,\dots,A_{n}$ and $A_{n}\overset{\Delta}{=}\frac{1}{3}\log_{2}{(n/p)}-\log_{2}{T}$ . We will compute an upper bound on the number of elements of $B_{i}=E_{n}\cap C_{i}$ for all $i\in\{0,1,\cdots,A_{n}\}$ .
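As a quick check on the choice of $A_{n}$ , the radii can be worked out explicitly. Ignoring integer rounding of $A_{n}$ (an assumption made here for the illustration), the largest radius is exactly 1 and every earlier radius is at most $1/2$ :

```latex
2^{A_n} = 2^{\frac{1}{3}\log_2(n/p) - \log_2 T} = \frac{(n/p)^{1/3}}{T},
\qquad
r_{A_n} = T\,(p/n)^{1/3}\,2^{A_n} = 1,
\qquad
r_i = 2^{\,i-A_n} \le \tfrac{1}{2} \quad (0 \le i \le A_n - 1).
```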

Lemma 6.3.

For all $i\in\{0,1,\cdots,A_{n}\}$ ,

[TABLE]

with exponentially high probability where $p_{i}=\sigma(D(\beta^{0},r_{i}))$ .

Proof.

Let $N_{i}$ denote the number of points in $E_{n}\cap B_{i}$ . Then $N_{i}\sim\text{Bin}(T_{n},p_{i})$ where $p_{i}=\sigma(D(\beta^{0},r_{i}))$ . For $i=A_{n}$ , $p_{i}=1$ , so $P(N_{i}>2T_{n}p_{i})=0$ . Hence we confine ourselves to the case $i\in\{0,1,\cdots,A_{n}-1\}$ . In this case, $r_{i}\leq 1$ and hence from Lemma 6.1 we have $p_{i}\leq\frac{1}{2\sqrt{2}}<\frac{1}{2}$ . From the Chernoff tail bound for the Binomial distribution we have, for each $i$ : $P(N_{i}>2T_{n}p_{i})\leq\exp(-T_{n}D(2p_{i}||p_{i}))$ , where

[TABLE]

is the Kullback-Leibler divergence between Bernoulli $(p_{i})$ and Bernoulli $(2p_{i})$ . This can be lower bounded thus:

[TABLE]

Using this lower bound we have:

[TABLE]

∎
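The displayed bounds in the proof above are not reproduced here, but the factor $(\log{4}-1)$ that appears in the proof of Lemma 6.4 is consistent with the inequality $D(2p\,||\,p)\geq p(\log{4}-1)$ for $0<p<1/2$ , which follows from $\log(1-x)\geq -x/(1-x)$ . The following sketch checks this inequality numerically; it is our illustration of one such bound, not necessarily the exact display used in the proof, and the function names are ours.

```python
import math


def kl_bernoulli(a, b):
    """KL( Ber(a) || Ber(b) ) in nats, for 0 < b < 1 and 0 <= a <= 1."""
    def term(x, y):
        return 0.0 if x == 0 else x * math.log(x / y)
    return term(a, b) + term(1.0 - a, 1.0 - b)


def chernoff_exponent_gap(p):
    """D(2p || p) - p * (log 4 - 1), expected to be nonnegative for 0 < p < 1/2."""
    return kl_bernoulli(2.0 * p, p) - p * (math.log(4.0) - 1.0)


# Smallest gap over a grid of p values in (0, 1/2).
min_gap = min(chernoff_exponent_gap(k / 1000.0) for k in range(1, 500))
```

With this bound the Chernoff estimate becomes $P(N_{i}>2T_{n}p_{i})\leq\exp(-T_{n}p_{i}(\log{4}-1))$ .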

Define $\Omega_{i,n}=\{N_{i}\leq 2T_{n}p_{i}\}$ for $i\in\{0,1,\cdots,A_{n}-1\}$ and let $\Omega_{n}=\cap_{i=-1}^{A_{n}-1}\Omega_{i,n}$ . The following lemma says that the event $\Omega_{n}$ happens with high probability:

Lemma 6.4.

For any $T>1$ , ${\mathbb{P}}(\Omega_{n})\rightarrow 1$ as $n\rightarrow\infty$ .

Proof.

It is enough to show that $\sum_{i=-1}^{A_{n}-1}{\mathbb{P}}(\Omega_{i,n}^{c})\rightarrow 0$ as $n\rightarrow\infty$ . We have already established in Lemma 6.2 that ${\mathbb{P}}(\Omega_{-1,n}^{c})\rightarrow 0$ as $n\rightarrow\infty$ . Using Lemma 6.3:

[TABLE]

Now for any fixed $n$ , the maximum term is attained when $i=0$ , i.e. $e^{\left(-a_{n}\frac{1}{2}(T)^{p-1}(\log{4}-1)\right)}$ , which goes to 0 for $T>1$ . Furthermore, the series under consideration is easily dominated by $\sum_{i=1}^{\infty}\,e^{-k2^{i}}$ for some constant $k>0$ , which is clearly finite. Hence the series on the right side of the above display goes to 0 with increasing $n$ . ∎

The rest of the analysis will be done conditioning on the event $\Omega_{n}$ . Define ${\mathbb{P}}_{n}(A)={\mathbb{P}}(A\ |\ \Omega_{n})$ . Then we have:

[TABLE]

Since ${\mathbb{P}}(\Omega_{n}^{c})\rightarrow 0$ as $n\rightarrow\infty$ , we omit this term henceforth. Next, we analyze a general summand. Define $Z_{i}(\beta)=Y_{i}\text{sgn}(X_{i}^{\top}\beta)-Y_{i}\text{sgn}(X_{i}^{\top}\tilde{\beta})$ . Then $\frac{1}{n}\sum_{i=1}^{n}Z_{i}(\beta)=S_{n}(\beta)-S_{n}(\tilde{\beta})$ and $Z_{i}(\beta)$ assumes values $\{-2,0,2\}$ . Also, ${\mathbb{E}}(Z_{i}(\beta))=S(\beta)-S(\tilde{\beta})$ . Using Proposition 2.4 and Assumption 2.8 we have:

[TABLE]

for $T>\sqrt{(2u_{+})/u_{-}}$ . This implies $Z_{i}(\beta)$ has high probability of being negative. We exploit this to prove the concentration. To simplify the calculations, define $\{Y_{i}(\beta)\}_{i=1}^{n}$ to be a collection of independent random variables with

[TABLE]

Hence the expectation of $Y_{i}(\beta)$ is:

[TABLE]

For the rest of the calculations we need to bound ${\mathbb{P}}(\text{sgn}(X^{\top}\beta)\neq\text{sgn}(X^{\top}\tilde{\beta}))$ . Towards that direction we have the following:

[TABLE]

for $T>1$ . For the lower bound we have:

[TABLE]

when $T>2C^{\prime}/c^{\prime}$ , where we are also using the fact that $\beta\in B_{0}^{c}$ . Putting the upper bound in equation (6.1) we have:

[TABLE]

So, if $P(Y_{i}(\beta)=2)=P(Z_{i}(\beta)=2\ |\ Z_{i}(\beta)\neq 0)=p_{2}$ (say), then $4p_{2}-2\leq-(u_{-}/4C^{\prime})\|\beta-\beta^{0}\|_{2}$ , which implies $p_{2}\leq\frac{1}{2}-(u_{-}/16C^{\prime})\|\beta-\beta^{0}\|_{2}$ . Define $W_{i}(\beta)=\frac{Y_{i}(\beta)+2}{4}$ . Then $W_{i}(\beta)\sim\text{Ber}(p_{2})$ . Let $N$ denote the number of non-zero $Z_{i}(\beta)$ ’s. Then $N\sim\text{Bin}(n,p_{1})$ where $p_{1}={\mathbb{P}}(\text{sgn}(X^{\top}\beta)\neq\text{sgn}(X^{\top}\tilde{\beta}))$ .

[TABLE]

Thus we can take $a_{n}=p$ to ignore the effect of $(\log{a_{n}}/p)$ . Putting this back in equation (1) we get:

[TABLE]

for large enough $T$ (by using similar arguments to the one used for handling the earlier series) which proves the theorem.

6.2 Proof of Theorem 2.18

We use Fano’s inequality along with the Gilbert-Varshamov Lemma to prove the minimax lower bound. Fano’s inequality (or Local-Fano’s inequality) gives us a lower bound on the minimax risk as follows: If $\Theta^{\prime}\subseteq\Theta$ is a finite $2\epsilon$ packing set, i.e. for any two $\theta_{i},\theta_{j}\in\Theta^{\prime}$ , $\|\theta_{i}-\theta_{j}\|_{2}\geq 2\epsilon$ with $|\Theta^{\prime}|=M$ , then, based on $n$ i.i.d. samples $z_{1},z_{2},\dots,z_{n}\sim P_{\theta}$ we have the following minimax lower bound:

[TABLE]

The crux of the proof relies on constructing competing models that approach each other at an optimal rate, as $n$ increases. We start with a preliminary lemma.

Lemma 6.5.

If $P\sim Ber(p_{1})$ and $Q\sim Ber(q_{1})$ and if $\frac{1}{4}\leq q_{1}\leq\frac{3}{4}$ , then $KL(P||Q)\leq\frac{16}{3}(p_{1}-q_{1})^{2}$ .
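One standard route to a bound of this form goes through the chi-square divergence, using $\log x\leq x-1$ . We include it as a sketch; the proof in the supplement may proceed differently.

```latex
\begin{aligned}
KL(P\|Q) &\le \chi^{2}(P\|Q)
 = \frac{(p_{1}-q_{1})^{2}}{q_{1}} + \frac{(p_{1}-q_{1})^{2}}{1-q_{1}}
 = \frac{(p_{1}-q_{1})^{2}}{q_{1}(1-q_{1})}, \\
q_{1}(1-q_{1}) &\ge \tfrac{1}{4}\cdot\tfrac{3}{4} = \tfrac{3}{16}
 \quad\text{for } \tfrac{1}{4} \le q_{1} \le \tfrac{3}{4}
 \;\Longrightarrow\;
KL(P\|Q) \le \tfrac{16}{3}\,(p_{1}-q_{1})^{2}.
\end{aligned}
```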

The proof of the Lemma appears in Section B.13 of the supplement. We next state the Gilbert-Varshamov Lemma for convenience (see Raskutti et al. [2011] and references therein), which guides the construction of $\Theta^{\prime}$ in our problem.

Lemma 6.6 (Gilbert-Varshamov).

Define $d_{H}$ to be the Hamming distance, i.e. $d_{H}(x,y)=\sum_{i=1}^{d}\mathds{1}_{x_{i}\neq y_{i}}$ with $d$ being the underlying dimension. Given any $s$ with $1\leq s\leq\frac{d}{8}$ , we can find $w_{1},\cdots,w_{M}\in\{0,1\}^{d}$ such that:

a) $d_{H}(w_{i},w_{j})\geq\frac{s}{2}\,\,\forall\,\,i\neq j\in\{1,2,\cdots,M\}$ .

b) $\log{M}\geq\frac{s}{8}\log{\left(1+\frac{d}{2s}\right)}$ .

c) $\|w_{j}\|_{0}=s\ \forall j\ \in\{1,2,\cdots,M\}$ .
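The three properties of the lemma can be observed on a small example. The sketch below builds a packing by a simple randomized greedy rule, which is our illustrative stand-in and not the argument behind the lemma: random $s$ -sparse binary vectors are kept whenever they are at Hamming distance at least $s/2$ from all vectors kept so far. Function names and the example sizes are hypothetical.

```python
import math
import random


def hamming(u, v):
    """Number of coordinates in which u and v differ."""
    return sum(a != b for a, b in zip(u, v))


def greedy_sparse_packing(d, s, n_draws, rng):
    """Keep random s-sparse vectors in {0,1}^d that are s/2-separated in Hamming distance."""
    kept = []
    for _ in range(n_draws):
        support = set(rng.sample(range(d), s))
        w = tuple(1 if j in support else 0 for j in range(d))
        if all(hamming(w, v) >= s / 2 for v in kept):
            kept.append(w)
    return kept


def gv_log_size_bound(d, s):
    """Right-hand side of property (b): (s/8) * log(1 + d / (2s))."""
    return (s / 8.0) * math.log(1.0 + d / (2.0 * s))


# Small example with s <= d/8.
packing = greedy_sparse_packing(64, 8, 200, random.Random(0))
```

For these sizes the greedy set is much larger than the guaranteed $\exp\{\frac{s}{8}\log(1+\frac{d}{2s})\}=5$ ; the lemma only asserts that a set of at least that size exists.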

Fix $0<\delta<1/4$ . To construct a $2\epsilon$ packing set $\left(\epsilon=\frac{\delta}{4}\right)$ of $S^{p-1}$ , consider the following vectors:

[TABLE]

where $w_{J}\in W$ , a subset of $\{0,1\}^{p-1}$ constructed using GV lemma. Let $\Theta^{\prime}=\{\beta_{J}:J\in\{1,2,\cdots,M\}\}\subseteq\Theta=S^{p-1}$ . For $I\neq J$ ,

[TABLE]

For notational simplicity define $m(\delta)=\sqrt{1+\delta^{2}}$ and $C=C_{n}$ . Fix $\alpha\geq 1$ . Denote ${\mathbb{P}}_{\beta_{J}}(X,Y)$ as the joint distribution of $(X,Y)$ where $X\sim\mathcal{N}(0,I_{p})$ and

[TABLE]

for all $J\in\{1,2,\cdots,M\}$ . Now, for any $\beta_{J}\in\Theta^{\prime}$ , we have $\beta_{J}^{\mathsf{T}}X=\frac{X_{1}}{m(\delta)}+\tilde{\beta}_{J}^{\mathsf{T}}\tilde{X}$ where, for any vector $a$ , $\tilde{a}=(a_{2},a_{3},\cdots,a_{p})$ . As $\|\tilde{\beta}_{J}\|_{0}=s$ by construction, we know $\tilde{\beta}_{J}^{\mathsf{T}}\tilde{X}=\frac{\delta Z_{J}}{m(\delta)}$ where $Z_{J}\sim\mathcal{N}(0,1)$ is independent of $X_{1}$ . Thus, we have $\beta_{J}^{\mathsf{T}}X=\frac{X_{1}+\delta Z_{J}}{m(\delta)}$ .
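The display defining $\beta_{J}$ is not reproduced above. One form that is consistent with every property used in the surrounding text (unit norm, the representation of $\tilde{\beta}_{J}^{\mathsf{T}}\tilde{X}$ , and the $2\epsilon$ separation with $\epsilon=\delta/4$ ) is the following; we state it as an assumption for illustration, and the paper's exact definition may differ.

```latex
\begin{aligned}
\beta_{J} &= \frac{1}{m(\delta)}\Bigl(1,\ \frac{\delta}{\sqrt{s}}\,w_{J}\Bigr)
 \quad\text{(assumed form)}, \\
\|\beta_{J}\|_{2}^{2} &= \frac{1+\delta^{2}\,\|w_{J}\|_{0}/s}{1+\delta^{2}} = 1, \\
\tilde{\beta}_{J}^{\mathsf{T}}\tilde{X}
 &= \frac{\delta}{m(\delta)}\cdot\frac{1}{\sqrt{s}}\sum_{k:\,w_{J,k}=1}\tilde{X}_{k}
 = \frac{\delta Z_{J}}{m(\delta)}, \qquad Z_{J}\sim\mathcal{N}(0,1), \\
\|\beta_{I}-\beta_{J}\|_{2}^{2}
 &= \frac{\delta^{2}\,d_{H}(w_{I},w_{J})}{s\,(1+\delta^{2})}
 \ \ge\ \frac{\delta^{2}}{2(1+\delta^{2})}
 \ \ge\ \frac{\delta^{2}}{4}
 \quad (0<\delta<1/4).
\end{aligned}
```

Under this form $\|\beta_{I}-\beta_{J}\|_{2}\geq\delta/2=2\epsilon$ , using $d_{H}(w_{I},w_{J})\geq s/2$ from the Gilbert-Varshamov Lemma.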

Lemma 6.7.

The above family of distributions satisfies the margin assumption (Assumption 2.2) for all $0<t<1/4$ .

Proof.

Fix any $0<t<\frac{1}{4}$ .

[TABLE]

∎

Define the event $A_{I}=\left\{|\beta_{I}^{\mathsf{T}}X|\leq\left(C\delta\vee\frac{|X_{1}|}{2m(\delta)}\right)\wedge\frac{1}{4}\right\}$ . Then we have the following lemma:

Lemma 6.8.

If $X\in A_{I}\cup A_{J}$ , then

[TABLE]

The proof follows the same arguments as that of Lemma B.7 and is skipped. Next, we upper-bound the KL divergence:

Lemma 6.9.

For any $I\neq J\in\{1,2,\cdots,M\}$ , we have

[TABLE]

where $\phi$ is the standard normal density.

Proof.

[TABLE]

We analyze each summand separately, starting with $S_{1}$ .

[TABLE]

Now, on to $S_{2}$ :

[TABLE]

Combining equations 6.2, 6.3 and 6.4 we conclude that:

[TABLE]

∎

The final step is a direct application of Fano’s inequality. According to our construction, $\Theta^{\prime}$ is a $2\epsilon$ packing set with $\epsilon=\frac{\delta}{4}$ . For notational simplicity, set

[TABLE]

The upper bound on the KL divergences, in conjunction with Fano’s inequality, gives:

[TABLE]

Taking $\delta=\left(\frac{\frac{s}{64}\log{\frac{p}{s}}}{nU_{c}}\right)^{\frac{1}{3}}C^{\frac{2}{3}}$ , we have:

[TABLE]

the last inequality holding true when $\log{2}\leq\frac{s}{128}\log{\frac{p}{s}}$ , which is true for all large $s,p$ as $s\log{\frac{p}{s}}\rightarrow\infty$ .

The other inequality (i.e. we cannot estimate at a better rate than $(s_{0}\log{(p/s_{0})}/n)$ ) essentially follows from the same argument by taking $C_{n}=0$ . We skip the details here for the sake of brevity. $\Box$

Appendix A Some important results

In this section we state, for the convenience of the reader, some results from the existing literature that we use in our proofs. Theorem A.1 is Theorem 2 of Massart et al. [2006], which provides an exponential concentration bound on the ERM estimators for bounded loss functions. Lemma A.2 is a classical maximal inequality, which is used to bound the fluctuations of an empirical process. A simple proof of this Lemma can be found in Massart et al. [2006]. Theorem A.3 is a modified version of Theorem 8.5 of Massart, which we use for our model selection consistency results in the case $p\gg n$ . We provide the proof of Theorem A.3 in this supplement. Theorem A.4 is a version of Talagrand’s inequality (also known as Bousquet’s version of Talagrand’s inequality, see Bousquet [2002]), which we use to prove Theorem A.3.

Theorem A.1.

Let $\{Z_{i}=(X_{i},Y_{i})\}_{i=1}^{n}$ be i.i.d. observations taking values in the sample space $\mathcal{Z}=\mathcal{X}\times\mathcal{Y}$ and let $\mathcal{F}$ be a class of real-valued functions defined on $\mathcal{X}$ . Let $\gamma:\mathcal{F}\times\mathcal{Z}\rightarrow[0,1]$ be a loss function, and suppose that $f^{*}\in\mathcal{F}$ uniquely minimizes the expected loss function $P(\gamma(f,.))$ over $\mathcal{F}$ . Define the empirical risk as $\gamma_{n}(f)=(1/n)\sum_{i=1}^{n}\gamma(f,Z_{i})$ , and $\bar{\gamma}_{n}(f)=\gamma_{n}(f)-P(\gamma(f,.))$ . Let $l(f^{*},f)=P(\gamma(f,.))-P(\gamma(f^{*},.))$ be the excess risk. Consider a pseudo-distance $d$ on $\mathcal{F}\times\mathcal{F}$ satisfying $Var_{P}[\gamma(f,.)-\gamma(g,.)]\leq d^{2}(f,g)$ . Finally, let $C_{1}$ be the collection of all functions $\{h:\mathbb{R}^{+}\rightarrow\mathbb{R}^{+}\}$ such that $h$ is non-decreasing and continuous, with $h(x)/x$ non-increasing on $[0,\infty)$ and $h(1)\geq 1$ . Assume that:

(1) There exists $F\subseteq\mathcal{F}$ and a countable subset $F^{\prime}\subseteq F$ , such that for each $f\in F$ , there is a sequence $\{f_{k}\}$ of elements of $F^{\prime}$ satisfying $\gamma(f_{k},z)\rightarrow\gamma(f,z)$ as $k\rightarrow\infty$ , for every $z\in\mathcal{Z}$ .

(2) $d(f,f^{*})\leq\omega\left(\sqrt{l(f^{*},f)}\right)\ \forall\ f\in\mathcal{F}$ , for some function $\omega\in C_{1}$ .

(3) For every $f\in F^{\prime}$

[TABLE]

for every $\sigma>0$ such that $\phi(\sigma)\leq\sqrt{n}\sigma^{2}$ , where $\phi\in C_{1}$ .

Let $\epsilon_{*}$ be the unique positive solution of $\sqrt{n}\epsilon_{*}^{2}=\phi(\omega(\epsilon_{*}))$ . Let $\hat{f}$ be the (empirical) minimizer of $\gamma_{n}$ over $F$ and $l(f^{*},F)=\inf_{f\in F}l(f^{*},f)$ . Then, there exists an absolute constant $K$ such that for all $y\geq 1$ , the following inequality holds:

[TABLE]
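The quantity $\epsilon_{*}$ in the theorem is defined implicitly, and in simple cases it can be found numerically. The sketch below solves $\sqrt{n}\epsilon^{2}=\phi(\omega(\epsilon))$ by bisection. The choices $\phi(\sigma)=\sqrt{p}\,\sigma$ and $\omega(\epsilon)=\sqrt{\epsilon}$ are hypothetical members of $C_{1}$ picked only because the solution is then available in closed form; they are not the functions used in our proofs, and the function names are ours.

```python
import math


def solve_epsilon_star(n, phi, omega, lo=1e-12, hi=1.0, iters=200):
    """Bisection for sqrt(n) * eps^2 = phi(omega(eps)) on the bracket [lo, hi]."""
    def gap(eps):
        return math.sqrt(n) * eps * eps - phi(omega(eps))

    if not (gap(lo) < 0.0 < gap(hi)):
        raise ValueError("the root is not bracketed by [lo, hi]")
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if gap(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical illustration: phi(sigma) = sqrt(p) * sigma, omega(eps) = sqrt(eps).
n_obs, dim = 10000, 10
eps_star = solve_epsilon_star(
    n_obs,
    phi=lambda sigma: math.sqrt(dim) * sigma,
    omega=lambda eps: math.sqrt(eps),
)
```

With these illustrative choices the equation reduces to $\epsilon^{3/2}=\sqrt{p/n}$ , so the solver should return $(p/n)^{1/3}$ , which gives a simple check on the numerical routine.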

Lemma A.2 (A maximal inequality for weighted empirical process).

Let $S$ be a countable set, $u\in S$ and $a:S\rightarrow\mathbb{R}_{+}$ such that $a(u)=\inf_{t\in S}a(t)$ . Let $Z$ be a process indexed by $S$ and assume that the non-negative random variable $\sup_{t\in\mathcal{B}(\epsilon)}[Z(u)-Z(t)]$ has finite expectation for any positive number $\epsilon$ , where $\mathcal{B}(\epsilon)=\{t\in S,\ a(t)\leq\epsilon\}$ . Let $\psi$ be a non-negative function on $\mathbb{R}_{+}$ such that $\psi(x)/x$ is non-increasing on $\mathbb{R}_{+}$ and satisfies for some positive number $\epsilon_{*}$ :

[TABLE]

Then, one has, for any positive number $x\geq\epsilon_{*}$ ,

[TABLE]

Theorem A.3 (Model selection consistency).

Let $\xi_{1},\dots,\xi_{n}$ be independent observations taking their values in the measurable space $\Xi$ with common distribution $P$ . Let $\mathcal{S}$ be some set, $\gamma:\mathcal{S}\times\Xi\rightarrow[0,1]$ , be a measurable function such that for every $t\in\mathcal{S}$ , $x\rightarrow\gamma(t,x)$ is measurable. Assume that there exists some minimizer $s$ of $P(\gamma(t,\cdot))$ over $\mathcal{S}$ and define $\ell(s,t)$ as the excess risk:

[TABLE]

for every $t\in\mathcal{S}$ . Let $\gamma_{n}$ be the empirical risk:

[TABLE]

and $\bar{\gamma}_{n}$ be corresponding centered empirical process defined by

[TABLE]

Let $d$ be some pseudo-distance on $\mathcal{S}\times\mathcal{S}$ such that

[TABLE]

Let $\{S_{m}\}_{m\in\mathcal{M}}$ be some at most countable collection of subsets of $\mathcal{S}$ , each model $S_{m}$ admitting some countable subset $S^{\prime}_{m}$ such that for every $t\in S_{m}$ , there exists some sequence $\{t_{k}\}_{k\geq 1}$ of elements of $S^{\prime}_{m}$ satisfying $\gamma(t_{k},\xi)\to\gamma(t,\xi)$ as $k\to\infty$ , for every $\xi\in\Xi$ . Let $\omega$ and $\phi_{m}$ belong to the class of functions $C_{1}$ (defined in Theorem A.1) for all $m\in\mathcal{M}$ . Assume that, on the one hand,

[TABLE]

and on the other hand one has for every $m\in\mathcal{M}$ and $u\in S^{\prime}_{m}$ :

[TABLE]

for every positive $\sigma$ such that $\phi_{m}(\sigma)\leq\sqrt{n}\sigma^{2}$ . Let $\epsilon_{m}$ be the unique solution of the equation:

[TABLE]

with $\epsilon_{m}\leq 1\ \forall\ m\in\mathcal{M}$ . Let $\hat{s}_{m}\in S_{m}$ be the empirical minimizer:

[TABLE]

and $\{x_{m}\}_{m\in\mathcal{M}}$ be some family of nonnegative weights such that

[TABLE]

Consider a penalty function pen: $\mathcal{M}\rightarrow\mathbb{R}_{+}$ such that for every $m\in\mathcal{M}$ ,

[TABLE]

for some judiciously chosen constant $K$ . Define the chosen model as $\hat{m}$ , i.e.:

[TABLE]

Also, define $m_{(1)}=\arg\min_{m\in\mathcal{M}}\epsilon_{m}$ and $b(n)=\omega^{2}\left(\epsilon_{m_{(1)}}\right)/\epsilon^{2}_{m_{(1)}}$ . Then the penalized estimator $\tilde{s}=\hat{s}_{\hat{m}}$ satisfies the following inequality:

[TABLE]

where the constants $C,C_{1}$ depend on $K$ . This immediately implies:

[TABLE]

for some constant $C_{2}$ depending on $C,C_{1}$ .

Theorem A.4 (Bousquet’s version of Talagrand inequality).

Let $\mathcal{F}$ be a countable family of measurable functions such that for some positive constants $v,b$ one has for all $f\in\mathcal{F}$ , $\mbox{Var}_{P}(f)\leq v$ and $\|f\|_{\infty}\leq b$ . Then for all $y\geq 0$ :

[TABLE]

where $Z=\sup_{f\in\mathcal{F}}\left({\mathbb{P}}_{n}-P\right)f$ .

Remark A.5.

The result above extends to an uncountable family $\mathcal{F}$ if there exists a countable $\mathcal{F}^{\prime}\subset\mathcal{F}$ with the property that for every $\tilde{f}\in\mathcal{F}$ , there is a sequence $\{\tilde{f}_{j}\}$ belonging to $\mathcal{F}^{\prime}$ such that $\tilde{f}_{j}(\cdot)\rightarrow\tilde{f}(\cdot)$ pointwise. This is indeed the case for all applications of this result in our paper.

Appendix B Proofs of Theorems and Lemmas

B.0.1 Proof of Proposition 2.4

To prove Proposition 2.4, we first relate the excess risk $S(\beta^{0})-S(\beta)$ to $d_{\Delta}(\beta,\beta^{0})$ . Define for notational simplicity:

[TABLE]

for $\beta\in S^{p-1}$ . We have,

[TABLE]

A straightforward derivative calculation implies that the supremum is attained at $t=d_{\Delta}(\beta,\beta^{0})/2C_{n}$ if $d_{\Delta}(\beta,\beta^{0})<2t^{*}C_{n}$ , and at $t=t^{*}$ , where it equals $t^{*}\left(d_{\Delta}(\beta,\beta^{0})-C_{n}t^{*}\right)$ , otherwise. Hence we conclude:

[TABLE]
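The derivative calculation can be spelled out as follows. The display being maximized is not reproduced above, so the objective below is inferred from the maximizer and the boundary value quoted in the text; it should be read as an assumption, correct at most up to a constant factor.

```latex
\begin{aligned}
g(t) &= t\,\bigl(d_{\Delta}(\beta,\beta^{0}) - C_{n}t\bigr), \qquad 0\le t\le t^{*}
 \quad\text{(assumed objective)}, \\
g'(t) &= d_{\Delta}(\beta,\beta^{0}) - 2C_{n}t = 0
 \;\Longrightarrow\; t = \frac{d_{\Delta}(\beta,\beta^{0})}{2C_{n}}, \\
\sup_{0\le t\le t^{*}} g(t) &=
\begin{cases}
\dfrac{d_{\Delta}^{2}(\beta,\beta^{0})}{4C_{n}}, & d_{\Delta}(\beta,\beta^{0}) < 2t^{*}C_{n}, \\[2ex]
t^{*}\bigl(d_{\Delta}(\beta,\beta^{0}) - C_{n}t^{*}\bigr)\ \ge\ \dfrac{t^{*}}{2}\,d_{\Delta}(\beta,\beta^{0}), & \text{otherwise.}
\end{cases}
\end{aligned}
```

The last inequality uses $C_{n}t^{*}\leq d_{\Delta}(\beta,\beta^{0})/2$ in the second case, so the excess risk is bounded below by a quadratic in $d_{\Delta}$ for small $d_{\Delta}$ and by a linear function otherwise.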

Combining this with Assumption 2.3 we conclude:

[TABLE]

which completes the proof.

B.1 Some sufficient conditions for Assumptions 2.3 and 2.8

In this subsection we provide some sufficient conditions for Assumptions 2.3 and 2.8. We break the analysis into two lemmas. Lemma B.1 below provides sufficient conditions for Assumption 2.3 and part (i) of Assumption 2.8. Lemma B.2 yields sufficient conditions for part (ii) of Assumption 2.8.

Lemma B.1.

Suppose that $X_{p\times 1}$ follows an elliptically symmetric distribution centered at 0, with density $f_{X}(x)=|\Sigma_{p}|^{-1/2}g(x^{T}\Sigma_{p}^{-1}x)$ , where $g$ is a non-negative function. Assume that:

[TABLE]

where $c_{\lambda}$ does not depend on $n,p$ . Then $X$ satisfies Assumption 2.3 and part (i) of Assumption 2.8.

Proof.

First, we prove the result for $X\sim\mathcal{N}(0,\Sigma_{p})$ under the condition displayed above. Observe that ${\mathbb{P}}_{X}(\text{sgn}(X^{\mathsf{T}}\beta)\neq\text{sgn}(X^{\mathsf{T}}\beta^{0}))$ depends on the two-dimensional geometry of $X$ , i.e. only on the distribution of $(X^{\mathsf{T}}\beta,X^{\mathsf{T}}\beta^{0})$ . To make the calculations easier, we transform $X$ into $Y$ , where the first two coordinates of $Y$ correspond to $(X^{\mathsf{T}}\beta,X^{\mathsf{T}}\beta^{0})$ . Consider the following orthogonal matrix:

[TABLE]

where $\beta^{0^{\prime}},\frac{\beta^{\prime}-\langle\beta^{0},\beta\rangle\beta^{0^{\prime}}}{\sqrt{1-\langle\beta^{0},\beta\rangle^{2}}},v_{3},\dots,v_{p}$ form an orthonormal basis of $\mathbb{R}^{p}$ (for example, the vectors $v_{3},\dots,v_{p}$ can be constructed using the Gram-Schmidt algorithm). If we define $Y=PX$ , then $Y_{1}=X^{\mathsf{T}}\beta^{0}$ and $X^{\mathsf{T}}\beta=a_{1}Y_{1}+a_{2}Y_{2}$ where $a_{1}=\langle\beta^{0},\beta\rangle,a_{2}=\sqrt{1-\langle\beta,\beta^{0}\rangle^{2}}$ . Then the probability of the wedge-shaped region becomes:

[TABLE]

Now, for $\|\beta-\beta^{0}\|=\delta$ , we have $a_{1}=\langle\beta,\beta^{0}\rangle=1-\frac{\delta^{2}}{2}$ . Hence we get,

[TABLE]

Using this we obtain,

[TABLE]

It can be easily seen (e.g. by differentiating) that the function $\frac{\tan^{-1}\left[-\frac{1-\frac{\delta^{2}}{2}}{\delta\sqrt{1-\frac{\delta^{2}}{4}}}\right]+\frac{\pi}{2}}{\delta}$ is an increasing function of ${\delta}$ for $0\leq\delta\leq 2$ . More precisely, observing that

[TABLE]

we conclude, for $0\leq\delta\leq 2$ :

[TABLE]
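The monotonicity claim can also be checked numerically. Writing $\theta$ for the angle between $\beta$ and $\beta^{0}$ , so that $\cos\theta=1-\delta^{2}/2$ and $\delta=2\sin(\theta/2)$ , the numerator $\tan^{-1}[\cdot]+\pi/2$ equals $\theta$ , and the function is $\theta/\delta$ . The sketch below, with function names of our choosing, evaluates both forms on a grid; it is a sanity check rather than a proof.

```python
import math


def wedge_ratio(delta):
    """(atan(-a1/a2) + pi/2) / delta with a1 = 1 - delta^2/2, a2 = delta*sqrt(1 - delta^2/4)."""
    a1 = 1.0 - delta ** 2 / 2.0
    a2 = delta * math.sqrt(1.0 - delta ** 2 / 4.0)
    # atan2 agrees with atan(-a1/a2) for a2 > 0 and handles a2 = 0 at delta = 2.
    return (math.atan2(-a1, a2) + math.pi / 2.0) / delta


def angle_form(delta):
    """theta / delta, where theta = 2*asin(delta/2) is the angle between the two unit vectors."""
    return 2.0 * math.asin(delta / 2.0) / delta


grid = [k / 100.0 for k in range(1, 201)]  # delta in (0, 2]
values = [wedge_ratio(d) for d in grid]
```

On this grid the values increase from just above 1 (as $\delta\to 0$ ) to $\pi/2$ (at $\delta=2$ ).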

In conjunction with B.1, this gives:

[TABLE]

Finally using the fact that

[TABLE]

we have $\sqrt{\frac{\lambda_{2}}{\lambda_{1}}}\geq\sqrt{c_{\lambda}}$ . Combining these, we conclude:

[TABLE]

Now, on to general $X$ . By our assumption on $X$ in the statement of the lemma, $X\sim\mathcal{E}(0,\Sigma_{p})$ i.e. $X=\Sigma_{p}^{1/2}Y$ for some spherically symmetric random variable $Y$ . We know

[TABLE]

for some $g:\mathbb{R}^{+}\rightarrow\mathbb{R}^{+}$ with $Z\sim\mathcal{N}(0,I_{p})$ . Using the relation we have:

[TABLE]

which again falls back to $\mathcal{N}(0,\Sigma_{p})$ situation. The upper bound can be established via a similar calculation, where we need a finite upper bound on $\sup_{p}\frac{\lambda_{max}(\Sigma_{p})}{\lambda_{min}(\Sigma_{p})}$ : this is given by $\frac{1}{c_{\lambda}}$ . ∎

Lemma B.2.

Assume that the function $\eta(x)$ satisfies that

[TABLE]

for some constant $k$ a.e. with respect to the measure of $X$, and that the distribution of $X$ follows a consistent family of elliptical distributions with $f_{X}(x)=|\Sigma_{p}|^{-1/2}g_{p}(x^{T}\Sigma_{p}^{-1}x)$. Also assume that $g_{2}$ (the density component corresponding to the two-dimensional marginal of $X$) is a decreasing function on $R$ and that the eigenvalues of the orientation matrix $\Sigma_{p}$ satisfy:

[TABLE]

for all $p$ . Then, under part (i) of Assumption 2.8 we have

[TABLE]

for all $\beta\in S^{p-1}$ where $u_{+}=4\pi kk_{1}\frac{\lambda_{+}}{\sqrt{\lambda_{-}}}$ for some constant $k_{1}$ defined in the proof.

Proof.

As in the proof of proposition 2.4 we have (with the same notation):

[TABLE]

∎

Remark B.3.

The Lipschitz-type condition, i.e. $|\eta(x)-1/2|\leq k|x^{T}\beta^{0}|$, controls how the function varies around the true hyperplane. This condition is easily satisfied if we assume that the conditional density of $\epsilon$ given $X$ has a uniform upper bound over all $x$ and all dimensions. Note that the two conditions in the above lemma and Assumption 2.8 are readily satisfied, for example, for a broad class of elliptically symmetric densities centered at 0.

B.2 Proof of Theorem 2.6

In this proof, $K$ will denote a generic constant (not depending on $(n,C_{n},p)$ ) which may change from line to line. We use Theorem A.1 to establish the rate of convergence of the maximum score estimator. In our problem, the set of classifiers

[TABLE]

and $\mathcal{Z}=\mathbb{R}^{p}\times\{-1,1\}$ . We define the following affine transformations of our score functions:

1. $\gamma(f_{\beta},(X,Y))=(1-Y\text{sgn}(X^{T}\beta))/2$.

2. $\gamma_{n}(f_{\beta})=(1-S_{n}(\beta))/2$.

3. $P(\gamma(f_{\beta},.))=(1-S(\beta))/2$.

4. $\bar{\gamma}_{n}(f_{\beta})=-(S_{n}(\beta)-S(\beta))/2$.
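The affine transformations listed above turn the score into a classification loss: since $Y\,\text{sgn}(X^{T}\beta)\in\{-1,1\}$, the quantity $(1-Y\text{sgn}(X^{T}\beta))/2$ is the indicator of a misclassification. The following sketch checks this on simulated data. It assumes that $S_{n}(\beta)$ is the empirical average of $Y_{i}\,\text{sgn}(X_{i}^{T}\beta)$ and uses the convention $\text{sgn}(0)=1$; the data-generating choices and function names are purely illustrative.

```python
import numpy as np

def sgn(v):
    """Sign with the (assumed) convention sgn(0) = +1."""
    return np.where(v >= 0, 1.0, -1.0)

def empirical_score(beta, X, Y):
    """S_n(beta): average of Y_i * sgn(X_i' beta) (assumed definition)."""
    return float(np.mean(Y * sgn(X @ beta)))

def misclassification_rate(beta, X, Y):
    """Fraction of observations with Y_i != sgn(X_i' beta)."""
    return float(np.mean(Y != sgn(X @ beta)))

rng = np.random.default_rng(0)
n, p = 500, 4
X = rng.standard_normal((n, p))
beta_true = np.ones(p) / np.sqrt(p)
Y = sgn(X @ beta_true + rng.standard_normal(n))   # binary response in {-1, +1}

beta = rng.standard_normal(p)
beta /= np.linalg.norm(beta)

gamma_n = (1.0 - empirical_score(beta, X, Y)) / 2.0
zero_one = misclassification_rate(beta, X, Y)
```

Here `gamma_n` and `zero_one` coincide, which is why maximizing the score is equivalent to empirical risk minimization under the 0-1 loss.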

Also, note that $f^{*}$ in Theorem A.1 is $f_{\beta^{0}}$ in our situation and the excess risk is $l(f_{\beta^{0}},f_{\beta})=(S(\beta^{0})-S(\beta))/2$ . Next we argue that the assumptions of Theorem A.1 hold in our situation. For the first assumption, take $\mathcal{F}=F$ and take $F^{\prime}=\{f_{\beta}\in\mathcal{F}:\beta\in S_{1}\}$ where $S_{1}$ is a countable dense subset of $S^{p-1}$ . It is easy to check that the convergence criterion in condition (1) of Theorem A.1 is satisfied on the set $\mathcal{X}_{0}\times\{-1,1\}$ where $\mathcal{X}_{0}$ is the set of all $x$ such that $\beta^{T}x\neq 0$ for all $\beta\in S_{1}$ . Since the random variable $X$ is continuous and $S_{1}$ is countable, $\mathcal{X}_{0}$ has probability 1, and this is sufficient for the conclusions of the theorem to hold. Also note that the collection $\mathcal{F}$ is VC class of functions with VC dimension $V\lesssim p$ .

We apply Theorem A.1 with the distance metric $d_{\Delta}$ . From Proposition 2.4:

[TABLE]

Next, we construct a function $\omega$ which satisfies condition (2) of Theorem A.1 with respect to the distance $\sqrt{d_{\Delta}}$ . Note that we need $\omega$ to satisfy:

[TABLE]

or inverting it,

[TABLE]

Hence, from Proposition 2.4 we need $\omega$ to satisfy:

[TABLE]

which further implies:

[TABLE]

Parametrizing $\sqrt{d_{\Delta}(\beta,\beta^{0})}=t$ we have:

[TABLE]

Hence inverting:

[TABLE]

which immediately implies $\omega\in C_{1}$ as defined in Theorem A.1.

It also follows that this pseudo-distance $\sqrt{d_{\Delta}}$ provides an upper bound on the variability of the difference between the loss functions at any two $\beta_{1},\beta_{2}\in S^{p-1}$ :

[TABLE]

Finally we need to find $\phi$ which satisfies condition (3) of Theorem A.1. As $\mathcal{F}$ is a VC class of functions, we can follow the same line of argument in Section 2.4 of Massart et al. [2006]:

[TABLE]

for all $\sigma\leq 1$ . The quantity $V$ in the above display is the VC-dimension of the class of all half-spaces in $\mathbb{R}^{p}$ where $V\lesssim p$ . Solving the equation $\sqrt{n}\epsilon_{*}^{2}\geq\phi(\omega(\epsilon_{*}))$ we get:

[TABLE]

Hence we need to find $\epsilon_{*}$ such that:

[TABLE]

Solving these two inequalities and ignoring constants we get:

[TABLE]

Using the above $\epsilon_{*}$ we conclude using Theorem A.1:

[TABLE]

for all $y\geq 1$ . Here also $K$ is a different constant than before, which is now a function of some universal constant and $t^{*}$ , but it does not depend on $(n,p,s_{0})$ . Using Proposition 2.4 and equation (B.3), we get the following concentration bound:

[TABLE]

Consequently:

[TABLE]

which, along with Assumption 2.3 yields:

[TABLE]

and

[TABLE]

which can be rewritten using Assumption 2.3 as:

[TABLE]

Combining equation (B.4) and (B.5) we conclude:

[TABLE]

for all $y\geq 1$ and for some constant $K$ not depending on $n,p$, with $r_{n}=\epsilon_{*}^{-1}$. This completes the proof of the concentration bound.

The upper bound on the expectation follows from this exponential tail bound using the following calculation:

[TABLE]

which completes the proof of minimax upper bound.

B.3 Proof of Theorem 2.11

To obtain a lower bound on the minimax error, we use Assouad’s Lemma (Assouad [1983]), which we state below for convenience:

Lemma B.4.

[Assouad’s Lemma]

Let $\Omega=\{0,1\}^{m}$ (or $\{-1,1\}^{m}$) be the set of all binary sequences of length $m$. Let $P_{\omega},\omega\in\Omega$ be a set of $2^{m}$ measures on some space $\{\mathcal{X},A\}$ and let the corresponding expectations be ${\mathbb{E}}_{\omega}$. Then:

[TABLE]

where $\hat{\omega}$ is an estimator based on $n$ i.i.d. observations $z_{1},\dots,z_{n}\sim P_{\omega}$, $P^{n}_{\omega}$ denotes the $n$-fold product measure of $P_{\omega}$, $d_{H}$ is the Hamming distance and $\omega\sim\omega^{\prime}$ means $d_{H}(\omega,\omega^{\prime})=1$. (For some discussions and applications of this lemma, see Tsybakov [2009].)

To apply this lemma in our model, define for small $\epsilon>0$ :

[TABLE]

We will motivate the choice of $\epsilon$ in the later part of the proof. Observe that $\|\gamma\|_{2}$ is the same for all $\gamma\in\Theta$ and equals $\sqrt{1+(p-1)\epsilon^{2}}$. For notational simplicity, define $m(\epsilon)=\sqrt{1+(p-1)\epsilon^{2}}$. Now, for any $\omega\in\{-1,1\}^{p-1}$, define $\gamma_{\omega}=(1,\epsilon\omega)$ and $\beta_{\omega}=\gamma_{\omega}/\|\gamma_{\omega}\|_{2}$. This establishes a 1-1 correspondence between $\Omega$ and $\tilde{\Theta}$, with $m=p-1$. For any $\beta\in\tilde{\Theta}$ define the joint distribution $P_{\beta}$ of $(X,Y)$ as:

1. $X\sim\mathcal{N}(0,I_{p})$.

2. $P_{\beta}(Y=1|X)=\begin{cases}\frac{1}{2}+\frac{1}{C_{n}}\beta^{\prime}X,&\text{if }|\beta^{\prime}X|\leq\left[C_{n}\epsilon\sqrt{p}\vee\frac{|X_{1}|}{2m(\epsilon)}\right]\wedge 1/4.\\ \frac{1}{2}+\left(\left[\epsilon\sqrt{p}\vee\frac{|X_{1}|}{2C_{n}m(\epsilon)}\right]\wedge 1/4\right)\text{sgn}(\beta^{\prime}X),&\text{otherwise}.\end{cases}$
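The geometry of the hypercube family $\beta_{\omega}=\gamma_{\omega}/\|\gamma_{\omega}\|_{2}$ can be illustrated with a short sketch. It checks that $\|\gamma_{\omega}\|_{2}=m(\epsilon)$ and that, by direct computation, $\|\beta_{\omega}-\beta_{\omega^{\prime}}\|_{2}^{2}=4\epsilon^{2}d_{H}(\omega,\omega^{\prime})/m(\epsilon)^{2}$, which is the kind of link between Hamming and $\ell_{2}$ distance that Assouad's lemma requires. The values of $p$ and $\epsilon$ and the helper names are assumptions made for illustration only.

```python
import numpy as np

def m_eps(p, eps):
    """m(eps) = sqrt(1 + (p - 1) * eps^2)."""
    return np.sqrt(1.0 + (p - 1) * eps**2)

def beta_omega(omega, eps):
    """beta_omega = gamma_omega / ||gamma_omega||_2 with gamma_omega = (1, eps * omega)."""
    gamma = np.concatenate(([1.0], eps * np.asarray(omega, dtype=float)))
    return gamma / np.linalg.norm(gamma)

rng = np.random.default_rng(0)
p, eps = 8, 0.05
omega = rng.choice([-1, 1], size=p - 1)
omega_prime = omega.copy()
omega_prime[:3] *= -1                      # flip three coordinates

b, b_prime = beta_omega(omega, eps), beta_omega(omega_prime, eps)
hamming = int(np.sum(omega != omega_prime))
gamma_norm = np.linalg.norm(np.concatenate(([1.0], eps * omega)))
sq_dist = float(np.sum((b - b_prime) ** 2))
```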

The Gaussian distribution of $X$ trivially satisfies Assumption (A2). In the following lemma we show that this construction also satisfies Assumption (A1).

From now on, we define $C=C_{n}$ for notational simplicity.

Lemma B.5.

The above construction satisfies a part of Assumption 1, i.e.

[TABLE]

Proof.

Fix $t$ such that $0<t<\frac{1}{4}$ . Then,

[TABLE]

The last inequality is valid when $m(\epsilon)\leq 2$ , which happens for $\epsilon\sqrt{p}$ sufficiently small. ∎

We use the notation $\beta\sim_{j}\beta^{\prime}$ if $\beta$ and $\beta^{\prime}$ differ only in the $j^{th}$ position for $2\leq j\leq p$. So, in order to use Assouad’s lemma, we need an upper bound on $\|P^{n}_{\beta}-P^{n}_{\beta^{\prime}}\|_{TV}$ when $\beta\sim_{j}\beta^{\prime}$ for any $2\leq j\leq p$. Fix $\beta_{1}$ and $\beta_{2}$ and $j\in\{2,\cdots,p\}$ such that $\beta_{1}\sim_{j}\beta_{2}$. Using the standard relation between the total variation norm and Hellinger distance, we have:

[TABLE]

To make the minimax lower bound non-trivial, we will choose $\epsilon=\epsilon(n,p)$ in a way that ensures $H^{2}(P_{\beta_{1}},P_{\beta_{2}})\sim n^{-1}$ . Towards that, we need the following lemma:

Lemma B.6.

If $P_{1}$ = Ber( $p_{1}$ ) and $P_{2}$ = Ber( $p_{2}$ ) with $p_{1},p_{2}\in[1/4,3/4]$ , then $H^{2}(P_{1},P_{2})\leq\frac{\nu^{2}}{4\sqrt{3}s(1-s)}$ where $\nu=p_{2}-p_{1},s=(p_{1}+p_{2})/2$ .

The proof of this lemma can be found in Section B.12 of the supplement. For the rest of the proof, define

[TABLE]

for $i=1,2$ . Now,

[TABLE]

We next divide the domain of $X$ into two sub-parts and compute the corresponding values of $\nu_{X}=P_{\beta_{1}}(Y=1|X)-P_{\beta_{2}}(Y=1|X)$ , on these sub-parts.

Case 1: $X\in A_{1}\cup A_{2}$ .

Case 2: $X\in A_{1}^{c}\cap A_{2}^{c}$. Note that, in this case, $|\nu_{X}|=0$ if $\text{sign}(\beta_{1}^{\prime}X)=\text{sign}(\beta_{2}^{\prime}X)$, and $|\nu_{X}|\leq 2\left(\epsilon\sqrt{p}\vee\frac{|X_{1}|}{2Cm(\epsilon)}\right)$ otherwise.

Lemma B.7.

Under Case 1, $|\nu_{X}|=\left|P_{\beta_{1}}(Y=1|X)-P_{\beta_{2}}(Y=1|X)\right|\leq 2\epsilon|X_{j}|/Cm(\epsilon)$ where $\beta_{1}\sim_{j}\beta_{2}$ .

Proof.

First assume that, $X\in A_{1}\cap A_{2}$ . Then,

[TABLE]

Next, consider the case that $X\in A_{1}\cap A_{2}^{c}$ . Then, $|\beta_{2}^{\prime}X|>\left(C\epsilon\sqrt{p}\vee\frac{|X_{1}|}{2m(\epsilon)}\right)\wedge 1/4$ but $|\beta_{1}^{\prime}X|<\left(C\epsilon\sqrt{p}\vee\frac{|X_{1}|}{2m(\epsilon)}\right)\wedge 1/4$ . Hence,

[TABLE]

The case $X\in A_{1}^{c}\cap A_{2}$ follows in exactly the same manner, by symmetry. ∎

We are now in a position to tackle $H^{2}(P_{\beta_{1}},P_{\beta_{2}})$ as shown below.

[TABLE]

We will analyze the expectation of each summand separately. Define $\tilde{\beta}\overset{\Delta}{=}\beta_{[2:p]}$ i.e. $\tilde{\beta}$ is a vector of dimension $(p-1)$ which we obtain by removing the first co-ordinate of $\beta$ , and let $\tilde{X}$ be defined similarly in terms of $X$ . We have:

[TABLE]

For the second part, observe that,

[TABLE]

Using this observation, we get,

[TABLE]

where $K$ is an absolute constant. Putting together B.7, B.8 and B.9, we get,

[TABLE]

Set $\zeta=\frac{128}{3\sqrt{3}}\sqrt{\frac{2}{\pi}}(1+\sqrt{3})$ . If we choose $\epsilon=\left(\frac{1}{2\zeta}\right)^{\frac{1}{3}}n^{-\frac{1}{3}}p^{-\frac{1}{6}}C^{\frac{2}{3}}$ , then $E_{X}\left(H^{2}(P_{\beta_{1}},P_{\beta_{2}})\right)\leq\frac{1}{2n}+\frac{64K}{3\sqrt{3}}e^{-\frac{C^{2}m(\epsilon)^{2}p}{4}}$ . So we have

[TABLE]

for all large $n$ , as $ne^{-\frac{C^{2}m(\epsilon)^{2}p}{4}}\rightarrow 0$ . Now we can relate Hamming distance to $\ell_{2}$ distance via

[TABLE]

and use Assouad’s lemma to deduce:

[TABLE]

for some constant $\tilde{K}_{L}$ . Finally, let $\hat{\beta}$ be any estimator assuming values in $S^{p-1}$ . Define $\tilde{\beta}$ to be the projection of $\hat{\beta}$ on the hypercube, i.e.

[TABLE]

Then for any $\beta\in\tilde{\Theta}$ we have:

[TABLE]

Using this relation we can conclude that:

[TABLE]

where $K_{L}=\tilde{K}_{L}/4$ . To prove that the minimax rate cannot be improved upon $(p/n)$ , one can resort to a minimax construction taking $C=0$ . The construction and the rest of the proof follow a similar pattern as above, and are skipped for the sake of brevity.
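The choice of $\epsilon$ in the proof above can be unpacked with a small calculation. The sketch below assumes, as an inference from the stated conclusion $E_{X}(H^{2})\leq\frac{1}{2n}+\dots$ rather than from the display itself, that the leading term of the Hellinger bound has the form $\zeta\epsilon^{3}\sqrt{p}/C^{2}$. Under that assumption the chosen $\epsilon$ makes this term exactly $1/(2n)$, and the resulting squared $\ell_{2}$ separation $p\epsilon^{2}$ scales as $(p/n)^{2/3}C^{4/3}$. The numerical values of $n$, $p$ and $C$ are arbitrary.

```python
import numpy as np

# Constant zeta as set in the proof.
zeta = 128.0 / (3.0 * np.sqrt(3.0)) * np.sqrt(2.0 / np.pi) * (1.0 + np.sqrt(3.0))

def eps_choice(n, p, C):
    """epsilon = (1 / (2 zeta))^(1/3) * n^(-1/3) * p^(-1/6) * C^(2/3)."""
    return (1.0 / (2.0 * zeta)) ** (1.0 / 3.0) * n ** (-1.0 / 3.0) * p ** (-1.0 / 6.0) * C ** (2.0 / 3.0)

n, p, C = 10_000, 50, 1.0
eps = eps_choice(n, p, C)

# Assumed form of the leading Hellinger term (inferred, not quoted from the paper).
leading_term = zeta * eps**3 * np.sqrt(p) / C**2

# Order of the squared l2 separation across the hypercube.
separation = p * eps**2
rate = (2.0 * zeta) ** (-2.0 / 3.0) * (p / n) ** (2.0 / 3.0) * C ** (4.0 / 3.0)
```

The requirement that $\epsilon\sqrt{p}$ be small, used in Lemma B.5, corresponds here to $(p/n)^{1/3}C^{2/3}$ being small.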

B.4 Proof of Theorem 2.14

This proof is based on Theorem A.3. Recall that $\mathscr{M}_{i}$ is the collection of all models with $\|\beta\|_{0}\leq i$ for $1\leq i\leq L=\lfloor n/4\log{p}\rfloor$ . As mentioned previously, the model $\mathscr{M}_{i}$ has VC dimension $V_{i}\asymp i\log{(ep/i)}$ . Following the same line of argument as in the proof of Theorem 2.6 we can conclude:

[TABLE]

and the values of $\epsilon_{i}$ can be taken as:

[TABLE]

We know the function $f(x)=x\log{(t/x)}$ increases between $(0,t/e)$ and then decreases. As $L<p$ , the sequence of VC dimensions is increasing: $V_{1}\leq V_{2}\leq\cdots\leq V_{L}$ . Next, we establish that $V_{L}\leq n/e$ for all large $n$ . Assume to the contrary that $V_{L}>n/e$ . Then:

[TABLE]

which is a contradiction since the LHS goes to 1 as $n\rightarrow\infty$ . This immediately implies that:

[TABLE]

and

[TABLE]

This proves that

[TABLE]

Hence in our case,

[TABLE]
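The two facts used above, namely that $V_{i}$ is increasing in $i$ for $i\leq L<p$ and that $V_{L}\leq n/e$ for large $n$, can be illustrated numerically. The sketch takes $V_{i}=i\log(ep/i)$ exactly (the text only asserts this up to constants) and reads $L$ as $\lfloor n/(4\log p)\rfloor$; both are assumptions made for this illustration, as are the values of $n$ and $p$.

```python
import numpy as np

def vc_order(i, p):
    """V_i = i * log(e * p / i), the order of the VC dimension of the i-sparse model."""
    return i * np.log(np.e * p / i)

n, p = 1000, 100_000                       # a p >> n configuration, chosen arbitrarily
L = int(np.floor(n / (4.0 * np.log(p))))   # assumed reading of the sparsity cap
V = vc_order(np.arange(1, L + 1), p)

is_increasing = bool(np.all(np.diff(V) > 0))
below_n_over_e = bool(V[-1] <= n / np.e)
```

Monotonicity holds because $x\mapsto x\log(ep/x)$ has derivative $\log(p/x)$, which is positive for $x<p$.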

Now we need to choose a penalty function such that:

[TABLE]

If we choose

[TABLE]

then a permissible penalty function is given by $\text{Pen}(\mathscr{M}_{i})=2K\epsilon_{i}^{2}$ , provided we can show that $\sum_{i=1}^{L}e^{-x_{i}}<\infty$ . Towards that end:

[TABLE]

This ensures that our choice of the $x_{i}$'s is valid. Applying Theorem A.3 along with the penalty function $\textbf{pen}(\mathscr{M}_{i})=2K\epsilon_{i}^{2}$ we obtain the following concentration bound on the excess risk:

[TABLE]

Taking $i=s_{0}$ :

[TABLE]

Putting this back in equation (B.10) we get:

[TABLE]

which further implies:

[TABLE]

We now argue that the remainder term $b(n)/(n\epsilon_{s_{0}}^{2})\to 0$ as $n\to\infty$ . First, observe that $n\epsilon_{s_{0}}^{2}\to 0$ as $n\to\infty$ . This is because:

[TABLE]

As both $V_{s_{0}}$ and $n/V_{s_{0}}$ diverge as $n\to\infty$ , it suffices to establish $\sqrt{C_{n}}/(n\epsilon_{1}\epsilon^{2}_{s_{0}})\to 0$ , to demonstrate that $b(n)/(n\epsilon_{s_{0}}^{2})\to 0$ . Towards that end:

[TABLE]

This completes the proof of the concentration bound on the excess risk. Using Proposition 2.4 we have:

[TABLE]

Thus, we have:

[TABLE]

and

[TABLE]

Combining equations (B.12) and (B.13) we conclude:

[TABLE]

for all $t\geq 1$, for some constants $K_{1},K_{2}$ not depending on $(n,p,s_{0},C_{n})$, with $r_{n}=\epsilon_{s_{0}}^{-1}$ and $s_{n}=\frac{b(n)}{n\epsilon^{2}_{s_{0}}}\to 0$.

The proof of minimax upper bound of this estimation problem follows immediately from the exponential concentration bound on the estimation error:

[TABLE]

B.5 Proof of Corollary 2.16

In Theorem 2.14 we have established:

[TABLE]

for some constants $K_{1},K_{2}$ which do not depend on $(n,p,s_{0},C_{n})$. Taking $t=1/\sqrt{s_{n}}$ we get:

[TABLE]

because $s_{n}\to 0$ as $n\to\infty$ . Hence, if $\beta^{0}_{\min}>(K_{1}+K_{2})r_{n}^{-1}$ , we conclude from above concentration bound:

[TABLE]

which completes the proof.

B.6 Proof of Proposition 3.1

By the definition of population loss function we have:

[TABLE]

where we define $\text{rank}(\mathbf{x}_{j}^{T}\beta)$ as the rank of the scalar number $\mathbf{x}_{j}^{T}\beta$ among the $m$ numbers $\{\mathbf{x}_{1}^{T}\beta,\mathbf{x}_{2}^{T}\beta,\dots,\mathbf{x}_{m}^{T}\beta\}$ in increasing order. Our claim is that for any realization of the vectors $\mathbf{x}_{1},\dots,\mathbf{x}_{m}$ , we have:

[TABLE]

To observe this, first note that from Assumption 1.1, the ordering of the values $\{p_{j}(\mathbf{X})\}_{j=1}^{m}$ is the same as that of $\{\mathbf{x}_{j}^{T}\beta^{0}\}_{j=1}^{m}$. Hence the above inequality follows from applying the rearrangement inequality. The proof of the proposition also follows immediately from the inequality.
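The rearrangement inequality invoked above says that a sum $\sum_{j}p_{j}r_{j}$ over a fixed set of ranks $r_{j}$ is largest when the ranks are ordered in the same way as the $p_{j}$. The brute-force sketch below checks this for a small $m$. It illustrates only the inequality itself; the exact form of the population loss is not reproduced here, and the weights `probs` are random placeholders.

```python
import itertools
import numpy as np

def rank_score(probs, ranks):
    """Weighted rank sum: sum_j probs[j] * ranks[j]."""
    return float(np.dot(probs, ranks))

rng = np.random.default_rng(1)
m = 5
probs = rng.random(m)                      # placeholder weights (ties have probability zero)

# Ranks (1..m, increasing order) that agree with the ordering of probs.
aligned = np.argsort(np.argsort(probs)) + 1

# Exhaustive search over all assignments of the ranks 1..m.
best = max(rank_score(probs, perm) for perm in itertools.permutations(range(1, m + 1)))
aligned_score = rank_score(probs, aligned)
```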

B.7 Proof of Proposition 3.8

First we show that under Assumption 3.2 we can lower bound the excess risk in terms of the probability of a suitably chosen wedge shaped region. For $j\neq k\in\{1,2,\dots,m\}$ and for any $\beta\in S^{p-1}$ define the region $X_{j,k,\beta}$ as:

[TABLE]

where for any matrix $A$ , its $i^{th}$ row is denoted by $A_{i,*}$ . Then we have:

[TABLE]

Defining $d_{\Delta}(\beta,\beta^{0})$ to be:

[TABLE]

we obtain:

[TABLE]

As this inequality is true for any $0\leq t\leq t^{*}$ , optimizing the same way as in the proof of Proposition 2.4 we conclude that:

[TABLE]

This concludes the proof.

B.8 Proof of Theorem 3.9

The proof of this theorem is quite similar to the proof of Theorem 2.6. Hence we will skip some details here. As before, we work with the distance $\sqrt{d_{\Delta}(\beta,\beta^{0})}$ over the parameter space $S^{p-1}$. Borrowing the notation from Theorem A.1 and using Proposition 3.8 we have:

[TABLE]

for $\omega\in\mathcal{C}_{1}$. Hence the function $\omega(x)$ must satisfy:

[TABLE]

Parametrizing $\sqrt{d_{\Delta}\left(\beta,\beta^{0}\right)}=t$ we get:

[TABLE]

Now consider the class of functions

[TABLE]

As $f_{\beta}$ is an average of $m(m-1)$ functions, each of which belongs to a VC class of VC dimension of order $p$, the collection $\mathcal{F}$ has a bounded uniform entropy integral. More precisely, we have for any measure $Q$:

[TABLE]

as each function is bounded by $1$ and $N(\epsilon,\mathcal{F},L_{2}(Q))$ is the covering number of $\mathcal{F}$ with respect to $L_{2}(Q)$ measure. The variability of the centered function can be bounded as:

[TABLE]

Finally, to apply Theorem A.1 we need to obtain $\phi(\sigma)$ which satisfies condition (3) of that theorem. Using Theorem 8.7 of Sen [2018] one can choose $\phi(\sigma)$ as:

[TABLE]

Now from Theorem A.1 we need to find $\phi(\sigma)$ for all the values of $\sigma$ such that $\phi(\sigma)\leq\sqrt{n}\sigma^{2}$ . From the above expression of $\phi(\sigma)$ , one can immediately conclude that:

[TABLE]

Hence we can take

[TABLE]

such that condition (3) of Theorem A.1 will be satisfied for all $\sigma$ such that $\phi(\sigma)\leq\sqrt{n}\sigma^{2}$ . Now we need to solve the equation

[TABLE]

to get $\epsilon_{*}$ . From the expression of $\phi(\sigma)$ we have:

[TABLE]

The above inequality will be satisfied if:

[TABLE]

Ignoring the constant $C$, this will be satisfied if we take $\epsilon_{*}$ to be:

[TABLE]

Using this value of $\epsilon_{*}$ we conclude from Theorem A.1:

[TABLE]

for all $y\geq 1$ and for some constant $K$ which does not depend on $(n,m,p)$. Now Proposition 3.8, along with equation (B.17), implies:

[TABLE]

for all $y\geq 1$ , where

[TABLE]

and $K$ is a constant that does not depend on $(n,m,p)$. This concludes the proof.

B.9 Proof of Theorem 3.12

The proof technique is essentially similar to that of Theorem 2.14, hence we will skip some details. As in the proof of the previous theorem, the distance function that will be used heavily in the proof is:

[TABLE]

From Assumption 3.11 we confine ourselves to searching for the best model up to sparsity level $L=(nc^{2}_{1})/(4m^{2}\log{p})$. As in the proof of Theorem 2.14, we will use Theorem A.3 to show model selection consistency here. Recall that $\mathscr{M}_{i}$ is defined to be the collection of all models with $\beta$ such that $\|\beta\|_{0}\leq i$. Hence,

[TABLE]

Now $f_{\beta}$ is a sum of order $m^{2}$ many functions, each of which belongs to a class of VC dimension of order $i\log{(ep/i)}$ (as argued in Theorem 2.14). Using the same argument as in Theorem 3.9 we conclude that

[TABLE]

As before, define $V_{i}=i\log{(ep/i)}$. Using the same calculation as in the proof of Theorem 3.9 we conclude:

[TABLE]

and

[TABLE]

Hence, following a calculation similar to that in Theorem 3.9, the value of $\epsilon_{i}$ can be taken as:

[TABLE]

Now we have to take the penalty function so that it satisfies:

[TABLE]

for some constant $K$ (as mentioned in Theorem A.3), where $x_{i}$ is the same as defined in Theorem A.3. Taking $x_{i}=n\epsilon_{i}^{4}/\omega^{2}(\epsilon_{i})$, a valid choice of penalty function is:

[TABLE]

Before applying Theorem A.3 with this choice of penalty function, we need to argue that this choice of $x_{i}$ is valid, i.e.

[TABLE]

for some constant $\Sigma$ . From the definition of $x_{i}$ we have:

[TABLE]

Similar analysis as in Theorem 2.14 yields: $V_{1}\leq\dots\leq V_{L}$ and $x_{1}\leq\dots\leq x_{L}$ . Hence we have:

[TABLE]

as $n\to\infty$. Hence we can find some constant $\Sigma$ (in fact one can take $\Sigma=1$ for all large $n$) such that equation (B.19) holds. Now, using Theorem A.3 we conclude:

[TABLE]

where $b(n)=\frac{\omega^{2}(\epsilon_{1})}{\epsilon_{1}^{2}}$ . Now in the RHS of equation (B.20) we can replace the infimum by its value at $i=s_{0}$ , the true sparsity of $\beta^{0}$ and get the following concentration bound on the excess risk:

[TABLE]

Using Proposition 3.8 we conclude:

[TABLE]

where

[TABLE]

via the same argument as in Theorem 3.8 and

[TABLE]

This concludes the proof of the Theorem with $a_{n}=\sqrt{s_{n}}$ .

B.10 Proof of Theorem A.3

The proof is quite long and involved. Before going into details, we define some notation which will be frequently used throughout. Fix $y\geq 0$ and $\{x_{m}\}_{m\in\mathcal{M}}$ satisfying the condition of the theorem:

1. For all $m\in\mathcal{M}$, $s_{m}=\mathop{\rm argmin}_{t\in S_{m}}\ell(s,t)$.

2. $y^{2}_{m}=2K\left(\epsilon_{m}^{2}+\frac{\omega^{2}(\epsilon_{m})(x_{m}+y)}{n\epsilon_{m}^{2}}\right)\leq 2\textbf{pen}(M_{m})+2Ky\frac{\omega^{2}(\epsilon_{m})}{n\epsilon_{m}^{2}}\quad[\text{Definition of }\textbf{pen}]$.

3. $V_{m}=\sup_{t\in S_{m}}\frac{\left[\bar{\gamma}_{n}(s_{m^{\prime}})-\bar{\gamma}_{n}(t)\right]}{\ell(s,t)+\ell(s,s_{m^{\prime}})+y^{2}_{m}}$ for all $m\in\mathcal{M}$.

Recall that $\hat{m}$ is defined to be the optimal model, i.e.:

[TABLE]

Fix $m^{\prime}\in\mathcal{M}$ . Then by definition of $\hat{m}$ :

[TABLE]

which implies:

[TABLE]

The rest of the proof is organized as follows. We first show that:

[TABLE]

for any $m\in\mathcal{M}$ , which implies from the union bound and the fact that $\sum_{i=1}^{L}e^{-x_{i}}\leq\Sigma$ ,

[TABLE]

Now, using equation (B.22), we obtain with probability larger than $1-\Sigma e^{-y}$:

[TABLE]

Multiplying both sides by 2, we get:

[TABLE]

As this is true for any $m^{\prime}\in\mathcal{M}$ , we conclude:

[TABLE]

which implies for all $y\geq 0$ :

[TABLE]

Integrating with respect to $y$ we get the following upper bound on the expectation:

[TABLE]

Now we prove the bound (B.23), for which we use Theorem A.4. Our function class will be:

[TABLE]

First, we observe that the functions are uniformly bounded in terms of $y_{m}$ :

[TABLE]

Next, we bound the variability of the functions. Define $\omega_{1}=1\wedge\omega$ . We have:

[TABLE]

Applying Theorem A.4 yields with probability larger than $1-{\sf exp}{(-(x_{m}+y))}$ :

[TABLE]

Now we bound $E(V_{m})$ :

[TABLE]

Recall that $\omega_{1}=1\wedge\omega$. We first analyze $T_{2}$ as follows:

[TABLE]

Next we analyze $T_{1}$ using Lemma A.2:

[TABLE]

where we define $a^{2}(t)=\ell(s,t)\vee\ell(s,s_{m^{\prime}})$. We can relate $a(t)$ to $d(t,s_{m})$ in the following way:

[TABLE]

Hence we have:

[TABLE]

Using Lemma A.2 we conclude:

[TABLE]

Combining equation B.28 and B.27 we conclude that:

[TABLE]

Putting this bound in equation B.26 we have:

[TABLE]

For large enough $K$ , clearly $V_{m}\leq 1/2$ . This completes the proof. $\Box$

B.11 Discussion on Lemma 6.1

Lemma B.8.

For any fixed $x\in S^{p-1}$ , define $C(x,\epsilon)$ to be $\epsilon$ -angular spherical cap around $x$ , i.e.

[TABLE]

Then we have

[TABLE]

for $\sqrt{\frac{2}{p}}\leq\epsilon\leq 1$ . The last inequality follows from the assumption $\sqrt{\frac{2}{p}}\leq\epsilon$ .

This lemma is a well-known fact in convex geometry. Note that Lemma 6.1 and Lemma B.8 are stated on different scales, as one of them involves the angle and the other one involves the distance. In the following lemma we bridge this gap:

Lemma B.9.

For $0\leq r\leq 1$ and $p\geq 8$ , we have:

[TABLE]

Proof.

Note that $C(x,\epsilon)=D(x,r)$ where $\epsilon=(1-r^{2}/2)$ . If $r\leq 1$ and $p\geq 8$ then $\epsilon\geq\sqrt{\frac{2}{p}}$ . Hence we have:

[TABLE]

which completes the proof. ∎
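The identification $C(x,\epsilon)=D(x,r)$ with $\epsilon=1-r^{2}/2$ used in the proof rests on the identity $\|x-y\|_{2}^{2}=2-2\langle x,y\rangle$ for unit vectors. The sketch below checks the resulting equivalence on random points of the sphere. It assumes that $C(x,\epsilon)=\{y\in S^{p-1}:\langle x,y\rangle\geq\epsilon\}$ and $D(x,r)=\{y\in S^{p-1}:\|x-y\|_{2}\leq r\}$, since the displayed definitions are not reproduced in this section; the dimension, radius and sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
p, r = 5, 0.8

# A fixed centre x and a cloud of points y, all on the unit sphere S^{p-1}.
x = rng.standard_normal(p)
x /= np.linalg.norm(x)
Y = rng.standard_normal((5000, p))
Y /= np.linalg.norm(Y, axis=1, keepdims=True)

in_ball = np.linalg.norm(Y - x, axis=1) <= r      # membership in D(x, r) (assumed definition)
in_cap = Y @ x >= 1.0 - r**2 / 2.0                # membership in C(x, 1 - r^2/2) (assumed definition)

# Underlying identity for unit vectors.
sq_dist = np.sum((Y - x) ** 2, axis=1)
identity_rhs = 2.0 - 2.0 * (Y @ x)
```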

Finally using Lemma B.9 we get the upper bound on $\sigma(D(x,r))$ . The lower bound can also be found in convex geometry literature. Combining them together, we get Lemma 6.1.

B.12 Proof of Lemma B.6

Define $x=(p_{1}-q_{1})/2=\nu/2$. From the definition of Hellinger distance between two Bernoulli random variables, we get,

[TABLE]

In the second last line we use mean value theorem:

[TABLE]

for some $\tilde{x}$ between $0$ and $x$. As our parameter space is $[1/4,3/4]$, we have $p_{1}\leq 3q_{1}$ for any choice of $p_{1},q_{1}$. Hence, $\frac{|x|}{s}\leq\frac{1}{2}$ and $\frac{|x|}{1-s}\leq\frac{1}{2}$, which immediately implies $\frac{\tilde{x}_{1}^{2}}{s^{2}}\leq\frac{1}{4}$ and $\frac{\tilde{x}_{2}^{2}}{(1-s)^{2}}\leq\frac{1}{4}$, which, in turn, validates $\left(1-\frac{\tilde{x}_{1}^{2}}{s^{2}}\right)^{-1/2}\leq\frac{2}{\sqrt{3}}$ and $\left(1-\frac{\tilde{x}_{2}^{2}}{(1-s)^{2}}\right)^{-1/2}\leq\frac{2}{\sqrt{3}}$. Using this in equation B.30 we conclude:

[TABLE]
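The bound of Lemma B.6 can be sanity-checked on a grid. The sketch below assumes the convention $H^{2}(P_{1},P_{2})=1-\sqrt{p_{1}p_{2}}-\sqrt{(1-p_{1})(1-p_{2})}$ (one minus the Bhattacharyya coefficient), which is the normalization under which the constant $\frac{1}{4\sqrt{3}}$ matches the factor $\frac{2}{\sqrt{3}}$ appearing in the proof. Under the unnormalized convention $\sum(\sqrt{p}-\sqrt{q})^{2}$ the stated constant would not hold. The grid resolution is arbitrary.

```python
import numpy as np

def hellinger_sq(p1, p2):
    """H^2 for Ber(p1) vs Ber(p2), assumed convention: 1 - Bhattacharyya coefficient."""
    return 1.0 - np.sqrt(p1 * p2) - np.sqrt((1.0 - p1) * (1.0 - p2))

def lemma_bound(p1, p2):
    """nu^2 / (4 * sqrt(3) * s * (1 - s)) with nu = p2 - p1 and s = (p1 + p2) / 2."""
    nu = p2 - p1
    s = (p1 + p2) / 2.0
    return nu**2 / (4.0 * np.sqrt(3.0) * s * (1.0 - s))

grid = np.linspace(0.25, 0.75, 201)
P1, P2 = np.meshgrid(grid, grid)
gap = lemma_bound(P1, P2) - hellinger_sq(P1, P2)   # should be nonnegative on the whole grid
```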

B.13 Proof of Lemma 6.5

[TABLE]

Appendix C A discussion of the model with intercept

The binary choice model in the presence of intercept can be formulated as follows:

1. $(X,\epsilon)\overset{i.i.d.}{\sim}P$ with ${\sf med}(\epsilon|X)=0$ almost surely.

2. $Y=\text{sgn}(Y^{*})$ where $Y^{*}=\tau_{0}+X^{T}\beta^{0}+\epsilon$.
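The intercept model just described can be simulated in a few lines. The sketch below draws Gaussian covariates and Gaussian noise (one convenient choice with conditional median zero, not the only one the model allows) and evaluates an empirical score of the form $\frac{1}{n}\sum_{i}Y_{i}\,\text{sgn}(\tau+X_{i}^{T}\beta)$, which is assumed here to be the sample analogue of the population score. All numerical values and function names are illustrative.

```python
import numpy as np

def sgn(v):
    """Sign with the (assumed) convention sgn(0) = +1."""
    return np.where(v >= 0, 1.0, -1.0)

def simulate(n, tau0, beta0, rng):
    """Draw (X, Y) from Y = sgn(tau0 + X' beta0 + eps) with symmetric noise."""
    X = rng.standard_normal((n, beta0.shape[0]))
    eps = rng.standard_normal(n)           # symmetric about 0, so med(eps | X) = 0
    return X, sgn(tau0 + X @ beta0 + eps)

def empirical_score(tau, beta, X, Y):
    """Sample average of Y_i * sgn(tau + X_i' beta) (assumed form of the score)."""
    return float(np.mean(Y * sgn(tau + X @ beta)))

rng = np.random.default_rng(0)
p, n, tau0 = 5, 4000, 0.3
beta0 = np.ones(p) / np.sqrt(p)
X, Y = simulate(n, tau0, beta0, rng)

score_at_truth = empirical_score(tau0, beta0, X, Y)
score_at_flipped = empirical_score(tau0, -beta0, X, Y)   # a deliberately poor direction
```

In this simulation the score at the true parameter is clearly larger than at the flipped direction, in line with the true parameter maximizing the population score.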

The maximum score estimator can be defined as:

[TABLE]

with the population score function being $S(\tau,\beta)={\mathbb{E}}(Y\text{sgn}(\tau+X^{T}\beta))$ . In this model we can write the function $\eta(X)={\mathbb{P}}(Y=1|X)=1-F_{\epsilon|X}(-\tau-X^{T}\beta)$ . We take our parameter space to be $\{(\tau,\beta):\tau\in(-U,U)\,,\,\|\beta\|_{2}=1\}$ . For notational simplicity, define $\gamma=(\tau,\beta)$ . The transition assumption (Assumption 2.2) remains unchanged under the intercept model. Assumption 2.3 can be generalized for this model as follows:

Assumption C.1.

For all $\gamma$ sufficiently close to $\gamma^{0}$ ,

[TABLE]

Consider the linear transformation $Y=PX$ where:

[TABLE]

with $v_{3},\cdots,v_{p}$ being orthogonal extensions to a basis of $\mathbb{R}^{p}$ . Note that $Y$ depends on $\beta$ , but this will be suppressed in the notation. The following lemma presents conditions on the distribution of $X$ under which Assumption C.1 is valid.

Lemma C.2.

Suppose there exist $0<\delta<2$ and a constant $F$ such that $f_{Y_{1},Y_{2}}(y_{1},y_{2})\geq F$ for all $\{(y_{1},y_{2}):\|(y_{1},y_{2})\|_{2}\leq 2U/\zeta\}$, where $\zeta=\sqrt{1-\delta^{2}/4}$ and the bound $F=F(U,\zeta)$ is independent of $\beta$ and the dimension $p$. Then

[TABLE]

holds for $\tau\in(-U,U)$ and for all $\beta:\|\beta-\beta^{0}\|_{2}\leq\delta$ .

As the wedge condition is only valid in a neighborhood of the true $\gamma^{0}$ , we need to establish the consistency of the maximum score estimator in order to prove the rate of convergence results.

Lemma C.3.

Under Assumption 2.2 and Assumption C.1 we have

[TABLE]

when $p/n\rightarrow 0$ . Furthermore, under Assumption 2.13, the result continues to hold when $p\gg n$ .

We next argue that the rate of convergence results in (Theorem 2.6 and Theorem 2.14) hold for the intercept model by slight modifications to the previous proofs.

Theorem C.4.

Under Assumption 2.2, C.1 and 2.13 we have:

[TABLE]

where

[TABLE]

for the slowly growing regime $p/n\to 0$ , and

[TABLE]

for the fast growing regime $p\gg n$ .

Under the intercept model, our class of classifier is:

[TABLE]

The VC dimension of this class is $p+1$ (for $p\gg n$ the VC dimension is at most $(s_{0}+1)\log{(p+1)}$). By the same arguments as in the proof of Proposition 2.4 we can show that

[TABLE]

As before, we next apply Theorem A.1. The first condition of the theorem remains valid as our parameter space $(-U,U)\times S^{p-1}$ admits a countable dense subset. The distance function $d_{\Delta}(f,f^{*})$ changes to the following:

[TABLE]

The remainder of the proof remains completely unchanged as can be verified by inspection.

Remark C.5.

It is not clear whether the minimax upper bound results in Theorems 2.6 and 2.14 continue to hold in the intercept model. Recall that, to prove the minimax upper bound in these theorems, we used an exponential tail bound on the probability that $\|\hat{\beta}-\beta^{0}\|>t$ for every $t>0$, derived via Theorem A.1, using the fact that the wedge condition Assumption 2.3 held for all $\beta$. In the intercept model, the wedge condition only holds on a restricted part of the parameter space, and the exponential tail bound cannot be established for all $t$. Nevertheless, the minimax lower bound rates obtained in Theorems 2.11 and 2.18 remain exactly the same, as we can take $\tau=0$ in the minimax constructions that arise in their proofs. Of course, the space of distributions changes, as we have introduced the intercept. We can rewrite these results as follows.

Theorem C.6.

For the slowly growing regime $p/n\to 0$, we have:

[TABLE]

for some constant $K_{L}$ that does not depend on $(n,p)$ . For $C_{n}=C$ fixed, the lower bound is of the order $(p/n)^{2/3}$ . The supremum is taken over all distributions $P$ corresponding to binary response models satisfying Assumptions 2.2 and C.1 for some regression parameter $\gamma\in(-U,U)\times\mathcal{S}^{p-1}$ (viewed as a functional of $P$ ) but with $t^{*},a^{-},C_{n}$ held fixed.

Theorem C.7.

For the fast growth regime $p\gg n$ , we have:

[TABLE]

for some constant $\tilde{K}_{L}>0$ not depending on $(n,p,s_{0})$ . For the case $C_{n}=C$ fixed, the lower bound is of the order of $\left(\frac{s_{0}\log{(p/s_{0})}}{n}\right)^{2/3}$ . The supremum is taken over the same class of distributions as in Theorem C.6.

C.0.1 Which distributions satisfy Assumption C.1?

As stated in Lemma C.2, we need the joint density of $(Y_{1},Y_{2})$ to be bounded below by a positive constant to establish the lower bound. Here we show that, under fairly general restrictions, any elliptically symmetric distribution satisfies the assumption.

Lemma C.8.

Suppose the distribution of $X$ belongs to a consistent family of elliptical distributions with mean 0, i.e. the density of $X$ has the form:

[TABLE]

with $\Sigma$ being a full rank matrix. If $g_{2}$ (the density generator of the two-dimensional marginal of $X$) is a decreasing function on $\mathbb{R}^{+}$ with $g_{2}(x)>0$ for all $x$ and there exist constants $0<\lambda^{-}<\lambda^{+}<\infty$ such that

[TABLE]

for all $p$ then $X$ satisfies the assumption of Lemma C.2.

Proof.

The density of $Y=PX$ is $f_{Y}(y)=|\bar{\Sigma}|^{-\frac{1}{2}}g(y^{T}\bar{\Sigma}^{-1}y)$ where $\bar{\Sigma}=P\Sigma P^{T}$ . Then density of $(Y_{1},Y_{2})$ is $f_{Y_{1},Y_{2}}(y_{1},y_{2})=|\Sigma_{1}|^{-\frac{1}{2}}g_{2}((y_{1},y_{2})^{T}\Sigma_{1}^{-1}(y_{1},y_{2}))$ , where $\Sigma_{1}$ is the leading $2\times 2$ block of $\bar{\Sigma}$ . Now, if we confine ourselves on a ball of radius $2U/\zeta$ then:

[TABLE]

Hence $F(U,\zeta)=\frac{1}{\lambda^{+}}g_{2}\left(\frac{4U^{2}}{\zeta^{2}\lambda^{-}}\right)$ and Assumption C.1 is satisfied. ∎

Lemma C.9.

Suppose the elements of the random vector $X=(X_{1},\dots,X_{p})$ are independent and each component has a log-concave density symmetric around 0 with variance 1. Then, there exist constants $\epsilon_{0},R>0$ such that $f_{Y_{1},Y_{2}}(y_{1},y_{2})\geq\epsilon_{0}$ on a circle of radius $R$. Hence, Assumption C.1 is satisfied for all $\zeta$ such that $2U/\zeta\leq R$.

Proof.

Denote the density of $X_{i}$ as $f_{i}$ . From the strong unimodality property of log concave densities, each $f_{i}$ has mode at 0. Also we have

[TABLE]

for all $i\in\{1,2,\dots,p\}$ (see equation (2.2) of Bobkov and Chistyakov [2015]). Hence $f_{i}(0)\geq 1/\sqrt{12}$ under variance = 1. Note that, as each component of $X$ has a symmetric strongly unimodal density, so does $a^{T}X$ for any $a\in\mathbb{R}^{p}$. Consider $Y=P_{\beta}X$ as defined before Lemma C.2. Let $Z_{\phi}=Y_{1}\cos(\phi)+Y_{2}\sin(\phi)$; then $Z_{\phi}$ is also strongly unimodal with mode at $0$ (recall that the density of a linear combination of independent random variables with log-concave densities is also log-concave, and any symmetric log-concave density has mode at 0). As marginals of a log-concave density are log-concave, the density of $(Y_{1},Y_{2})$ is also log-concave, i.e.

[TABLE]

where $g$ is a log-concave function on $\mathbb{R}^{2}$ with mode at 0. Then by the Jacobian transformation:

[TABLE]

where $g_{\phi}(x)=g(-x\sin(\phi),x\cos(\phi))$ . Some properties of $g_{\phi}(x)$ ’s are immediate:

1. $g_{\phi}$ is concave.

2. $g_{\phi}$ is symmetric around 0 as $g$ is symmetric around 0.

As $(Y_{1},Y_{2})$ has a two-dimensional log-concave density in $\mathbb{R}^{2}$ and ${\mathbb{E}}(\|Y\|^{2})=2$ , there exists an absolute constant $b$ such that $f_{Y_{1},Y_{2}}(y_{1},y_{2})\leq b\ \forall\ y_{1},y_{2}\in\mathbb{R}$ (see, e.g., Ball [1988]). Next, we show that there exists a universal constant $\epsilon_{0}>0$ such that $e^{g(-x\sin(\phi),x\cos(\phi))}\geq\epsilon_{0}$ for all $|x|\leq\frac{1}{2b\sqrt{13}}$ and all $\phi\in[0,2\pi)$ , which implies $f_{Y_{1},Y_{2}}(y_{1},y_{2})\geq\epsilon_{0}$ on a circle of radius $R=\frac{1}{2b\sqrt{13}}$ . Fix $\phi\in[0,2\pi)$ . Denote $e^{g_{\phi}(1/2b\sqrt{13})}$ by $\epsilon$ . Then, due to concavity, $g_{\phi}(x)$ lies below the line joining $(0,g_{\phi}(0))$ and $\left((1/2b\sqrt{13}),g_{\phi}(1/2b\sqrt{13})\right)$ , i.e.

[TABLE]

for $x>1/(2b\sqrt{13})$ . This implies:

[TABLE]

The last equation follows from the lower bound on the mode of two dimensional log concave density (see Lemma 6 of Ball [1988]). Hence $\frac{2\epsilon}{2b\sqrt{13}(1/4\pi-\log{\epsilon})}\geq(1/\sqrt{12}-1/\sqrt{13})$ . Define $\Psi(s)=\frac{2s}{2b\sqrt{13}(1/4\pi-\log{s})}$ . As $\Psi(s)$ is strictly increasing on $(0,1)$ we conclude $\epsilon\geq\Psi^{-1}(1/\sqrt{12}-1/\sqrt{13})=\epsilon_{0}$ . This immediately implies $e^{g_{\phi}(x)}\geq\epsilon_{0}$ for $|x|\leq 1/2\sqrt{13}b$ from the fact that $g_{\phi}(x)$ has mode 0. Also the value of $\epsilon_{0}$ does not depend on $\phi$ , which implies, $e^{g_{\phi}(x)}\geq\epsilon_{0}\ \forall\ |x|\leq 1/2b\sqrt{13}\ \forall\phi\in[0,2\pi)$ . This completes the proof with $R=1/2b\sqrt{13}$ . ∎
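The bound $f_{i}(0)\geq 1/\sqrt{12}$ used at the start of this proof can be sanity-checked on familiar symmetric log-concave densities. The sketch below is only an illustration: the three distributions are our own choice (not taken from the paper), each scaled to variance 1, and the uniform density is the case in which the bound holds with equality.

```python
import math

# Value of the density at its mode 0 for three symmetric log-concave laws,
# each scaled to have variance 1. These examples are illustrative choices.

def uniform_density_at_zero():
    # Uniform on [-sqrt(3), sqrt(3)] has variance 1 and height 1 / (2 * sqrt(3)).
    return 1.0 / (2.0 * math.sqrt(3.0))

def gaussian_density_at_zero():
    # Standard normal density evaluated at 0.
    return 1.0 / math.sqrt(2.0 * math.pi)

def laplace_density_at_zero():
    # Laplace with scale s has variance 2 * s**2 and density 1 / (2 * s) at 0.
    scale = 1.0 / math.sqrt(2.0)
    return 1.0 / (2.0 * scale)

LOWER_BOUND = 1.0 / math.sqrt(12.0)

densities_at_zero = {
    "uniform": uniform_density_at_zero(),
    "gaussian": gaussian_density_at_zero(),
    "laplace": laplace_density_at_zero(),
}

for name, value in densities_at_zero.items():
    print(f"{name:8s} f(0) = {value:.4f}   lower bound = {LOWER_BOUND:.4f}")
```

The uniform case shows that the constant $1/\sqrt{12}$ cannot be improved within this class, which is why it appears explicitly in the definition of $\epsilon_{0}$ .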

C.1 Proof of Lemma C.3

First we show that as $p/n\rightarrow 0$ ,

[TABLE]

i.e. the class of functions $\mathcal{G}=\mathcal{G}_{p}=\{g_{\gamma}:{\mathbb{R}}^{p}\times\{-1,1\}\rightarrow\{-1,1\},g_{\gamma}(x,y)=y\text{sgn}(\tau+x^{T}{\beta})\}$ is a Glivenko-Cantelli class, which is equivalent to showing the following (for details see Pollard [1981]):

1. There exists $G$ , an envelope of $\mathcal{G}$ , such that $P^{*}G<\infty$ .

2. $\lim_{n\rightarrow\infty}\frac{E^{*}\left(\log{(N(\epsilon,\mathcal{G}_{M},L_{2}({\mathbb{P}}_{n})))}\right)}{n}=0$ for all $M<\infty,\epsilon>0$ , where $N(\epsilon,\mathcal{G}_{M},L_{2}({\mathbb{P}}_{n}))$ is the $\epsilon$ -covering number of the set $\mathcal{G}_{M}=\{g_{\gamma}\mathds{1}_{G\leq M}:g_{\gamma}\in\mathcal{G}\}$ with respect to the $L_{2}({\mathbb{P}}_{n})$ norm.

Clearly $G\equiv 1$ is an integrable envelope of $\mathcal{G}$ . Now $\mathcal{G}$ is a VC class of VC dimension $v=(p+1)$ . Hence, we have:

[TABLE]

for some universal constant $K$ and $0\leq\epsilon\leq 1$ . Using this, we have:

[TABLE]

if $v/n\rightarrow 0\iff p/n\rightarrow 0$ which completes the proof.
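To see numerically why $v/n\rightarrow 0$ suffices, the sketch below evaluates the logarithm of a covering-number bound of the standard VC form $Kv(16e)^{v}(1/\epsilon)^{2(v-1)}$ and divides by $n$ . This form is an assumption on our part (it is the usual bound for VC classes in Van Der Vaart and Wellner [1996]); the universal constant $K$ is unspecified, so `K=1.0` is a placeholder, and the growth rule $p=\lfloor\sqrt{n}\rfloor$ is purely hypothetical.

```python
import math

def log_covering_bound(v, eps, K=1.0):
    """Log of K * v * (16e)^v * (1/eps)^(2(v-1)); K is a placeholder constant."""
    return (math.log(K) + math.log(v)
            + v * math.log(16.0 * math.e)
            + 2.0 * (v - 1) * math.log(1.0 / eps))

def entropy_ratio(n, eps=0.1):
    """log N / n when p = floor(sqrt(n)) (hypothetical growth) and v = p + 1."""
    p = math.isqrt(n)
    v = p + 1
    return log_covering_bound(v, eps) / n

sample_sizes = [10**2, 10**4, 10**6]
ratios = [entropy_ratio(n) for n in sample_sizes]

for n, r in zip(sample_sizes, ratios):
    print(f"n = {n:>8d}   log N / n <= {r:.5f}")
```

Because the bound is linear in $v$ up to a logarithmic term, the ratio behaves like a constant times $v/n$ , so any regime with $p/n\rightarrow 0$ drives it to zero.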

In the previous step we have established that $S_{n}(\gamma)\rightarrow S(\gamma)$ uniformly over $\gamma$ . Now we need to prove $\hat{\gamma}=\mathop{\rm argmax}_{\gamma}S_{n}(\gamma)$ converges to $\gamma^{0}=\mathop{\rm argmax}_{\gamma}S(\gamma)$ . Towards that we need the following Lemma:

Lemma C.10.

Given any $0\leq\epsilon_{1}<\epsilon_{2}\leq 2$ and $\gamma_{1}$ such that $\|\gamma_{1}-\gamma^{0}\|_{2}=\epsilon_{2}$ , we can find $\gamma_{2}$ with $\|\gamma_{2}-\gamma^{0}\|_{2}\leq\epsilon_{1}$ such that

[TABLE]

We defer the proof of this lemma to the next subsection. Using the same proof as Proposition 2.4 we have:

[TABLE]

which is now true for $\|\gamma-\gamma^{0}\|_{2}\leq\delta$ under the assumptions of Lemma C.3. Suppose $0\leq\epsilon<\delta$ ; then using Lemma C.10 we have:

[TABLE]

which completes the proof for $p/n$ going to 0.

The same proof works for $p\gg n$ under our assumption $(s_{0}\log{p})/n\rightarrow 0$ , because, what is really needed in the above proof is the condition $V/n\rightarrow 0$ where $V$ is the VC dimension of the set of classifiers under consideration. When $p\gg n$ , $V=O(s_{0}\log{p})$ under the sparsity assumption, and therefore by our assumption $V/n\rightarrow 0$ in this case as well. $\Box$

C.2 Proof of Lemma C.10

Under the assumption that ${\sf med}(\epsilon|X)=0$ in our model, we have for any $\beta$ :

[TABLE]

where $X_{\gamma}=\{x:\text{sgn}(\tilde{x}^{T}\gamma)\neq\text{sgn}(\tilde{x}^{T}\gamma^{0})\}$ with $\tilde{x}=(1,x^{T})^{T}$ . Now fix $\gamma_{1}$ with $\|\gamma_{1}-\gamma^{0}\|_{2}=\epsilon_{1}$ . Define $\gamma_{2}=\frac{\lambda\gamma_{1}+(1-\lambda)\gamma^{0}}{\|\lambda\beta_{1}+(1-\lambda)\beta^{0}\|_{2}}$ for some $\lambda\in(0,1/2)$ which will be chosen later. Suppose $x\in X_{\gamma_{2}}$ :

**Case 1:** Suppose $x^{T}\gamma_{2}>0>x^{T}\gamma^{0}$ . Then

[TABLE]

**Case 2:** Suppose $x^{T}\gamma_{2}<0<x^{T}\gamma^{0}$ . Then

[TABLE]

Hence $X_{\gamma_{2}}\subseteq X_{\gamma_{1}}$ . Now $\|\lambda\beta_{1}+(1-\lambda)\beta^{0}\|_{2}\geq(1-2\lambda)$ by triangle inequality and using the fact that $\|\beta_{1}\|=\|\beta^{0}\|=1$ . Therefore,

[TABLE]

To conclude the proof we choose $\lambda$ such that $\lambda(\epsilon_{1}+2)/(1-2\lambda)=\epsilon_{2}$ i.e. $\lambda=\epsilon_{2}/(\epsilon_{1}+2+2\epsilon_{2})$ .
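The two ingredients of this proof, the inclusion $X_{\gamma_{2}}\subseteq X_{\gamma_{1}}$ and the closed-form choice of $\lambda$ , can be checked numerically. The following is a minimal sketch under our own assumptions: the dimension, the random draws, the Gaussian covariates and the values of $\epsilon_{1},\epsilon_{2}$ are illustrative, and the labeling of $\epsilon_{1},\epsilon_{2}$ follows the final display of the proof.

```python
import math
import random

def chosen_lambda(eps1, eps2):
    """Solve lambda * (eps1 + 2) / (1 - 2 * lambda) = eps2 for lambda."""
    return eps2 / (eps1 + 2.0 + 2.0 * eps2)

def norm(v):
    return math.sqrt(sum(c * c for c in v))

def random_gamma(rng, p):
    """Return (tau, beta) as one list, with the slope part beta of unit norm."""
    beta = [rng.gauss(0.0, 1.0) for _ in range(p)]
    scale = norm(beta)
    return [rng.uniform(-1.0, 1.0)] + [b / scale for b in beta]

def shrink_towards(gamma1, gamma0, lam):
    """Convex combination of gamma1 and gamma0, renormalised by its slope part."""
    mix = [lam * a + (1.0 - lam) * b for a, b in zip(gamma1, gamma0)]
    slope_norm = norm(mix[1:])
    return [c / slope_norm for c in mix]

def disagrees(gamma, gamma0, x):
    """True when sgn(x~' gamma) differs from sgn(x~' gamma0), x~ = (1, x)."""
    x_tilde = [1.0] + x
    s = sum(a * b for a, b in zip(x_tilde, gamma))
    s0 = sum(a * b for a, b in zip(x_tilde, gamma0))
    return (s > 0) != (s0 > 0)

rng = random.Random(0)
p = 3                      # illustrative dimension
eps1, eps2 = 0.4, 0.9      # illustrative values with eps1 < eps2
lam = chosen_lambda(eps1, eps2)

gamma0 = random_gamma(rng, p)
gamma1 = random_gamma(rng, p)
gamma2 = shrink_towards(gamma1, gamma0, lam)

violations = 0
for _ in range(5000):
    x = [rng.gauss(0.0, 1.0) for _ in range(p)]
    if disagrees(gamma2, gamma0, x) and not disagrees(gamma1, gamma0, x):
        violations += 1

print("lambda =", lam, " inclusion violations =", violations)
```

The inclusion holds for every $\lambda\in(0,1)$ because the normalising factor is positive, so the sign of $\tilde{x}^{T}\gamma_{2}$ equals the sign of $\lambda\tilde{x}^{T}\gamma_{1}+(1-\lambda)\tilde{x}^{T}\gamma^{0}$ ; the restriction $\lambda<1/2$ is only needed to keep the normalising factor bounded away from zero.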

C.3 Proof of Lemma C.2

From the transformation $Y=PX$ , we can write $a_{1}Y_{1}+a_{2}Y_{2}=X^{T}\beta^{0}$ and $X^{T}\beta=a_{1}Y_{1}-a_{2}Y_{2}$ where $a_{1}=\frac{1}{2}\|\beta+\beta^{0}\|_{2},a_{2}=\frac{1}{2}\|\beta-\beta^{0}\|_{2}$ . We divide the proof into three cases:

Case 1: Suppose $\tau\neq\tau^{0},\beta\neq\beta^{0}$ . The probability of the wedge shaped region can be written as:

[TABLE]

which is the probability of the region between the straight lines $a_{1}Y_{1}+a_{2}Y_{2}+\tau^{0}=0$ and $a_{1}Y_{1}-a_{2}Y_{2}+\tau=0$ . The intersection of these two lines is $I=(-(\tau+\tau^{0})/2a_{1},(\tau-\tau^{0})/2a_{2})$ ; the line $a_{1}Y_{1}+a_{2}Y_{2}+\tau^{0}=0$ meets the $X$ -axis at $J=(-\tau^{0}/a_{1},0)$ and the line $a_{1}Y_{1}-a_{2}Y_{2}+\tau=0$ meets the $X$ -axis at $K=(-\tau/a_{1},0)$ . From our assumption $\|\beta-\beta^{0}\|_{2}\leq\delta$ we have $a_{1}=\frac{1}{2}\|\beta+\beta^{0}\|_{2}\geq\sqrt{1-\delta^{2}/4}=\zeta$ (say). Hence, $|\tau|/a_{1}\leq U/\zeta$ for all $\tau$ , indicating that the intersection points with the $X$ -axis (denoted by $J,K$ ) lie within a circle of radius $2U/\zeta$ around the origin.
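The coordinates of $I$ and the lower bound on $a_{1}$ are easy to verify by direct substitution. The sketch below does so for one hypothetical pair of unit vectors and intercepts; the numerical values are illustrative only and are not taken from the paper.

```python
import math

def norm(v):
    return math.sqrt(sum(c * c for c in v))

def half_norms(beta, beta0):
    """a1 = ||beta + beta0|| / 2 and a2 = ||beta - beta0|| / 2."""
    a1 = 0.5 * norm([b + c for b, c in zip(beta, beta0)])
    a2 = 0.5 * norm([b - c for b, c in zip(beta, beta0)])
    return a1, a2

def intersection(a1, a2, tau, tau0):
    """Point I where the two boundary lines cross in the (Y1, Y2) plane."""
    return (-(tau + tau0) / (2.0 * a1), (tau - tau0) / (2.0 * a2))

# Hypothetical unit-norm slopes and intercepts, chosen for illustration.
beta0 = [1.0, 0.0, 0.0]
beta = [0.8, 0.6, 0.0]
tau0, tau = 0.3, -0.5

a1, a2 = half_norms(beta, beta0)
y1, y2 = intersection(a1, a2, tau, tau0)

# Residuals of the two line equations evaluated at I (both should vanish).
residual_true_line = a1 * y1 + a2 * y2 + tau0
residual_candidate_line = a1 * y1 - a2 * y2 + tau

# For unit vectors, a1 equals sqrt(1 - d^2 / 4) with d = ||beta - beta0||.
distance = norm([b - c for b, c in zip(beta, beta0)])
a1_from_distance = math.sqrt(1.0 - distance ** 2 / 4.0)

print("I =", (y1, y2))
print("a1^2 + a2^2 =", a1 ** 2 + a2 ** 2)
```

Since $a_{1}=\sqrt{1-\|\beta-\beta^{0}\|_{2}^{2}/4}$ exactly for unit vectors, the inequality $a_{1}\geq\zeta$ is just monotonicity in $\|\beta-\beta^{0}\|_{2}\leq\delta$ .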

**Case 1.1:** Suppose the point $I$ is inside the circle of radius $2U/\zeta$ . The points $J,K$ are inside by definition.

Let $L$ denote the midpoint of $KJ$ . If we denote the angle $\angle KIJ$ by $\theta$ , then $\theta=2\tan^{-1}(-a_{1}/a_{2})+\pi$ (which follows directly from the slopes of the lines and from the observation that $\Delta IKJ$ is isosceles) and $\tan(\theta/2)=KL/LI$ . The length of the side $LI\leq 4U/\zeta$ (diameter of the circle), which implies:

[TABLE]

as $|a_{1}|\leq 1$ . On the other hand we have following upper bound on $\tan{(\theta/2)}$ :

[TABLE]

Combining C.2 and C.4 we have $\|\beta-\beta^{0}\|_{2}\geq\frac{\zeta^{2}}{4U}|\tau-\tau^{0}|$ . Define $L_{1}$ to be the point where extended $IK$ meets the circle and $L_{2}$ to be the point where extended $IJ$ meets the circle. ( $L_{2}$ may be equal to $J$ ). The triangle $\Delta IL_{1}L_{2}$ is inside the circle and

[TABLE]

Now $IL_{1},IL_{2}\geq U/\zeta$ as the maximum possible distance of $K,J$ from the origin is $U/\zeta$ and $I$ is on the opposite side of $L_{1},L_{2}$ with respect to the $X$ -axis. Hence, $\text{Area}(\Delta IL_{1}L_{2})\geq\frac{U^{2}}{2\zeta^{2}}\sin{\theta}$ . Next,

[TABLE]

Recall that from C.3 it is easy to see that $\sin{(\theta/2)}=\frac{1}{2}\|\beta-\beta^{0}\|_{2}$ and $\cos{(\theta/2)}=\frac{1}{2}\|\beta+\beta^{0}\|_{2}$ . Hence, we have $\text{Area}(\Delta IL_{1}L_{2})\geq\frac{U^{2}}{2\zeta}\|\beta-\beta^{0}\|_{2}$ , which implies that:

[TABLE]
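The half-angle identities behind the area bound in Case 1.1 can be checked numerically. The sketch below draws random unit vectors (an illustrative choice of ours), computes $\theta=2\tan^{-1}(-a_{1}/a_{2})+\pi$ , and compares $\sin(\theta/2)$ , $\cos(\theta/2)$ and $\sin\theta$ with $a_{2}$ , $a_{1}$ and $2a_{1}a_{2}$ respectively.

```python
import math
import random

def norm(v):
    return math.sqrt(sum(c * c for c in v))

def random_unit_vector(rng, p):
    v = [rng.gauss(0.0, 1.0) for _ in range(p)]
    scale = norm(v)
    return [c / scale for c in v]

def wedge_geometry(beta, beta0):
    """Return a1, a2 and the wedge angle theta = 2 * atan(-a1 / a2) + pi."""
    a1 = 0.5 * norm([b + c for b, c in zip(beta, beta0)])
    a2 = 0.5 * norm([b - c for b, c in zip(beta, beta0)])
    theta = 2.0 * math.atan(-a1 / a2) + math.pi
    return a1, a2, theta

rng = random.Random(0)
max_error = 0.0
for _ in range(1000):
    beta = random_unit_vector(rng, 5)    # illustrative dimension p = 5
    beta0 = random_unit_vector(rng, 5)
    a1, a2, theta = wedge_geometry(beta, beta0)
    errors = (
        abs(math.sin(theta / 2.0) - a2),       # sin(theta/2) = ||beta - beta0|| / 2
        abs(math.cos(theta / 2.0) - a1),       # cos(theta/2) = ||beta + beta0|| / 2
        abs(math.sin(theta) - 2.0 * a1 * a2),  # double-angle formula
    )
    max_error = max(max_error, *errors)

print("largest deviation over all draws:", max_error)
```

With $\sin\theta=2a_{1}a_{2}=a_{1}\|\beta-\beta^{0}\|_{2}$ and $a_{1}\geq\zeta$ , the bound $\frac{U^{2}}{2\zeta^{2}}\sin\theta\geq\frac{U^{2}}{2\zeta}\|\beta-\beta^{0}\|_{2}$ follows in one line.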

**Case 1.2:** Suppose the intersection point $I$ is outside of the circle.

Here, the length of $LI$ is $\geq\sqrt{3}\frac{U}{\zeta}$ as $I$ is outside the circle and the maximum possible distance of $L$ from the origin is $U/\zeta$ . Using this, we have:

[TABLE]

Also, from equation C.3, we obtain $\tan{(\theta/2)}\geq(1/2)\|\beta-\beta^{0}\|_{2}$ . Combining these bounds, we have $\|\beta-\beta^{0}\|_{2}\leq\frac{1}{\sqrt{3}U}|\tau-\tau^{0}|$ . Let the line $a_{1}Y_{1}-a_{2}Y_{2}+\tau=0$ cut the circle at $M_{1},M_{2}$ . Consider the triangle $\Delta M_{1}KJ$ . The area of this triangle is:

[TABLE]

where $\phi=\angle M_{1}KJ$ . By the same logic as before, $M_{1}K\geq\frac{U}{\zeta}$ , $KJ=|\tau-\tau^{0}|/a_{1}$ and $\sin{\phi}=\sin{(\tan^{-1}(a_{1}/a_{2}))}=a_{1}$ . Hence, area of $\Delta M_{1}KJ\geq\frac{U}{2\zeta}|\tau-\tau^{0}|$ . Using this, we have:

[TABLE]

**Case 2:** Suppose $\tau=\tau^{0}$ and $\beta\neq\beta^{0}$ . Then the lines $a_{1}Y_{1}+a_{2}Y_{2}=-\tau^{0}$ and $a_{1}Y_{1}-a_{2}Y_{2}=-\tau(\equiv-\tau^{0})$ meet on the $X$ -axis, i.e. $I=K=J=(-\tau^{0}/a_{1},0)$ .

Consider the triangle $\Delta IL_{1}L_{2}$ where $L_{1}$ and $L_{2}$ are the intersection points of the lines with the circle. Now the maximum possible distance of $I$ from the origin is $U/\zeta$ which implies $IL_{1},IL_{2}\geq U/\zeta$ . From C.5 we have $\sin\angle L_{1}IL_{2}\geq\zeta\|\beta-\beta^{0}\|_{2}$ . Combining these, we get:

[TABLE]

**Case 3:** Finally suppose $\beta=\beta^{0}$ and $\tau\neq\tau^{0}$ .

Consider the rectangle $\Box EIHG$ . Here $EG=IH=KL=|\tau-\tau^{0}|/a_{1}=|\tau-\tau^{0}|$ . Also $EI=GH=2EK\geq\frac{\sqrt{3}U}{\zeta}$ . Hence

[TABLE]

which establishes:

[TABLE]

Combining equations C.6, C.7, C.8 and C.9 we conclude that Assumption (A2:upper) is valid for this intercept model with $a^{-}=a_{1}^{-}\wedge a_{2}^{-}\wedge a_{3}^{-}\wedge a_{4}^{-}$ .

Appendix D Another version of rate theorem

In this section we present a version of Theorem 3.2.5 of Van Der Vaart and Wellner [1996] which provides the rate of convergence of a generic $M$ -estimator along with an exponential tail bound, under appropriate conditions. This theorem can be applied instead of Theorem A.1 to establish the rate of convergence along with a finite sample concentration bound.

Theorem D.1.

Let $\mathbb{M}_{n}$ be a stochastic process indexed by a set $\Theta$ and let $M:\Theta\rightarrow\mathbb{R}$ be a deterministic function, where $M(\theta)=Pf_{\theta}$ and $\mathbb{M}_{n}(\theta)={\mathbb{P}}_{n}f_{\theta}$ . Define $\mathcal{F}=\{f_{\theta}:\theta\in\Theta\}$ . Assume that for every $\theta$ in a neighborhood of $\theta_{0}$ :

[TABLE]

Suppose that for every $n$ and for sufficiently small $\delta$ , the centered process $\mathbb{M}_{n}-M$ satisfies:

[TABLE]

for functions $\phi_{n}$ such that $\delta\mapsto\phi_{n}(\delta)/\delta^{\alpha}$ is decreasing for some $0<\alpha<\gamma$ . Let $\{r_{n}\}$ satisfy

[TABLE]

for every $n$ . If the sequence $\hat{\theta}_{n}$ takes values in $\Theta$ , satisfies $\mathbb{M}_{n}(\hat{\theta}_{n})\leq\mathbb{M}_{n}(\theta_{0})-O_{p}(r_{n}^{-2})$ , and $d(\hat{\theta}_{n},\theta_{0})$ converges to $0$ in outer probability, then $r_{n}d(\hat{\theta}_{n},\theta_{0})=O_{p}^{*}(1)$ . If all the above conditions are valid for all $\delta$ and $\theta$ , then we do not need consistency and we obtain the following finite sample concentration bound:

[TABLE]

In addition, assume $\|f_{\theta}\|_{\infty}\leq U$ (w.l.o.g. take $U=1/4$ for simplicity of notation) for all $\theta\in\Theta$ and the existence of $0<\beta<2\gamma$ such that:

[TABLE]

Then the following exponential concentration bounds hold for all $t>(1/2)2^{\frac{\gamma+1}{\gamma-\alpha}}\vee 1/2$ :

1. If $\gamma<\beta<2\gamma$ and $\liminf_{n\rightarrow\infty}nr_{n}^{-\gamma}>0$ , then

[TABLE]

2. If $0<\beta\leq\gamma$ and $\liminf_{n\rightarrow\infty}nr_{n}^{\beta-2\gamma}>0$ , then

[TABLE]

Here the constants $C,c$ may be different in Case 1 and Case 2, but they don’t depend on $n$ .

Proof.

For simplicity, assume the conditions are valid for all $\delta$ and $\theta$ . We establish the finite sample concentration here. Fix $t>1$ and define $C_{i}=\{\theta:d(\theta,\theta_{0})\leq tr_{n}^{-1}2^{i}\}$ for $i\in\mathbb{N}$ . Also define $g_{\theta}(X)=f_{\theta_{0}}(X)-f_{\theta}(X)-Pf_{\theta_{0}}+Pf_{\theta}$ and without loss of generality assume $\|g_{\theta}\|_{\infty}\leq 1$ . We also need the following quantities to apply Talagrand’s inequality:

[TABLE]

Now we manipulate the last sum. For ease of understanding we divide the rest of the proof into three parts. First, assume $\gamma=\beta$ . From the assumption of the theorem, there exists $c>0$ such that $\liminf_{n\rightarrow\infty}nr_{n}^{-\gamma}\geq c$ . As $0<\alpha<\gamma$ , for all $t>1/2$ , $(t2^{i})^{-\gamma+\alpha}<1$ .

[TABLE]

Finally assume that $t>(1/2)2^{\frac{\gamma+1}{\gamma-\alpha}}$ . Then $2^{\gamma}(t2^{i})^{-\gamma+\alpha}<1/2$ for all $i\geq 1$ , which implies $\left(1-2^{\gamma}(t2^{i})^{-\gamma+\alpha}\right)^{2}\geq 1/4$ . Substituting this, we get:

[TABLE]

Next we bound the series for the case $0<\beta<\gamma$ . We assume here that $\liminf_{n\rightarrow\infty}nr_{n}^{\beta-2\gamma}=c>0$ , as stated in the theorem. As before, assume $t>(1/2)2^{\frac{\gamma+1}{\gamma-\alpha}}\vee 1/2$ . Then $2^{\gamma}(t2^{i})^{-\gamma+\alpha}<1/2$ for all $i\geq 1$ , which implies $\left(1-2^{\gamma}(t2^{i})^{-\gamma+\alpha}\right)^{2}\geq 1/4$ . Also, as $\beta<\gamma$ , we have $r_{n}^{\beta-\gamma}<1$ for all large $n$ . Hence we have:

[TABLE]

Finally, assume $\gamma<\beta<2\gamma$ . Then we have the assumption $\liminf_{n\rightarrow\infty}nr_{n}^{-\gamma}=c>0$ . Assume $t>(1/2)2^{\frac{\gamma+1}{\gamma-\alpha}}\vee 1/2$ . Then $2^{\gamma}(t2^{i})^{-\gamma+\alpha}<1/2$ for all $i\geq 1$ , which implies $\left(1-2^{\gamma}(t2^{i})^{-\gamma+\alpha}\right)^{2}\geq 1/4$ . Here, as $\beta>\gamma$ , we have $r_{n}^{\gamma-\beta}<1$ for all large $n$ . Then we have:

[TABLE]
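Each of the three parts of the proof relies on the shell factor $2^{\gamma}(t2^{i})^{-\gamma+\alpha}$ being below $1/2$ for every $i\geq 1$ . Solving $2^{\gamma}(2t)^{-(\gamma-\alpha)}=1/2$ for $t$ gives the critical value $\frac{1}{2}2^{(\gamma+1)/(\gamma-\alpha)}$ , and the factor only decreases in $i$ . The sketch below checks this for one illustrative pair $(\gamma,\alpha)$ ; the function names and parameter values are our own.

```python
def shell_factor(t, i, gamma, alpha):
    """The quantity 2^gamma * (t * 2^i)^(-(gamma - alpha)) for shell index i."""
    return 2.0 ** gamma * (t * 2.0 ** i) ** (alpha - gamma)

def critical_t(gamma, alpha):
    """Value of t at which the i = 1 shell factor equals exactly 1/2."""
    return 0.5 * 2.0 ** ((gamma + 1.0) / (gamma - alpha))

gamma, alpha = 2.0, 1.0                      # illustrative values, 0 < alpha < gamma
t = 1.01 * critical_t(gamma, alpha)          # any t strictly above the critical value

factors = [shell_factor(t, i, gamma, alpha) for i in range(1, 31)]
squared_terms = [(1.0 - f) ** 2 for f in factors]

print("critical t      :", critical_t(gamma, alpha))
print("largest factor  :", max(factors))
print("smallest (1-f)^2:", min(squared_terms))
```

Since the factor is decreasing in both $t$ and $i$ , checking $i=1$ at the critical value is enough; the loop over $i$ is only a sanity check.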

Remark D.2.

Here we have used the inequalities $2^{i\gamma}>i$ and $2^{i(\gamma-\beta)}>i$ for notational simplicity. For a sharper bound, one can use the fact that $a^{i}\geq ie\log{a}$ for all $i$ and all $a>1$ .
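The inequality $a^{i}\geq ie\log{a}$ is the elementary bound $e^{x}\geq ex$ applied at $x=i\log{a}$ , with equality exactly when $i\log{a}=1$ . A quick numerical check on an illustrative grid (the grid values are our own choice):

```python
import math

def gap(a, i):
    """a^i - i * e * log(a); nonnegative for every a > 1 and i >= 1."""
    return a ** i - i * math.e * math.log(a)

bases = (1.01, 1.5, 2.0, 4.0)                # illustrative bases a > 1
min_gap = min(gap(a, i) for a in bases for i in range(1, 60))

# Equality case: a = e and i = 1 give i * log(a) = 1.
equality_gap = gap(math.e, 1)

print("smallest gap on the grid:", min_gap)
print("gap at a = e, i = 1     :", equality_gap)
```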

∎

Bibliography


Abrevaya and Huang [2005] Jason Abrevaya and Jian Huang. On the bootstrap of the maximum score estimator. Econometrica, 73(4):1175–1204, 2005.
Assouad [1983] Patrice Assouad. Deux remarques sur l’estimation. Comptes rendus des séances de l’Académie des sciences. Série 1, Mathématique, 296(23):1021–1024, 1983.
Bajari et al. [2008] Patrick Bajari, Jeremy T Fox, and Stephen P Ryan. Evaluating wireless carrier consolidation using semiparametric demand estimation. Quantitative Marketing and Economics, 6(4):299, 2008.
Ball [1988] Keith Ball. Logarithmically concave functions and sections of convex sets in $\mathbb{R}^{n}$. Studia Math., 88(1):69–84, 1988.
Bickel et al. [2009] Peter J Bickel, Ya’acov Ritov, Alexandre B Tsybakov, et al. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732, 2009.
Bobkov and Chistyakov [2015] Sergey G Bobkov and Gennadiy P Chistyakov. On concentration functions of random variables. Journal of Theoretical Probability, 28(3):976–988, 2015.
Bousquet [2002] Olivier Bousquet. A Bennett concentration inequality and its application to suprema of empirical processes. Comptes Rendus Mathematique, 334(6):495–500, 2002.
Briesch et al. [2002] Richard A Briesch, Pradeep K Chintagunta, and Rosa L Matzkin. Semiparametric estimation of brand choice behavior. Journal of the American Statistical Association, 97(460):973–982, 2002.
