Optimal Linear Discriminators For The Discrete Choice Model In Growing Dimensions
Debarghya Mukherjee, Moulinath Banerjee, Ya'acov Ritov

TL;DR
This paper investigates the behavior of Manski's maximum score estimator for discrete choice models in high-dimensional settings, deriving convergence rates, bounds, and optimal estimators for different growth regimes of the dimension.
Contribution
It extends the analysis of the maximum score estimator to scenarios where the number of predictors grows with the sample size, providing convergence rates, bounds, and computational methods.
Findings
Derived $\ell_2$ convergence rates under different growth regimes.
Established minimax bounds for estimation error in high dimensions.
Proposed algorithms for computing the maximum score estimator in large dimensions.
University of Michigan, West Hall (offices 437, 275 and 462), 1085 South University, Ann Arbor, MI 48109
Abstract
Manski’s celebrated maximum score estimator for the discrete choice model, which is an optimal linear discriminator, has been the focus of much investigation in both the econometrics and statistics literatures, but its behavior under growing dimension scenarios largely remains unknown. This paper addresses that gap. Two different cases are considered: $p$ grows with $n$ but at a slow rate, i.e. $p/n \to 0$; and $p \gg n$ (fast growth). In the binary response model, we recast Manski’s score estimation as empirical risk minimization for a classification problem, and derive the $\ell_2$ rate of convergence of the score estimator under a transition condition in terms of our margin parameter that calibrates the level of difficulty of the estimation problem. We also establish upper and lower bounds for the minimax $\ell_2$ error in the binary choice model that differ by a logarithmic factor, and construct a minimax-optimal estimator in the slow growth regime. Some extensions to the general case (the multinomial response model) are also considered. Last but not least, we use a variety of learning algorithms to compute the maximum score estimator in growing dimensions.
arXiv:1903.10063
Supported by NSF Grant DMS-1712962
1 Introduction
The maximum score estimator for the discrete choice model was introduced by Charles Manski in his seminal paper Manski [1975] in connection with the stochastic utility model of choice, and has been extensively studied in both the econometrics and the statistics literatures. The binary choice model can be considered as a linear regression model with missing data. More specifically, let
$$Y_i^* = X_i^\top \beta^0 + \epsilon_i, \tag{1.1}$$
where $(X_i, \epsilon_i)$, $i = 1, \dots, n$, are i.i.d. pairs, the distribution of $\epsilon$ is allowed to depend on $X$ and has conditional median zero (i.e. $\mathrm{med}(\epsilon \mid X) = 0$), but instead of observing the full data, we only get to see $(X_i, Y_i)$ where
$$Y_i = \mathrm{sgn}(Y_i^*).$$
The regression parameter $\beta^0$ is of interest. The population score function is defined as:
$$S(\beta) = \mathbb{E}\left[Y\,\mathrm{sgn}(X^\top\beta)\right],$$
and the corresponding sample score function is:
$$S_n(\beta) = \frac{1}{n}\sum_{i=1}^{n} Y_i\,\mathrm{sgn}(X_i^\top\beta).$$
The maximum score estimator is defined as any value of $\beta$ that maximizes the sample/empirical score function:
$$\hat\beta \in \arg\max_{\beta : \|\beta\|_2 = 1} S_n(\beta).$$
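To make the definition concrete, here is a minimal sketch in two dimensions, where the unit sphere is a circle and the empirical score can simply be scanned over a grid of angles. The function names, the simulated design, the heteroscedastic error and the convention sgn(0) = 1 are all assumptions made for illustration; this is not the computational method studied in the paper.

```python
import math
import random


def sgn(v):
    # Convention sgn(0) = 1 (an assumption; ties have probability zero
    # for continuous covariates).
    return 1.0 if v >= 0 else -1.0


def sample_score(beta, xs, ys):
    """Empirical score S_n(beta) = (1/n) * sum_i y_i * sgn(x_i . beta)."""
    total = 0.0
    for x, y in zip(xs, ys):
        total += y * sgn(sum(a * b for a, b in zip(x, beta)))
    return total / len(xs)


def max_score_2d(xs, ys, n_angles=360):
    """Scan unit vectors (cos t, sin t) on an angle grid; return one maximizer."""
    best_beta, best_val = None, -math.inf
    for k in range(n_angles):
        t = 2.0 * math.pi * k / n_angles
        beta = (math.cos(t), math.sin(t))
        val = sample_score(beta, xs, ys)
        if val > best_val:
            best_beta, best_val = beta, val
    return best_beta, best_val


def simulate(n, beta0, seed=0):
    """Hypothetical binary choice data; the error has conditional median zero."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
        eps = rng.gauss(0.0, 1.0) * (0.5 + abs(x[0]))  # scale depends on x
        xs.append(x)
        ys.append(sgn(x[0] * beta0[0] + x[1] * beta0[1] + eps))
    return xs, ys


beta0 = (math.sqrt(0.5), math.sqrt(0.5))
xs, ys = simulate(1000, beta0)
beta_hat, score_hat = max_score_2d(xs, ys)
print(beta_hat, score_hat)
```

Because the empirical score is piecewise constant in the direction, a whole arc of angles typically attains the maximum; the scan returns the first one it meets, which illustrates why the maximizer is not unique.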
Note that some norm restriction on $\beta$ is important both for identifiability of $\beta^0$ in this model, as well as for meaningful optimization. As $\beta^0$ is only identifiable and estimable up to direction, in what follows, we take $\|\beta^0\|_2 = 1$. We also note that the choice of the maximizer is not important; in fact there is no unique maximizer. In follow-up work Manski [1985], Manski proved the consistency of $\hat\beta$ for the true $\beta^0$ and some large deviation results under mild assumptions. The asymptotic distributional properties of the maximum score estimator were established by Kim and Pollard (Kim et al. [1990]), who proved that under additional assumptions $\|\hat\beta - \beta^0\|_2 = O_p(n^{-1/3})$ and that the normalized difference $n^{1/3}(\hat\beta - \beta^0)$ converges in distribution to a non-Gaussian random variable that is characterized as the maximizer of a quadratically drifted Gaussian process. Shortly thereafter, Horowitz (Horowitz [1992]) established that, under smoothness conditions beyond those in Kim et al. [1990], the estimator obtained by maximizing a kernel-smoothed version of the score function attains a faster rate of convergence. One advantage of Horowitz’s estimator over the original maximum score estimator, from a practical viewpoint, is that the limit distribution in his setting is Gaussian and therefore more amenable to inference, while the quantiles of the non-Gaussian limit are hard to determine. Also, around the same time Klein and Spady (see Klein and Spady [1993]) proved that under the additional assumption , one can obtain a consistent and asymptotically normal estimator of $\beta^0$, which is also semi-parametrically efficient. More recently, Seo and Otsu (Seo et al. [2018], Seo and Otsu [2015]) have extended the asymptotic results on the score estimator to dependent data scenarios. Alternatively, resampling techniques can also be used for inference. 
Manski and Thompson (Manski and Thompson [1986]) suggested that the usual bootstrap yields a good approximation of the distribution of the maximum score estimator, but it turns out that the bootstrap is actually inconsistent, as shown in Abrevaya and Huang [2005] (but see also Sen et al. [2010]). More recently, a model-based smoothed bootstrap approach was proposed by Patra et al. [2018]. Generic ($m$ out of $n$) subsampling techniques (Politis et al. [1999]) can, of course, be used in principle, but typically suffer from imprecise coverage unless the subsample size is well chosen, which is itself a difficult problem. For applications of maximum score estimators and their variants, see Briesch et al. [2002], Fox and Bajari [2013], Bajari et al. [2008] and references therein.
Connections to empirical risk minimization: The maximum score estimator is naturally connected to a classification problem with two classes. In Manski’s problem, we have observations $(X_i, Y_i)$, where $X_i \in \mathbb{R}^p$ and $Y_i \in \{-1, 1\}$, these being the labels of the two classes. The conditional class probabilities are specified by
$$\eta(x) = P(Y = 1 \mid X = x) = P(\epsilon > -x^\top\beta^0 \mid X = x).$$
For classifying the $Y_i$’s using an arbitrary classifier $g$ under 0-1 loss, the population risk is given by $P(Y \neq g(X))$. Consider the set of classifiers corresponding to all possible hyperplanes, i.e.
$$\mathcal{G} = \left\{ g_\beta : g_\beta(x) = \mathrm{sgn}(x^\top\beta),\ \|\beta\|_2 = 1 \right\}.$$
The population risk under 0-1 loss for this family is then given by:
$$L(\beta) = P\left(Y \neq \mathrm{sgn}(X^\top\beta)\right),$$
and is consistently estimated by the empirical risk $L_n(\beta) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\left(Y_i \neq \mathrm{sgn}(X_i^\top\beta)\right)$. From the structure of the model, it is easy to see that the Bayes classifier, i.e. the classifier which minimizes the population risk in this model (over all possible classifiers), is precisely $g_{\beta^0}$:
$$g_{\beta^0}(x) = \mathrm{sgn}(x^\top\beta^0) = \mathrm{sgn}\left(\eta(x) - \tfrac{1}{2}\right).$$
Thus any minimizer of $L_n(\beta)$ empirically estimates the Bayes classifier. By simple algebra $S(\beta) = 1 - 2L(\beta)$ and $S_n(\beta) = 1 - 2L_n(\beta)$. Since the former is maximized at $\beta^0$ and the latter at $\hat\beta$, it follows that $\hat\beta$ is one particular choice for the minimizer of $L_n(\beta)$. Thus, the maximum score estimator is the minimizer of the empirical risk in this classification problem. The rate of estimation of $\beta^0$ depends on two crucial factors: (1) the manner in which $\eta(x)$ changes across the hyperplane, and (2) the distribution of the $X_i$’s near the hyperplane. If the conditional probability shifts from $1/2$ rather slowly as we move away from the hyperplane, we have a ‘fuzzier’ classification problem and estimation becomes more challenging. On the other hand, the distribution of the $X_i$’s governs the density of observed points around the hyperplane, with higher concentration of points being conducive to improved inference. To the best of our knowledge, there is no work on the high-dimensional aspect of this model, so this paper bridges a gap in the literature.
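The "simple algebra" linking score and risk can be spelled out in one line. The symbols $S(\beta)$ for the population score and $L(\beta)$ for the 0-1 risk of the hyperplane classifier are notation assumed here for illustration; the argument only uses that both $Y$ and the sign take values in $\{-1, 1\}$, so their product is $+1$ when they agree and $-1$ otherwise.

```latex
\begin{align*}
S(\beta) &= \mathbb{E}\bigl[Y\,\mathrm{sgn}(X^\top\beta)\bigr] \\
         &= \mathbb{P}\bigl(Y = \mathrm{sgn}(X^\top\beta)\bigr)
            - \mathbb{P}\bigl(Y \neq \mathrm{sgn}(X^\top\beta)\bigr) \\
         &= 1 - 2\,\mathbb{P}\bigl(Y \neq \mathrm{sgn}(X^\top\beta)\bigr)
          = 1 - 2L(\beta).
\end{align*}
```

The same computation with empirical averages in place of expectations gives the sample version, so maximizing the score and minimizing the 0-1 risk select the same directions.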
The multinomial response discrete choice model: This model, which is a natural extension of its binary counterpart, arises in practice when an individual has to choose among finitely many elements, e.g. picking out a movie among several choices proposed by Netflix. In Manski [1975], Manski also proposed an extension of the maximum score estimator for multinomial responses. We first describe the model. Assume that each individual has to choose from many alternatives, for each of which they have a utility value. Denote by , the utility value of the alternative for the individual. Hence, will choose the alternative only if it provides them maximum utility, i.e.
[TABLE]
The utility values are modeled as where is a vector of observable covariates and is an unobservable error. For notational simplicity, define an matrix for individual whose row is , the co-variate corresponding to their utility. The ’s are not observed, but we do observe a multinomial vector for each , where
[TABLE]
In words, this vector indicates which alternative has been chosen by individual . The available data on individuals are therefore the pairs which can be viewed as i.i.d replicates of a random object , with the ’th row of written as . The response vector is related to the unobserved utility vector through the linear model: .
Under certain assumptions on the distribution of the errors (see e.g. Assumption 2 of Manski [1975] or the more relaxed version, Assumption 1 of Fox [2007]), which stipulate that the joint density of the errors conditional on the covariates is exchangeable, it can be shown that the probability of choosing a given alternative is driven by the ordering of the deterministic part of the utility function. This is formalized in the rank ordering property described below.
**Assumption 1.1 (Rank ordering property).**
Define as the probability of the product having maximum utility under a generic regression parameter and conditional on being the covariate matrix:
[TABLE]
The rank ordering property says: if and only if . Note that the probability is taken over the joint distribution of given .
This motivates the estimation of the true parameter by maximizing the following score function:
[TABLE]
This is a natural generalization of the maximum score to multinomial responses. The idea is to find a that is most commensurate with the observed data. If is the observed utility for the ’th individual, only the ’th term in the inner sum is relevant, and given this information, we look for that makes the deterministic part of the ’th utility larger than those of most other utilities across all observations. Hence, with enough data, any maximizer of can be expected to be close to with high probability under Assumption 1.1.
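The idea in the paragraph above can be sketched in code. The exact form of the multinomial score is not recoverable from the text here, so the version below is an assumption: for each individual it counts how many of the other alternatives have a smaller deterministic utility than the chosen one (a pairwise-comparison form). The function name and the data layout are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def multinomial_score(beta, covariates, choices):
    """Pairwise-comparison score (assumed form, for illustration only).

    covariates[i][j] is the covariate vector attached to alternative j for
    individual i; choices[i] is the index of the alternative that individual
    i actually chose.
    """
    total = 0
    for rows, chosen in zip(covariates, choices):
        utilities = [dot(row, beta) for row in rows]
        # Count the alternatives beaten by the chosen one under this beta.
        total += sum(
            1
            for k, u in enumerate(utilities)
            if k != chosen and utilities[chosen] > u
        )
    return total / len(choices)


# Two individuals, three alternatives each, two covariates per alternative.
covariates = [
    [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]],
    [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]],
]
choices = [0, 2]
print(multinomial_score([1.0, 0.0], covariates, choices))
print(multinomial_score([0.0, 1.0], covariates, choices))
```

With only two alternatives per individual the count is either 0 or 1, so the score is the fraction of individuals whose observed choice agrees with the ordering of the deterministic utilities, which matches the binary case.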
We also note that this directly reduces to the binary response model presented at the beginning, when . In this case, there are only two utility values for the individual who chooses the first option only if . Now,
[TABLE]
and hence, taking , and to be a binary response which takes value when item is chosen and [math] otherwise, we recover the binary response model as mentioned in equation (1.1) via a simple linear transformation.
There is a vast literature, especially in economics, which deals with the discrete choice model, although most of it is confined to the binary response model. Lee Lee [1995] extended the analysis of Klein and Spady [1993] for the binary response model to the multinomial case under an appropriate version of the assumptions in the latter paper to obtain a consistent and asymptotically normal semi-parametric efficient estimator. Fox (Fox [2007]) proved the consistency of the maximum score estimator for the multinomial response model under a partially missing data assumption, where the chosen utility and a subset of alternative utilities are observed, without Manski’s assumption of conditionally independent errors (Assumption 2 of Manski [1975]). Recently, Yan (Yan and Yoo [2019]) extended the analysis of Horowitz (Horowitz [1992]) to establish asymptotic normality of a kernel smoothed estimator in the multinomial model.
To the best of our knowledge, all previous work on the binary as well as the multinomial discrete choice model has been done under the setting of fixed dimensional covariates and, in the latter model, also under a fixed number of utilities. Our motivation for studying the maximum score estimator in these models is two-fold. First, the score estimator works under very mild conditions on the underlying data generating mechanism (particularly, through the flexible dependence of the error on the covariate), and is therefore robust to model misspecification, as a consequence of which it has attracted the attention of multiple researchers in both economics and statistics. Through a study of this model in growing dimensions, and results on the concentration properties of the estimator as well as minimax estimation rates in this problem, we provide a novel and interesting direction to the literature on this topic, which we hope will be carried forward by others interested in this genre of problems. Second, from a purely statistical point of view, the score estimator is one of the classic examples of non-regular estimators, which arise either through the optimization of criterion functions that are discontinuous in the parameter (note the indicator functions involved in the score functions above), or through optimization problems where the estimator falls on the boundary of the parameter space (e.g., in modern statistical problems involving convex optimization where the estimator lies on a face of a convex cone or more generally a convex set). Such estimators have been known in the literature from as early as Chernoff’s work in the 1960s (e.g. see Chernoff [1964]), and were investigated through an integrated approach by Kim and Pollard (Kim et al. [1990]) in the specific setting of ‘cube-root asymptotics’: the estimators treated in that paper exhibit an $n^{-1/3}$ convergence rate and non-Gaussian limits, and an important example in that paper was the maximum score estimator. 
There have been a variety of related developments, but all work in this arena has also been in the fixed dimension paradigm. Our current study of the score estimator, to the best of our knowledge, is the first example of a systematic study of a non-regular estimator in growing dimensions. While concentration and minimaxity properties have been dealt with quite thoroughly, inferential questions remain open, and we view our contributions as an important foray into hitherto uncharted territory, although we are only scratching the surface.
**Major findings:** Here we articulate our findings and give a brief description of the organization of the rest of the paper. We note at the outset that the $\ell_2$ metric is a natural measure of distance in this problem since the angle between two unit-norm vectors, which measures their directional divergence, is a function of the $\ell_2$ norm of their difference.
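The relation between the angle and the norm of the difference is elementary and can be written out explicitly. Here $u$ and $v$ are generic unit vectors and $\theta \in [0, \pi]$ is the angle between them; these symbols are introduced only for this illustration.

```latex
\begin{align*}
\|u - v\|_2^2 &= \|u\|_2^2 + \|v\|_2^2 - 2\,u^\top v
               = 2 - 2\cos\theta = 4\sin^2(\theta/2), \\
\|u - v\|_2   &= 2\sin(\theta/2).
\end{align*}
```

Since $\theta \mapsto 2\sin(\theta/2)$ is increasing on $[0, \pi]$, bounds on the norm of the difference translate directly into bounds on the angle, and for small angles the two are nearly equal.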
Section 2.1 deals with the moderate growth setting, i.e. $p/n \to 0$, while Section 2.2 investigates the fast growth regime: $p \gg n$. In the moderate growth setting, we establish the rate of convergence of the maximum score estimator in the $\ell_2$ norm in terms of $n$, $p$ and the margin parameter, along with an exponential concentration bound, where the margin parameter is the sequence of constants appearing in Assumption 2.2, assumed non-increasing in $n$. The magnitude of this sequence calibrates the difficulty of the estimation problem: sequences bounded away from 0 present the hardest problems, while a sequence decreasing to 0 makes the estimation problem easier, which is reflected in the convergence rate derived in Theorem 2.6. An elaborate discussion of Assumption 2.2 and comparisons to a standard low noise assumption (Assumption 2.1) are provided in Section 2. We also establish both minimax lower and upper bounds for estimating $\beta^0$ and show that the maximum score estimator is minimax optimal up to a log factor. Furthermore, when the margin parameter is held constant, which is later argued to be statistically the most interesting regime, we are able to construct an alternative estimator with minimax optimal rate of convergence.
In the $p \gg n$ regime, we demonstrate that under a sparsity constraint, an appropriate penalized risk minimization method provides a super-set of the active covariates with exponentially high probability. As before, we derive an exponential concentration bound for the penalized maximum score estimator in the $\ell_2$ norm, which now depends on the sparsity of $\beta^0$, in addition to $n$, $p$ and the margin parameter. Here also, smaller values of the margin parameter translate to improved convergence rates. We derive minimax lower and upper bounds which again differ by a log factor.
In Section 3 we deal with the multinomial response model. Assumption 1.1 guarantees the uniqueness of the population maximizer, while Assumptions 3.2 and 3.3 are modified versions of Assumptions 2.2 and 2.3 tailored for the multinomial response model. Under these modified assumptions, we establish finite sample concentration bounds for the score estimator in both the slow and the fast growth regimes. When there are only two alternatives, our rates of convergence reduce to those obtained for the binary response model in Section 2.
In Section 4, we present some simulation results for the binary choice model. As mentioned earlier, the maximum score estimator cannot be computed in polynomial time in the dimension, owing to the discontinuity of the loss function defined previously in this section. A standard approach is to compute an approximate solution by minimizing a convex surrogate of the loss, as is evident from the copious amount of work in both the statistics and machine learning literatures on this topic (see e.g. Friedman et al. [2001]): e.g., logistic regression replaces the 0-1 loss by the logit loss, SVM uses the hinge loss, while AdaBoost relies on the exponential loss. Another direction involves smoothing the loss via some distribution kernel (which makes the loss function differentiable) and computing the minimizer by some variant of gradient descent. Recently, a homotopic path following approach to this problem has been proposed in Feng et al. [2019]. We present a comparative study of three methods: SVM, logistic regression and the homotopic path following algorithm mentioned above. The main takeaway from this simulation study is that SVM performs better than logistic regression when under heterogeneity of errors, while the performance of the method proposed in Feng et al. [2019] is comparable to SVM for . As a matter of fact, the method based on homotopic path following performs somewhat better than SVM, but its run-time is also higher.
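As an illustration of the convex surrogate idea, the following sketch fits a direction by plain (sub)gradient descent on the logit loss and on an L2-regularized hinge loss, then rescales each fit to unit norm so that it can be compared with the true direction. The step sizes, iteration counts, regularization constant and simulated design are all assumptions chosen for a small demonstration; this is not the simulation setup of Section 4, and the homotopic path following method is not reproduced.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(w):
    norm = math.sqrt(dot(w, w))
    return [v / norm for v in w]


def fit_logistic(xs, ys, steps=200, lr=0.5):
    """Gradient descent on the logit loss (1/n) * sum log(1 + exp(-y x.w))."""
    p, n = len(xs[0]), len(xs)
    w = [0.0] * p
    for _ in range(steps):
        grad = [0.0] * p
        for x, y in zip(xs, ys):
            margin = y * dot(x, w)
            # sigmoid(-margin), with the exponent capped to avoid overflow
            weight = 1.0 / (1.0 + math.exp(min(margin, 50.0)))
            for j in range(p):
                grad[j] -= weight * y * x[j] / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return normalize(w)


def fit_hinge(xs, ys, steps=200, lr=0.1, lam=0.01):
    """Subgradient descent on the L2-regularized hinge loss (linear SVM objective)."""
    p, n = len(xs[0]), len(xs)
    w = [0.0] * p
    for _ in range(steps):
        grad = [lam * wj for wj in w]
        for x, y in zip(xs, ys):
            if y * dot(x, w) < 1.0:  # margin violated: hinge term is active
                for j in range(p):
                    grad[j] -= y * x[j] / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return normalize(w)


# Hypothetical data: Gaussian covariates, error scale depending on x.
rng = random.Random(0)
beta0 = normalize([1.0, -1.0, 0.5])
xs, ys = [], []
for _ in range(400):
    x = [rng.gauss(0.0, 1.0) for _ in range(3)]
    eps = rng.gauss(0.0, 1.0) * (0.5 + abs(x[0]))
    xs.append(x)
    ys.append(1.0 if dot(x, beta0) + eps >= 0 else -1.0)

beta_logit = fit_logistic(xs, ys)
beta_svm = fit_hinge(xs, ys)
print(dot(beta_logit, beta0), dot(beta_svm, beta0))
```

The printed inner products are the cosines of the angles between each fitted direction and the true one. Surrogate minimizers need not coincide with the maximum score estimator, so agreement in direction should be read as an empirical observation on this toy design only.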
Section 5 presents a brief discussion of certain aspects of our work including certain natural extensions, some of which are elaborated on in the supplement, as well as future challenges of this direction of research. Section B presents the proofs of two key results while the remaining proofs are relegated to the supplement in the interests of space.
2 Asymptotic properties and minimax bounds
We now present concentration and rate of convergence results for the maximum score estimator in the binary response model in growing dimensions. To that end, we start with some assumptions on the distribution of $X$ and the behavior of $\eta(x) = P(Y = 1 \mid X = x)$ near the Bayes hyperplane $\{x : x^\top\beta^0 = 0\}$, which play a central role in the subsequent development. To control the behavior of $\eta$, we introduce a version of Tsybakov’s low noise assumption (Mammen et al. [1999], Tsybakov et al. [2004]) which has been used extensively in the classification literature. For the convenience of the reader, we first state the regular low noise condition below.
**Assumption 2.1 (Soft margin assumption).**
Let $P$ denote the joint distribution of $(X, Y)$ in dimension $p$. Then, with $\eta(x) = P(Y = 1 \mid X = x)$,
$$P\left(\left|\eta(X) - \tfrac{1}{2}\right| \le t\right) \le C\,t^{\alpha} \quad \text{for all } 0 \le t \le t^*,$$
for some constants $C > 0$, $\alpha \ge 0$ and $0 < t^* < 1/2$.
The soft margin condition quantifies how the conditional class probability $\eta(x)$ deviates from $1/2$ near the Bayes hyperplane in terms of a smoothness parameter $\alpha$. Larger values of $\alpha$ translate to sharper changes of $\eta(x)$ around the Bayes hyperplane and correspond to easier classification problems. For reasons to be explained below, we do not work with the above condition but with a slightly tuned version of it:
**Assumption 2.2 (Transition condition).**
Let denote the joint distribution of in dimension . Then, with ,
[TABLE]
where is a bounded sequence of constants, and lies strictly between 0 and .
**Discussion of Assumption 2.2: ** To understand the effect of , consider the special case when , a fixed constant. Then, the modified condition is just the low noise condition with smoothness parameter . Next, consider a situation where decreases to 0 with (we view as a function of ). In this case, the transition of from below on one side of the hyperplane to above on the other side is sharper compared to the fixed case, since the probability mass assigned by the covariate distribution to the region where is close to is of a smaller order than with fixed . This translates to an easier estimation problem as grows, and a corresponding improved rate of estimation: the smaller the order of , the faster the rate. In fact, , corresponds to a jump around the Bayes’ hyperplane and a best possible rate of order in fixed dimension. On the other hand, when is large, is substantially larger, which implies the presence of a fair amount of fuzziness near the Bayes’ hyperplane – there is now a substantial mass of points around the hyperplane with values very close to which are hard to classify – resulting in a slower rate of estimation.
The transition condition captures the intrinsic difficulty of the estimation problem in terms of the sequence of constants whereas the low-noise condition describes it in terms of the exponent of , with larger values of corresponding to easier estimation problems (enhanced convergence rates for larger ). Both formulations therefore capture the same phenomenon, albeit in somewhat different manner. Note that, the low noise assumption was originally formulated (Mammen et al. [1999]) to deal with irregular boundaries, whereas, our condition is more naturally tuned to smooth hyperplane boundaries in discrete choice model. Our reason for favoring the modified low noise condition is that it is much more intuitive and allows a clean and integrated presentation of the minimax rates of convergence in this problem in terms of , which does not appear to be the case with the low noise assumption. For a slightly different treatment of this problem under Assumption 2.1, see a previous draft of this manuscript Mukherjee et al. [2019].
We now show that the case in our transition condition arises naturally for a rich family of distributions under some natural assumptions. Observe that, the family of distributions with margin condition involving is a sub-class of the family of distributions with . Assume, for example, that (a) and the density of , say , does not depend on ; (b) on for some ; (c) the density of is bounded by a positive number on for some , with not depending on . Then, for , where :
[TABLE]
which is the condition corresponding .
Using an inverse-Lipschitz type condition, one can also let depend on . Suppose that the conditional distribution of given satisfies:
[TABLE]
for some independent of , for almost surely . This holds, for example, if for almost all , the conditional density on a fixed neighborhood around [math], with not depending on . The transition condition is now satisfied for fixed under the same condition on the density of as before. An example of the dependence requirement of on is .
Our next assumption regarding the marginal distribution of is that the probability of the wedge shaped region between the true hyperplane and any other hyperplane under the distribution of is related to the angle between the corresponding normal vectors.
**Assumption 2.3 (Distribution assumption on covariates).**
The distribution of satisfies the following condition:
[TABLE]
for all , where the constant , does not depend on .
**Discussion of Assumption 2.3: **The above assumption plays a critical role in this paper, relating the underlying geometry in the problem to the probability distribution of the covariates. It is used, for example, in the below proposition, to relate the curvature of the population score function around its maximizer to the angle between and a generic unit vector . The magnitude of the curvature plays a pivotal role in deriving the rate of convergence of Manski’s estimator in both the slow and fast growth regimes (Theorems 2.6 and 2.14 respectively), where upper tail probabilities for are related to upper tail probabilities for (which is also the difference in the population risks at these two vectors) via Assumption 2.2. In that respect, this assumption can be viewed as an analogue of the compatibility or restricted eigenvalue condition in the classical high-dimensional linear regression problem, which helps convert bounds on the prediction error of the Lasso estimator to its estimation error. In this context, it is interesting to consider a specific violation of the assumption: namely, when for all sufficiently close to in angular distance. In this case, if for example is supported on a compact domain, it is not difficult to see that one can perturb the Bayes hyperplane by small rotations, but as the corresponding wedges will not have any mass under , there are no points available in such regions, and the Bayes hyperplane cannot be even uniquely identified. Examples of families of distributions (e.g. elliptically symmetric ) that satisfy Assumption 2.3 are available in Section 5.
**Proposition 2.4.**
Under Assumptions 2.2 and 2.3, the curvature of the population score function around the truth satisfies:
[TABLE]
for all , where and are same constants defined in Assumption 2.2.
The proof of this proposition relies on relating to via Assumption 2.2, and the latter to , via Assumption 2.3. One takeaway from the proposition is that the excess risk is lower bounded by a dichotomous distance in terms of and . For close to in the sense that is small relative to , we have a quadratic curvature whose sharpness is determined by the magnitude of , while for away from , the curvature is linear. As we will see below, the dichotomous nature of the distance imposes a natural lower bound on the estimation error of the maximum score estimator, irrespective of how small is.
**Remark 2.5.**
Note that when fixed and assuming without loss of generality , we conclude:
[TABLE]
for all . This same condition can be achieved using Assumption 2.1 with .
2.1 Rate of convergence when $p/n \to 0$
We first establish a rate of convergence for .
**Theorem 2.6.**
Let $\hat\beta$ and $\beta^0$ be maximizers of the sample and population score functions respectively. Then under Assumptions 2.2 and 2.3, for some constant (not depending on ):
[TABLE]
for all , where:
[TABLE]
This implies that,
[TABLE]
where is some constant which depends on the model constants introduced in the assumptions and some other universal constants. Note that the supremum in the above display is taken over all distributions corresponding to binary response models satisfying Assumptions 2.2 and 2.3 for some regression parameter (viewed as a functional of ) but with held fixed.
**Remark 2.7.**
Note that our rate of convergence depends on three parameters . To understand the implications of the obtained expression for the rate, assume initially that is a constant (statistically of primary interest as discussed in Section 1), assumed without loss of generality to be 1. Then the value of reduces to:
[TABLE]
and
[TABLE]
Hence, up to a log factor, we recover the analogue of the cube-root rate for growing dimension.
One may wonder what is the best possible rate that can be obtained from the above expression. An inspection of the rate expression immediately implies that we cannot improve upon , the high dimensional rate analogue of change-point estimation. Some more insight can be gleaned by ignoring the log-factor in the rate expression. In that case,
[TABLE]
and the rate of convergence based on this approximation is given by
[TABLE]
which is shown to be the minimax optimal in Theorem 2.11. The above equality follows from the observation that, if , then is minimum among the four terms, while is the minimum otherwise. This indicates that the rate of convergence improves with decreasing , but only up to modulo a log factor.
Alternatively, one can study the exact expression for the rate by taking special but natural choices for in terms of . Concretely, let and for and . Note that is of order larger than when , in which case some simple algebra shows the rate of convergence to be . On the other hand when , i.e. is of the same or lower order than , the rate of convergence becomes .
The proof of Theorem 2.6 relies on a concentration inequality (Theorem 2 from Massart et al. [2006]) to obtain a bound on the excess risk , which, along with Assumption 2.3, yields a concentration bound on . A natural question that arises here is whether the logarithm in the above rate, which arises from the effect of growing dimension on the shattering numbers of the linear classifiers involved, can be dispensed with. While it is unclear whether the exact rate is achievable, we demonstrate, in what follows, that for , it is possible to construct an estimator whose rate of convergence is under the following additional assumption.
**Assumption 2.8.**
We impose some further constraints on the distribution of and the population score function:
1. The distribution of $X$ satisfies
[TABLE]
for all , where the constant does not depend on and .
2. For some small ,
[TABLE]
for all , where the constant does not depend on and .
The construction of the estimator can be briefly described as follows: Generate (enough) points randomly on the surface of the unit sphere, such that with high probability some of the generated points are in a sufficiently small neighborhood of . Then, maximize the empirical score function on the generated points. We show in the following theorem that this empirical maximizer converges to the truth at rate :
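The recipe above can be sketched as follows. Uniform points on the unit sphere are obtained by normalizing standard Gaussian vectors, and the empirical score is then maximized over these candidates only. The number of candidates, the function names and the toy data are assumptions for illustration; the technical details that make the construction rate optimal are in the proof of the theorem and are not reproduced here.

```python
import math
import random


def sample_score(beta, xs, ys):
    """Empirical score S_n(beta) = (1/n) * sum_i y_i * sgn(x_i . beta)."""
    total = 0.0
    for x, y in zip(xs, ys):
        d = sum(a * b for a, b in zip(x, beta))
        total += y if d >= 0 else -y
    return total / len(xs)


def random_unit_vector(p, rng):
    """Uniform draw on the unit sphere: normalize a standard Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(p)]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g]


def sphere_search(xs, ys, n_candidates, seed=0):
    """Maximize the empirical score over randomly generated unit vectors."""
    rng = random.Random(seed)
    p = len(xs[0])
    best_beta, best_val = None, -math.inf
    for _ in range(n_candidates):
        beta = random_unit_vector(p, rng)
        val = sample_score(beta, xs, ys)
        if val > best_val:
            best_beta, best_val = beta, val
    return best_beta, best_val


# Toy noiseless data in dimension 3 with true direction (1, 0, 0).
data_rng = random.Random(1)
xs = [[data_rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(200)]
ys = [1.0 if x[0] >= 0 else -1.0 for x in xs]

coarse_beta, coarse_val = sphere_search(xs, ys, n_candidates=50)
fine_beta, fine_val = sphere_search(xs, ys, n_candidates=500)
print(coarse_val, fine_val)
```

With the same seed, the first 50 candidates of the finer search are exactly those of the coarser one, so the attained score can only improve as candidates are added. The number of points needed to land near the truth grows quickly with the dimension, which is why this kind of search is a theoretical device rather than a practical algorithm in large dimensions.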
**Theorem 2.9.**
Suppose the margin condition (Assumption 2.2) is satisfied for fixed, and that Assumptions 2.3 and 2.8 hold. Then, there exists an estimator , which can be constructed by the above recipe [with technical details of the construction available in the proof], such that
[TABLE]
**Remark 2.10.**
Assumption 2.8 (2) as well as the construction of the grid estimator take into account the fact that the margin parameter is fixed. In that sense, the new estimator is not adaptive, whereas the maximum score estimator is agnostic to the value of the margin parameter. We believe that the log factor in the convergence rate is the price paid for adaptivity. For more insight into Assumption 2.8, see Section 5.
Finally, we show that the generic minimax lower bound for this estimation problem (i.e the ’s are not restricted to be constant) is i.e. we cannot estimate the linear discriminator at a better rate without more assumptions:
**Theorem 2.11 (Minimax lower bound).**
We have :
[TABLE]
for some constant that does not depend on . For fixed, the lower bound is of the order . The supremum is taken over the same class of distributions as in Theorem 2.6.
Remark 2.12.
The proof of the above result relies on constructing competing models from the collection of distributions that approach each other at the optimal rate, . The core challenge lies in constructing these alternative models with sufficient care, and then invoking Assouad’s lemma (e.g. see Chapter 2 of Tsybakov [2009]) to establish the rate. The same minimax rate is true for the smaller class of distributions formed by intersecting with the class of distributions satisfying Assumption 2.8 for some positive constants , since the local alternatives constructed in the proof satisfy this assumption as well. Therefore, the grid estimator is minimax rate optimal for this smaller class of distributions.
2.2 Rate of convergence when
We now turn to the case where , the dimension of the covariate vector, is larger than . In this case, meaningful estimation and inference is only possible under structural assumptions on that regulate its complexity relative to the size of the data, and any meaningful estimation procedure needs to incorporate this constraint. Usually, such structural assumptions are handled by imposing a penalty on the underlying loss function. The most natural structural constraint on a regression type parameter is one of sparsity, i.e. only a small subset of the co-ordinates of influence the response (i.e. are different from zero). In the high-dimensional linear regression or GLM framework, the natural loss function is convex and the standard approach is to penalize the (convex) $\ell_1$ norm of the parameter, which gives rise to a clean convex optimization problem with a well-characterized solution (see Tibshirani [1996], Greenshtein et al. [2004], Van de Geer et al. [2014], Bühlmann and Van De Geer [2011], Bickel et al. [2009], Miolane and Montanari [2018] and references therein). The corresponding optimizers are seen to have desirable statistical properties, e.g. consistency in various norms, minimax convergence rates and so forth. Furthermore, $\ell_1$ penalization is a natural convex relaxation of $\ell_0$ penalization, which is the most direct approach to the sparsity constraint. Another key feature of high dimensional inference is model selection. Under the sparsity constraint, most variables are inactive and a good model selection algorithm needs to include the active set with high probability but relatively few inactive variables. Though model selection/feature selection in the high-dimensional linear regression model has been studied extensively over the past two decades (e.g. see Zhao and Yu [2006], Huang et al. [2008], Wei and Huang [2010], Yuan and Lin [2006], Zhang et al. [2008] and references therein), the problem remains relatively unaddressed in the classification set-up.
Be that as it may, the optimization problem that produces the maximum score estimator is not only non-differentiable and non-convex, it is actually discontinuous, and therefore adding a convex penalty like $\ell_1$ affords no computational advantage. While one possible route is to use an $\ell_1$ penalized version of a kernel smoothed loss function (following the line of work of Horowitz [1992]), an approach that has recently been adapted by Feng et al. [2019] in related problems, our goal in this section is to understand the behavior of the primal non-regular score estimator in high dimensions under minimal assumptions. In what follows, we therefore penalize the score function in a way that is amenable to a proper analysis and produces a sparse estimator with near-optimal rate and a desirable screening property. A smoothed estimator can possibly yield a better convergence rate along with computational benefits but will require substantially stronger assumptions on the model. Recall, for example, that the smoothed score estimator in the fixed setting as studied in Horowitz [1992] does converge at a faster than rate to a Gaussian limit, but the model assumptions required to make this work are significantly stronger than Manski’s original assumptions as well as those of Kim and Pollard (Kim et al. [1990]).
In what follows, we use the structural risk minimization (SRM) approach introduced in Vapnik and Chervonenkis [1974] for variable selection and estimation in this regime, which is closely related to $\ell_0$-penalized risk minimization or the best subset selection problem. Briefly speaking, the SRM approach consists of the following steps:
1. Start with a large class of functions over which the loss function will be minimized.
2. Divide this class into nested subsets of increasing complexity, and find the empirical risk minimizer for each of these subsets.
3. Add a penalty (here denoted by pen) based on the complexity of the subclass to the minimum empirical risk for that subclass and return the classifier (and its corresponding subclass) with minimum penalized empirical risk.
The first step generally ensures that there is no bias (or very low bias) in the estimation problem. If one starts with a large function class, it is more likely that the population minimizer will be close (if not identical) to the minimizer within the selected class. But though bias can be largely eliminated in this manner, the process of searching over a large function class incurs high variability and can lead to pessimistic convergence rates. Therefore, one needs to optimize the bias-variance trade-off, which happens over steps two and three. In step two, nested subsets are considered, hence the minimum value of the empirical risk keeps decreasing as the nesting (complexity) increases. The role of the penalty function is to stabilize the bias-variance trade-off and strike a balance between risk minimization and complexity. The nature of the penalty is typically related to the complexity of the class of functions (complex classes are penalized at higher levels) as well as to the structure of the problem. For parametrically specified classes, one may use the or a more general norm of the parameter, or variants (e.g. Mallows' $C_p$, AIC, BIC) as a notion of complexity, or may resort to other notions like VC dimension (see e.g. Chapter 8 of Massart and the references therein).
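The three SRM steps can be illustrated with a small brute-force sketch for linear classifiers. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names `empirical_risk` and `srm_select`, the random search used as a stand-in for exact empirical risk minimization, the labels coded as ±1, and the penalty `pen_const * k * log(p) / n`, which only mimics a complexity term proportional to the sparsity level. It is feasible only for very small dimension.

```python
import itertools
import numpy as np

def empirical_risk(beta, X, y):
    """Fraction of points misclassified by the rule sign(x' beta); y in {-1, +1}."""
    pred = np.where(X @ beta >= 0, 1, -1)
    return float(np.mean(pred != y))

def srm_select(X, y, max_sparsity, n_candidates=200, pen_const=0.5, seed=0):
    """Return (penalized risk, sparsity, beta) minimizing risk + penalty.

    Hypothetical sketch: the classes are indexed by the sparsity level k,
    and ERM within each class is approximated by random search over unit
    vectors supported on each size-k subset of coordinates.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    best = None
    for k in range(1, max_sparsity + 1):
        # Step 2: approximate empirical risk minimizer among k-sparse directions.
        risk_k, beta_k = np.inf, None
        for support in itertools.combinations(range(p), k):
            cand = rng.standard_normal((n_candidates, k))
            cand /= np.linalg.norm(cand, axis=1, keepdims=True)
            for c in cand:
                beta = np.zeros(p)
                beta[list(support)] = c
                r = empirical_risk(beta, X, y)
                if r < risk_k:
                    risk_k, beta_k = r, beta
        # Step 3: add a complexity penalty (assumed form) and keep the best class.
        pen = pen_const * k * np.log(p) / n
        if best is None or risk_k + pen < best[0]:
            best = (risk_k + pen, k, beta_k)
    return best
```

In this toy version the penalty grows linearly in `k`, so a larger model is selected only if it lowers the empirical risk by more than the extra penalty, which is the balance between risk and complexity described above.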
We now describe the details of the implementation of our SRM based method. We start by articulating our assumption on the sparsity of the Bayes’ hyperplane:
Assumption 2.13 (Sparsity Assumption).
There exists with , where depends on in such a way that as .
Under the above assumption, it is reasonable to search among all models with sparsity (by which we mean the number of active coefficients) bounded by for some universal constant . For mathematical simplicity, we take . Let be the collection of all models with sparsity bounded by for , i.e.:
[TABLE]
Define to be the collection of all admissible models, i.e.
[TABLE]
Also define:
2. 2.
3. 3.
4. 4.
VC dimension of the collection . This is of the order .
By the SRM principle, the best possible estimate is given by . For the model collection , we use the penalty
[TABLE]
where is some absolute constant. Up to a (negligible) logarithmic term, the penalty function is proportional to , the VC dimension of the model , which captures the richness of this collection. The following theorem provides a finite sample concentration bound of our estimator:
Theorem 2.14.
Let and denote the penalized empirical minimizer and population minimizer of the binary choice model respectively. Then under Assumptions 2.2, 2.3 and 2.13, there exist constants (which are independent of ) such that for all :
[TABLE]
where,
[TABLE]
and is a specific sequence of constants going down to 0 (with details available in the proof).
As a consequence of the exponential tail bound, one can establish the following upper bound on the minimax risk:
[TABLE]
where the supremum in the above display is taken over all distributions corresponding to binary response models satisfying Assumptions 2.2, 2.3 and 2.13 with some regression parameter (viewed as a functional of ) with norm bounded below by .
Remark 2.15.
A discussion similar to Remark 2.7 is in order. Here, the rate of convergence depends on four parameters . Assuming , the value of becomes:
[TABLE]
and recalling that ,
[TABLE]
where the last asymptotic equivalence, while not always true, nonetheless holds in most common scenarios. As an example, if we take for some with , the equivalence is valid. The condition is forced by Assumption 2.13.
It is immediate that the rate of convergence of our estimator cannot be faster than . As before, one can gain useful insights by ignoring the log-factors in the rate expression. Thus,
[TABLE]
and hence
[TABLE]
As in the slowly growing regime, this rate is also shown to be minimax optimal in Theorem 2.18. The last equality follows from the fact that, if , then is the minimum among the four terms, otherwise is the minimum. This implies that the rate can be made faster by decreasing the value of , but cannot be improved upon (up to log factors).
As an immediate corollary of the above theorem, we establish that a superset of the true model will be selected with high probability under an appropriate beta-min condition:
Corollary 2.16.
Suppose the minimum non-zero absolute value of satisfies the following bound:
[TABLE]
with as in Theorem 2.14. Then under Assumptions 2.2, 2.3 and 2.13, we have:
[TABLE]
where is the true active set. This probability goes to 1 exponentially fast in .
Remark 2.17.
Note that for fixed, the lower bound on is proportional to , which is the same as the rate of convergence of (see Theorem 2.14). This should be compared to the condition derived from the convergence analysis in high dimensional linear regression which is . The slower convergence rate in this problem requires a more pronounced separation of the active coefficients of from the inactive ones in comparison to standard linear regression, to guarantee the screening property.
Our next result provides a lower bound on the minimax error rate.
Theorem 2.18.
We present our minimax lower bound result for the fast growth regime :
[TABLE]
for some constant not depending on . For the case fixed, the lower bound is of the order of . The supremum is taken over the same class of distributions as in Theorem 2.14.
Remark 2.19.
As with the minimax lower bound proof in the moderate growth case, the proof of this theorem also relies on the construction of a sequence of competing models that approach one another, along with Fano’s inequality (Chapter 2 of Tsybakov [2009]). Similar to the moderate growth regime, by comparing the lower bound above to the rate of the score estimator in Remark 2.15, we find that the former is better only by a logarithmic factor, which suggests that the penalized maximum score estimator is almost minimax optimal.
3 Multinomial discrete choice model
The score function for the multinomial discrete choice model, as explained in Section 1, is given by:
[TABLE]
and the corresponding population version is given by:
[TABLE]
where and denotes the number of choices.
In what follows, and should be viewed as growing as functions of . The following proposition establishes that is indeed the unique maximizer of the population score function:
Proposition 3.1.
Under Assumption 1.1 we have, , and that the maximizer is unique.
Let denote a maximizer of . This is a slight abuse of notation as is also used to indicate the ERM estimator for the binary counterpart of this model. In this section, will unambiguously denote the ERM estimator for the multinomial choice model. We state a set of further assumptions (which should be viewed as natural extensions of our assumptions for the binary case) to facilitate the asymptotic analysis of .
Assumption 3.2 (Transition condition).
The multinomial choice model satisfies the modified transition condition uniformly for all pairs , i.e. there exist constants (not depending on ) such that for every :
[TABLE]
for all . We assume for mathematical simplicity.
Assumption 3.3 (Restricted wedge assumption).
There exist constants and such that:
1. for all for all , where denotes the Frobenius norm. Here the constant depends on , while the radius of choice does not depend on the specific utility, but may or may not depend on .
2. For all pairs , where does not depend on .
3. The effect of radius is asymptotically non-vanishing, i.e. for all pairs :
[TABLE]
and the constant does not depend on .
Remark 3.4.
Assumption 3.2 should be viewed as the multinomial version of Assumption 2.2. It quantifies the probability mass of the covariate space where the magnitude of the difference between and is small relative to their sum, in terms of a generic threshold . This is easily seen by noting that . The smaller this quantity, the harder it is to differentiate between utilities and . We note that for the multinomial problem we confine ourselves to a fixed (as opposed to a general sequence ) in our low-noise assumption which allows a cleaner and less cumbersome presentation of our results. As in the binary case, the fixed assumption is statistically the most interesting version. The proof for a general that goes to 0 would work similarly as for fixed , except for the fact that we would now need to keep explicit track of the throughout the steps of the proof.
Remark 3.5.
It is clear that Assumption 3.3 is in a similar vein to Assumption 2.3 for the binary response model, albeit somewhat more involved owing to the multinomial structure. Part (1) of Assumption 3.3 postulates a ball of radius around the origin in where the probability of choosing any specific utility given is bounded away from 0: i.e., every alternative can be chosen with non-negligible probability, or in other words, all utilities are competitive. Part (2) of the assumption resembles Assumption 2.3 exactly, modulo the fact that we are now interested in the wedge-shaped region within the ball of radius . This is because Part (1) of the assumption restricts the main action to the ball of radius where the probability of choosing any item is non-negligible. Part (3) of the assumption ensures that the probability of the wedge-shaped region intersected with a ball of radius is not negligible with respect to the probability of the entire wedge-shaped region. In other words, the region of primary action is non-ignorable with respect to the entire region. This assumption helps us establish an upper bound on the variability of the empirical process relevant to our analysis of the concentration bounds for the estimator.
Remark 3.6.
The results presented below in this section can also be derived by taking in Part (1) of Assumption 3.3. In this case, Part (1) becomes stronger as we now assume the lower bound on the conditional probabilities of choosing utilities for all . On the other hand, Part (2) of the assumption is weakened: if the lower bound in Part (2) holds for finite , it holds for . Part (3) of the assumption is trivially satisfied for with .
Remark 3.7.
The assumption that the conditional probability of each utility is bounded away from 0 (Part (1) of Assumption 3.3) can be easily relaxed. For example one may assume that for all for all without disturbing any of our calculations. Indeed, an inspection of the proof of Proposition 3.8 shows that what we crucially require to establish the curvature of is a lower bound on for , and this is obviously true under the relaxed assumption. For the binary choice model this weaker assumption is automatic: for all , so that we can clearly take and Assumption 3.3 boils down to Assumption 2.3.
The following Proposition (similar to Proposition 2.4) establishes a lower bound on the excess population risk:
Proposition 3.8.
Under Assumptions 1.1, 3.2 and 3.3, we have the following curvature condition for multinomial choice model:
[TABLE]
for all where are same constants as mentioned in Assumption 3.2 and 3.3.
The proof of the above proposition is conceptually similar to that of Proposition 2.4, as it relies on relating to the average of the probabilities of truncated wedge-shaped regions for all possible pairs of utilities. Note that for , this corresponds to a single wedge-shaped region. The average probability is then bounded using Part (2) of Assumption 3.3 to conclude the proof.
Theorem 3.9 (When ).
If , then under Assumptions 1.1, 3.2 and 3.3, we have:
[TABLE]
for all , where
[TABLE]
and is the same constant defined in Assumption 3.3, is some constant which does not depend on .
Remark 3.10.
It is instructive to relate this theorem with its counterpart for the binary choice model, Theorem 2.6. For the case , it is clear from Remark 3.7 that we can always take , and the rate of convergence becomes which is identical to that from Theorem 2.6 when is fixed (see Remark 2.7). The additional term in the current rate, viz. can be viewed as an adjustment for the number of utilities. Notice that itself depends non-trivially on : as , we need for Part 1 of Assumption 3.3 to make sense.
Next, we present our result for the fast growth regime, i.e. when . As before, we require a sparsity assumption for the identification of the model. Our following assumption encodes the rate at which we can allow the sparsity to grow for our asymptotic analysis:
Assumption 3.11 (Sparsity condition for Multinomial model).
Under the fast growth regime, i.e. when , we assume that there exists with which satisfies:
[TABLE]
as .
This assumption is identical in spirit to Assumption 2.13. The only difference is that now both and play a role in determining the permissible rate of sparsity of the true vector . As mentioned before, the factor appears due to pairwise comparison of utilities and the factor relates to the curvature condition established in Proposition 3.8.
Theorem 3.12 (When ).
Under Assumptions 1.1, 3.2 and 3.3, there exists a constant (not depending on ) such that for all :
[TABLE]
where,
[TABLE]
and as .
Remark 3.13.
Recall from Remark 3.10, when , one can always take . The rate of convergence then becomes (similar to the rate obtained in Theorem 2.14 when is fixed), which can be further simplified to under the specific choices of taken in Remark 2.15. As in the moderate growth regime, the additional term in above is the adjustment for the growing number of utilities.
4 Computational Aspects
In this section we investigate and compare the performance of several procedures for estimating in the binary choice model. Specifically, we consider the following three methods:
1. Logistic regression.
2. Support Vector Machine.
3. Homotopy path-following framework adapted in Feng et al. [2019].
We divide our simulation studies into two sections: the slowly growing regime, i.e. and the fast growing regime i.e. . The algorithm based on homotopy path-following framework described in Feng et al. [2019] is tailored to the scenario , hence we only compare SVM and logistic regression under the slowly growing regime, and all three methods (with the penalized versions of SVM (see Zhu et al. [2004]) and logistic regression) when . Our primary data generation mechanism is common to both regimes (with slight changes in the to accommodate sparsity considerations) and is described below:
1. Generation of the true : For the regime , each entry of is generated from the distribution and then normalized to make its norm 1. For the regime , each of the active entries is generated randomly from and then normalized to keep . This remains fixed over all Monte Carlo iterations.
2. Generate from where the dispersion matrix has the following form:
[TABLE]
where we take for our simulations.
3. We generate the covariate-dependent errors as follows: for ,
[TABLE]
where .
4. Finally, we set for all (or one can set depending on how one wants to encode the binary variable).
The idea behind this model is that the data close to the boundary are more informative than the data far away. Note also that when the variability of the error near the boundary is low, estimation is a relatively easy task. Hence, to challenge the existing methods, we assume non-negligible error variance near the boundary. For the simulation setting above, for a point near the boundary, i.e. , the (conditional) variance of the error is , and as one moves to points away from the boundary, the (conditional) variability of the response increases depending on their distance from the true hyperplane.
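A data-generating mechanism of this general shape can be sketched as follows. The specific choices are placeholders and not the settings used in the simulations above, whose distributions and constants are not reproduced here: standard normal entries for the true direction, an AR(1)-type dispersion matrix with entries `rho ** |i - j|`, Gaussian errors with conditional standard deviation `1 + |x' beta0|` (so the error has conditional median zero and variance that is non-negligible at the boundary and grows away from it), and responses coded as ±1. The function name `generate_binary_choice_data` is likewise hypothetical.

```python
import numpy as np

def generate_binary_choice_data(n, p, rho=0.5, seed=0):
    """Simulate (X, y, beta0) from a binary choice model with heteroscedastic errors.

    All distributional choices are illustrative assumptions, not the paper's.
    """
    rng = np.random.default_rng(seed)

    # Step 1: true direction, normalized to unit Euclidean norm.
    beta0 = rng.standard_normal(p)
    beta0 /= np.linalg.norm(beta0)

    # Step 2: Gaussian covariates with an assumed AR(1)-type dispersion matrix.
    idx = np.arange(p)
    sigma = rho ** np.abs(idx[:, None] - idx[None, :])
    X = rng.multivariate_normal(np.zeros(p), sigma, size=n)

    # Step 3: covariate-dependent errors with conditional median zero.
    signal = X @ beta0
    eps = rng.standard_normal(n) * (1.0 + np.abs(signal))

    # Step 4: binary response, coded here as +1 / -1.
    y = np.where(signal + eps >= 0, 1, -1)
    return X, y, beta0
```

Because the error is symmetric around zero given the covariates, the sign of `X @ beta0` remains the Bayes rule under these assumed settings, even though its accuracy is well below one.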
4.1 Estimation error
We explore three different growth patterns of relative to :
[TABLE]
where is the floor function. The sample size ranges as: .
Consider first the performance of SVM on the generated data. Below are three density plots of scaled estimation error based on 500 Monte Carlo iterations:
As is evident from the density plots above, the distribution of the normalized errors is quite stable across the different values of , suggesting that the SVM method gives a decent approximation to the actual score estimator. We next apply simple logistic regression for estimating . As we expect, this does not perform as well as SVM owing to model mis-specification. We study logistic regression because it is typically the bread and butter option for dealing with binary response regression, but, as we see below, it is quite suspect in this situation. Below are the plots from logistic regression:
It is quite clear from the plots that the scaled error is not converging with and behaves in a rather erratic manner, and the SVM algorithm is markedly superior. Further investigation into SVM based algorithms in this and similar models can constitute a potentially interesting topic for future research.
4.2 Model selection and estimation when
As mentioned in the Introduction, our problem can be viewed as a binary classification problem with a linear Bayes’ classifier. Under the sparsity assumption, only a few covariates contribute to the classification. To identify these covariates, we resort to a penalized classification approach. We here employ three methods:
1. $\ell_1$ penalized SVM.
2. $\ell_1$ penalized logistic regression.
3. Homotopy path-following framework adapted in Feng et al. [2019].
Recall that the data generating mechanism has already been described. We take for our simulations and use five different values of . Our goal is to investigate how the performance of the classifier changes as we increase keeping and fixed. The penalty parameters for logistic regression and SVM are selected using grid search and two-fold cross-validation. We also implement the algorithm based on homotopy path-following framework adapted in Feng et al. [2019]. We assess the performances of these three approaches based on the following discrepancy measures:
1. Misclassification error.
2. Norm difference between and .
3. No. of true active variables not selected (denoted by Type 2 error).
4. No. of true null variables selected (denoted by Type 1 error).
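The four discrepancy measures listed above can be computed along the following lines. This is a minimal sketch under stated assumptions: the function name `selection_metrics` is hypothetical, labels are coded as ±1, a coefficient counts as selected when its absolute value exceeds a small tolerance, and the norm difference is taken to be the Euclidean distance after rescaling both vectors to unit norm.

```python
import numpy as np

def selection_metrics(beta_hat, beta_true, X_test, y_test, tol=1e-8):
    """Misclassification error, norm difference, and Type 1 / Type 2 counts."""
    beta_hat = np.asarray(beta_hat, dtype=float)
    beta_true = np.asarray(beta_true, dtype=float)

    # Supports of the estimated and true coefficient vectors.
    active_hat = np.abs(beta_hat) > tol
    active_true = np.abs(beta_true) > tol

    # Misclassification error of the rule sign(x' beta_hat) on held-out data.
    pred = np.where(X_test @ beta_hat >= 0, 1, -1)
    misclassification = float(np.mean(pred != y_test))

    # Euclidean distance between unit-normalized directions (assumed convention).
    def unit(v):
        norm = np.linalg.norm(v)
        return v / norm if norm > 0 else v

    norm_difference = float(np.linalg.norm(unit(beta_hat) - unit(beta_true)))

    return {
        "misclassification": misclassification,
        "norm_difference": norm_difference,
        "type2": int(np.sum(active_true & ~active_hat)),  # active but missed
        "type1": int(np.sum(~active_true & active_hat)),  # null but selected
    }
```

Summing the `type1` and `type2` entries gives an equally weighted selection error; dividing each count by the number of null or active variables would turn them into proportions instead.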
The following plots provide a visual representation of the comparisons for different values of sparsity and across the three methods.
It is clear that logistic regression is generally outperformed by some other method in this model. The performance of SVM and the algorithm proposed in Feng et al. [2019] are generally on par, though by visual inspection the latter seems superior. For example, in terms of misclassification error and norm difference, their performance is similar; in some cases the algorithm of Feng et al. [2019] performs better than SVM, while in other cases SVM wins marginally. Type 2 error (proportion of true active variables missed by the method) is generally lower for the algorithm in Feng et al. [2019] when compared to SVM, whilst Type 1 error (proportion of null variables declared active) is generally higher: SVM is more conservative in terms of selecting variables.
Depending on one’s priorities, one may weigh Type 1 and Type 2 errors differently to generate a weighted misclassification error. In the absence of any such information, it is natural to assign equal weights, which leads to the sum of these two errors, as shown in the following plot:
We find that the algorithm proposed in Feng et al. [2019] performs better under this metric, especially for large , which is explained by the tendency of SVM to select too few active variables.
Based on our study, it appears that one is better off with the algorithm proposed in Feng et al. [2019] in the scenario, though, of course much larger scale simulations would be necessary to make any general recommendations. As the focus of our paper is largely theoretical, we do not develop these studies any further but note that a thorough investigation of computationally feasible methods in this and related problems involving optimization of discontinuous functions along with analytical assessments of their performance constitutes an open direction of research.
5 Concluding Discussion
We close with a discussion of various aspects of the high-dimensional binary choice model and our approach to the problem.
5.1 Exploring and relaxing our assumptions:
It is of interest to investigate sufficient conditions under which Assumption 2.3 and Assumption 2.8 hold. We show in Lemma B.1 in the supplement that these two assumptions hold simultaneously when arises from an elliptically symmetric distribution centered at 0, under some restrictions on the minimum and maximum eigenvalues of its orientation matrix. Assumption 2.8 also holds for elliptically symmetric distributions centered at 0 but under some further mild conditions, as demonstrated in Lemma B.2.
5.2 Model with intercept:
Our treatment thus far has considered a model of the form for a random . However, many practical scenarios necessitate the inclusion of an intercept term where the term is replaced by . Assumption 2.3 then naturally generalizes to
[TABLE]
However, we cannot expect this to be satisfied for all with when varies in an unconstrained manner. Consider for example the case that so that and are both standard normal. In this case, when and are very large, the signs of and are primarily driven by the magnitudes of and , so that if these two parameters have the same sign, the probability of the signs being different can be made as small as one pleases depending on the magnitudes of the ’s. This entails controlling the magnitudes of the ’s relative to the ’s; in particular, if the absolute magnitudes of the ’s are kept bounded away from , a restricted version of Assumption 2.3, in the sense that the inequality in Assumption 2.3 is fulfilled for all sufficiently close to , is, indeed, verifiable for certain families of distributions including elliptically symmetric distributions centered at the origin, as well as ’s with independent components where each component has a symmetric log-concave density with mode at 0. The convergence and minimax lower bound results established in this paper still continue to hold, but to accommodate the restricted version of this assumption, the proofs presented in the supplement need to be slightly modified. An elaborate and rigorous discussion of such models with intercept is available in Section C of the supplement.
5.3 Asymptotic distribution
In their seminal paper, Kim and Pollard (Kim et al. [1990]) proved that for fixed , converges in distribution to the maximizer of a Gaussian process with quadratic drift. Our treatment of the binary choice model should be contrasted with their approach: while they assumed the continuous differentiability of both the density of and and a compact support for , we have made no such assumptions. We have tackled those aspects of this problem from the classification point of view, with assumptions on the growth of near the Bayes hyperplane and in addition, conditions on the distribution of to ensure that sufficiently many observations are available around the Bayes hyperplane. As far as the asymptotic distribution of the score estimator in growing dimensions (or functionals thereof) is concerned, this is, in itself, a mathematically formidable problem, well outside the scope of this paper. Based on what we know in the fixed setting, the forms of such distributions are likely to be extremely complicated. The question remains whether tractable asymptotic distributions for making inference on components of in the growing setting could be obtained for smoothed versions of the score estimator, in the spirit of Horowitz [1992]. This is likely to be an interesting but challenging avenue for future research on this subject.
6 Selected Proofs
6.1 Proof of Theorem 2.9
We generate points uniformly from the surface of the sphere (where will be chosen later), maximize the empirical score function over these selected points and show that the maximizer achieves the desired rate. Define and to be the collection of points generated uniformly.
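The recipe in the preceding paragraph can be sketched numerically. The sketch rests on assumptions that are not spelled out in this section: the empirical score is taken in the familiar Manski form, the average of (Y_i - 1/2) sign(X_i' beta) with Y_i in {0, 1}; uniform points on the sphere are obtained by normalizing standard Gaussian vectors; and the function names `empirical_score` and `grid_max_score` as well as the number of generated points are placeholders rather than the choices made in the proof.

```python
import numpy as np

def empirical_score(betas, X, y):
    """Empirical score of each row of `betas`; y takes values in {0, 1}."""
    signs = np.where(X @ betas.T >= 0, 1.0, -1.0)      # shape (n, n_points)
    return ((y - 0.5)[:, None] * signs).mean(axis=0)   # shape (n_points,)

def grid_max_score(X, y, n_points, seed=0):
    """Maximize the empirical score over randomly generated unit vectors."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    # A normalized standard Gaussian vector is uniform on the unit sphere.
    grid = rng.standard_normal((n_points, p))
    grid /= np.linalg.norm(grid, axis=1, keepdims=True)
    scores = empirical_score(grid, X, y)
    return grid[np.argmax(scores)]
```

The number of random points needed for some of them to land close to the truth grows quickly with the dimension, so this sketch is only practical for small `p`.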
We start with the following technical lemma that plays a key role in the proof.
Lemma 6.1.
Suppose denotes a spherical cap around of radius , i.e.
[TABLE]
Then we have
[TABLE]
for and , where is the uniform measure on the sphere, i.e. the proportion of the surface of the spherical cap to the surface area of the sphere.
For a brief discussion of this lemma, see Section B.11. The next lemma shows that we can find at least one point in our collection which is within a distance of of with probability .
Lemma 6.2.
Let denote the event that there exists at least one such that . Then .
Proof.
Using Lemma 6.1 we have the following bound:
[TABLE]
∎
Let denote the point closest to . On , . To establish the convergence rate, we will use a specific version of the shelling argument. Fix , sufficiently large. (In fact, as we work our way through the proof we will keep enhancing the value of as and when necessary, but as this will be done finitely many times, it won’t have a bearing on the rate of convergence.) Consider shells around the true parameter , where
[TABLE]
with , for and . We will compute an upper bound on the number of elements of for all .
Lemma 6.3.
For all ,
[TABLE]
with exponentially high probability where .
Proof.
Let denote the number of points in . Then where . For , . So . Hence we will only confine ourselves to the case . In this case, and hence from Lemma 6.1 we have . From the Chernoff tail bound for the Binomial distribution we have, for each : , where
[TABLE]
is the Kullback-Leibler divergence between Bernoulli and Bernoulli. This can be lower bounded thus:
[TABLE]
Using this lower bound we have:
[TABLE]
∎
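The Chernoff tail bound for the Binomial distribution used in the proof above can be checked numerically. The sketch below compares the exact upper tail of a Binomial(n, p) variable with the standard bound exp(-n KL(a || p)), valid for a > p, where KL denotes the Kullback-Leibler divergence between Bernoulli(a) and Bernoulli(p). The function names and the numerical values in the usage note are illustrative and unrelated to the constants in the proof.

```python
import math

def bernoulli_kl(a, p):
    """KL divergence between Bernoulli(a) and Bernoulli(p), for 0 < a, p < 1."""
    return a * math.log(a / p) + (1 - a) * math.log((1 - a) / (1 - p))

def binomial_upper_tail(n, p, k):
    """Exact probability that a Binomial(n, p) variable is at least k."""
    return sum(
        math.comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1)
    )

def chernoff_bound(n, p, a):
    """Chernoff bound exp(-n * KL(a || p)) on P(Bin(n, p) >= n * a), for a > p."""
    return math.exp(-n * bernoulli_kl(a, p))
```

For instance, `binomial_upper_tail(200, 0.1, 40)` can be compared against `chernoff_bound(200, 0.1, 0.2)`; the exact tail never exceeds the bound.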
Define for and let . The following lemma says that the event happens with high probability:
Lemma 6.4.
For any , as .
Proof.
It is enough to show that as . We have already established in Lemma 6.2 that as . Using Lemma 6.3:
[TABLE]
Now for any fixed , the maximum term obtains when i.e. which goes to 0 for . Furthermore, the series under consideration is easily dominated by for some constant , which is clearly finite. Hence the series on the right-hand side of the above display goes to 0 with increasing . ∎
The rest of the analysis will be done conditioning on the event . Define . Then we have:
[TABLE]
Since as , we omit this term henceforth. Next, we analyze a general summand. Define . Then and assumes values . Also, . Using Proposition 2.4 and Assumption 2.8 we have:
[TABLE]
for . This implies has high probability of being negative. We exploit this to prove the concentration. To simplify the calculations, define to be a collection of independent random variables with
[TABLE]
Hence the expectation of is:
[TABLE]
For the rest of the calculations we need to bound . Towards that direction we have the following:
[TABLE]
for . For the lower bound we have:
[TABLE]
when , where we are also using the fact that . Putting the upper bound in equation (6.1) we have:
[TABLE]
So, if (say) then which implies . Define . Then . Let denote the number of non-zero ’s. Then where .
[TABLE]
Thus we can take to ignore the effect of . Putting this back in equation (1) we get:
[TABLE]
for large enough (by using similar arguments to the one used for handling the earlier series) which proves the theorem.
6.2 Proof of Theorem 2.18
We use Fano’s inequality along with the Gilbert-Varshamov Lemma to prove the minimax lower bound. Fano’s inequality (or Local-Fano’s inequality) gives us a lower bound on the minimax risk as follows: If is a finite packing set, i.e. for any two , with , then, based on i.i.d. samples we have the following minimax lower bound:
[TABLE]
The crux of the proof relies on constructing competing models that approach each other at an optimal rate, as increases. We start with a preliminary lemma.
Lemma 6.5.
If and and if , then
The proof of the Lemma appears in Section B.13 of the supplement. We next state the Gilbert-Varshamov Lemma for convenience (see Raskutti et al. [2011] and references therein), which guides the construction of in our problem.
Lemma 6.6 (Gilbert-Varshamov).
Define to be the Hamming distance, i.e. with being the underlying dimension. Given any with , we can find such that:
- a) .
- b) .
- c) .
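To make the role of such a packing concrete, here is a minimal sketch that builds a separated family of sparse binary vectors greedily and verifies the pairwise Hamming separation. It is an illustration only: the Gilbert-Varshamov Lemma is an existence result with an explicit cardinality guarantee, whereas the greedy search below, the sparsity level `s`, and the separation threshold `s / 2` are assumptions in the style of Raskutti et al. [2011], not the constants of the lemma as stated in the paper.

```python
from itertools import combinations

def hamming(u, v):
    """Hamming distance: number of coordinates where u and v differ."""
    return sum(a != b for a, b in zip(u, v))

def greedy_sparse_packing(d, s, min_dist):
    """Greedily collect vectors in {0,1}^d with exactly s ones whose
    pairwise Hamming distance is at least min_dist."""
    packing = []
    for support in combinations(range(d), s):
        v = tuple(1 if i in support else 0 for i in range(d))
        if all(hamming(v, w) >= min_dist for w in packing):
            packing.append(v)
    return packing

# Hypothetical small instance: dimension 12, sparsity 6, separation s / 2.
d, s = 12, 6
packing = greedy_sparse_packing(d, s, min_dist=s / 2)
print(len(packing))
```

The brute-force enumeration is only feasible for small `d`; the lemma is what guarantees a large packing in general.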
Fix . To construct a packing set of , consider the following vectors:
[TABLE]
where , a subset of constructed using GV lemma. Let . For ,
[TABLE]
For notational simplicity define and . Fix . Denote as the joint distribution of where and
[TABLE]
for all . Now, for any , we have where . As by construction, we know where and independent of . Thus, we have .
Lemma 6.7.
The above family of distributions satisfies the margin assumption (Assumption 2.2) for all .
Proof.
Fix any .
[TABLE]
∎
Define the event . Then we have the following lemma:
Lemma 6.8.
If , then
[TABLE]
The proof follows the same arguments as that of Lemma B.7 and is skipped. Next, we upper-bound the KL divergence:
Lemma 6.9.
For any , we have
[TABLE]
where is the standard normal density.
Proof.
[TABLE]
We analyze each summand separately, starting with .
[TABLE]
Now, on to :
[TABLE]
Combining equations 6.2, 6.3 and 6.4 we conclude that:
[TABLE]
∎
The final step is a direct application of Fano’s inequality. According to our construction, is a packing set with . For notational simplicity, set
[TABLE]
The upper bound on the KL divergences, in conjunction with Fano’s inequality, gives:
[TABLE]
Taking , then we have:
[TABLE]
the last inequality holding true when , which is true for all large as .
The other inequality (i.e. we cannot estimate at a better rate than ), essentially follows from the same argument with taking . We skip the details here for the sake of brevity.
Appendix A Some important results
In this section we state, for the convenience of the reader, some results from the existing literature that we use in our proofs. Theorem A.1 is Theorem 2 of Massart et al. [2006], which provides an exponential concentration bound for ERM estimators with bounded loss functions. Lemma A.2 is a classical maximal inequality, which is used to bound the fluctuations of an empirical process. A simple proof of this Lemma can be found in Massart et al. [2006]. Theorem A.3 is a modified version of Theorem 8.5 of Massart, which we use for our model selection consistency results in case of . We provide the proof of Theorem A.3 in this supplement. Theorem A.4 is a version of Talagrand’s inequality (also known as Bousquet’s version of Talagrand inequality, see Bousquet [2002]) which we use to prove Theorem A.3.
Theorem A.1.
Let be i.i.d. observations taking values in the sample space and let be a class of real-valued functions defined on . Let be a loss function, and suppose that uniquely minimizes the expected loss function over . Define the empirical risk as , and . Let be the excess risk. Consider a pseudo-distance on satisfying . Finally, let be the collection of all functions such that, is non-decreasing, continuous with is non-increasing on and . Assume that:
- (1) There exists and a countable subset , such that for each , there is a sequence of elements of satisfying as , for every .
- (2) , for some function .
- (3) For every
[TABLE]
for every such that , where .
Let be the unique positive solution of . Let be the (empirical) minimizer of over and . Then, there exists an absolute constant such that for all , the following inequality holds:
[TABLE]
Lemma A.2 (A maximal inequality for weighted empirical process).
Let be a countable set, and such that . Let be a process indexed by and assume that the non-negative random variable has finite expectation for any positive number , where . Let be a non-negative function on such that is non-increasing on and satisfies for some positive number :
[TABLE]
Then, one has, for any positive number ,
[TABLE]
Theorem A.3 (Model selection consistency).
Let be independent observations taking their values in the measurable space with common distribution . Let be some set, , be a measurable function such that for every , is measurable. Assume that there exists some minimizer of over and define as the excess risk:
[TABLE]
for every . Let be the empirical risk:
[TABLE]
and be corresponding centered empirical process defined by
[TABLE]
Let be some pseudo-distance on such that
[TABLE]
Let be some, at most, countable collection of subsets of , each model admitting some countable subset such that for every , there exists some sequence of elements of satisfying as , for every . Let and belong to the class of functions (defined in Theorem A.1) for all . Assume on one hand
[TABLE]
and on the other hand one has for every and :
[TABLE]
for every positive such that . Let be the unique solution of the equation:
[TABLE]
with . Let be the empirical minimizer:
[TABLE]
and be some family of nonnegative weights such that
[TABLE]
Consider a penalty function pen: such that for every ,
[TABLE]
for some judiciously chosen constant . Define the chosen model as , i.e.:
[TABLE]
Also, define and . Then the penalized estimator satisfies the following inequality:
[TABLE]
where the constants depend on . This immediately implies:
[TABLE]
for some constant depending on .
Theorem A.4 (Bousquet’s version of Talagrand inequality).
Let be a countable family of measurable functions such that for some positive constants one has for all , and . Then for all :
[TABLE]
where .
Remark A.5.
The result above extends to an uncountable family if there exists a countable with the property that for every , there is a sequence belonging to such that pointwise. This is indeed the case for all applications of this result in our paper.
Appendix B Proofs of Theorems and Lemmas
B.0.1 Proof of Proposition 2.4
To prove Proposition 2.4, we first relate the excess risk to . Define for notational simplicity:
[TABLE]
for . We have,
[TABLE]
A straightforward derivative calculation implies that the supremum is attained at if and at otherwise. Hence we conclude:
[TABLE]
Combining this with Assumption 2.3 we conclude:
[TABLE]
which completes the proof.
B.1 Some sufficient conditions for Assumptions 2.3 and 2.8
In this subsection we provide some sufficient conditions for Assumptions 2.3 and 2.8. We break the analysis into two lemmas. Lemma B.1 below provides sufficient conditions for Assumption 2.3 and part (i) of Assumption 2.8. Lemma B.2 yields sufficient conditions for part (ii) of Assumption 2.8.
Lemma B.1.
Suppose that follows an elliptically symmetric distribution centered at 0, with density , where is a non-negative function. Assume that:
[TABLE]
where does not depend on . Then satisfies Assumption 2.3 and part (i) of Assumption 2.8.
Proof.
First, we prove that for with the above displayed condition holding. Observe that depends on the two-dimensional geometry of , i.e. only on the distribution of . To make the calculations easier, we transform into where the first two coordinates of correspond to . Consider the following orthogonal matrix:
[TABLE]
where forms an orthonormal basis of (for example, the vectors can be constructed using the Gram-Schmidt algorithm). If we define , then and where . Then the probability of the wedge shaped region becomes:
[TABLE]
Now, for . Hence we get,
[TABLE]
Using this we obtain,
[TABLE]
It can be easily seen (i.e. by differentiating) that the function is an increasing function of for . More precisely, observing that
[TABLE]
we conclude, for :
[TABLE]
In conjunction with B.1, this gives:
[TABLE]
Finally using the fact that
[TABLE]
we have . Combining these, we conclude:
[TABLE]
Now, on to general . By our assumption on in the statement of the lemma, i.e. for some spherically symmetric random variable . We know
[TABLE]
for some with . Using the relation we have:
[TABLE]
which again falls back to situation. The upper bound can be established via a similar calculation, where we need a finite upper bound on : this is given by . ∎
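The change of basis used in the proof of Lemma B.1 relies on extending two given directions to an orthonormal basis. Below is a minimal sketch of that construction via (modified) Gram-Schmidt in plain Python. The function names and the example vectors are hypothetical, and the sketch assumes the two input vectors are linearly independent; it is not the paper's specific matrix.

```python
import math

def dot(u, v):
    """Euclidean inner product."""
    return sum(a * b for a, b in zip(u, v))

def gram_schmidt_extend(vectors, dim, tol=1e-10):
    """Orthonormalize `vectors`, then extend to an orthonormal basis of R^dim
    by running Gram-Schmidt over the standard basis vectors."""
    candidates = [list(v) for v in vectors]
    candidates += [[1.0 if i == j else 0.0 for i in range(dim)] for j in range(dim)]
    basis = []
    for v in candidates:
        w = list(v)
        for q in basis:
            c = dot(w, q)  # remove the component of w along q
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:  # skip candidates already in the current span
            basis.append([wi / norm for wi in w])
        if len(basis) == dim:
            break
    return basis

# Hypothetical directions standing in for the two parameter vectors.
u = [1.0, 2.0, 0.0, 1.0]
v = [0.0, 1.0, 1.0, 1.0]
P = gram_schmidt_extend([u, v], dim=4)
# In the new coordinates, v has (numerically) zero entries beyond the first two.
print([round(dot(row, v), 12) for row in P])
```

Stacking the returned vectors as rows gives an orthogonal matrix, so the transformed covariate vector depends on the two directions only through its first two coordinates.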
Lemma B.2.
Assume that the function satisfies that
[TABLE]
for some constant a.e. with respect to the measure of and the distribution of follows a consistent family of elliptical distributions with . Also assume that (the density component corresponding to the two dimensional marginal of ) is a decreasing function on and the eigenvalues of the orientation matrix satisfy:
[TABLE]
for all . Then, under part (i) of Assumption 2.8 we have
[TABLE]
for all where for some constant defined in the proof.
Proof.
As in the proof of proposition 2.4 we have (with the same notation):
[TABLE]
∎
Remark B.3.
The Lipschitz type condition i.e. controls how the function varies around the true hyperplane. This condition is easily satisfied if we assume that the conditional density of given has a uniform upper bound over all and dimension. Note that the two conditions in the above lemma and Assumption 2.8 are readily satisfied, for example, for a broad class of elliptically symmetric densities centered at 0.
B.2 Proof of Theorem 2.6
In this proof, will denote a generic constant (not depending on ) which may change from line to line. We use Theorem A.1 to establish the rate of convergence of the maximum score estimator. In our problem, the set of classifiers
[TABLE]
and . We define the following affine transformations of our score functions:
1. .
2. .
3. .
4. .
Also, note that in Theorem A.1 is in our situation and the excess risk is . Next we argue that the assumptions of Theorem A.1 hold in our situation. For the first assumption, take and take where is a countable dense subset of . It is easy to check that the convergence criterion in condition (1) of Theorem A.1 is satisfied on the set where is the set of all such that for all . Since the random variable is continuous and is countable, has probability 1, and this is sufficient for the conclusions of the theorem to hold. Also note that the collection is a VC class of functions with VC dimension .
We apply Theorem A.1 with the distance metric . From Proposition 2.4:
[TABLE]
Next, we construct a function which satisfies condition (2) of Theorem A.1 with respect to the distance . Note that we need to satisfy:
[TABLE]
or inverting it,
[TABLE]
Hence, from Proposition 2.4 we need to satisfy:
[TABLE]
which further implies:
[TABLE]
Parametrizing we have:
[TABLE]
Hence inverting:
[TABLE]
which immediately implies as defined in Theorem A.1.
It also follows that this pseudo-distance provides an upper bound on the variability of the difference between the loss functions at any two :
[TABLE]
Finally we need to find which satisfies condition (3) of Theorem A.1. As is a VC class of functions, we can follow the same line of argument as in Section 2.4 of Massart et al. [2006]:
[TABLE]
for all . The quantity in the above display is the VC-dimension of the class of all half-spaces in where . Solving the equation we get:
[TABLE]
Hence we need to find such that:
[TABLE]
Solving these two inequalities and ignoring constants we get:
[TABLE]
Using the above we conclude using Theorem A.1:
[TABLE]
for all . Here also is a different constant than before, which is now a function of some universal constant and , but it does not depend on . Using Proposition 2.4 and equation (B.3), we get the following concentration bound:
[TABLE]
Consequently:
[TABLE]
which, along with Assumption 2.3 yields:
[TABLE]
and
[TABLE]
which can be rewritten using Assumption 2.3 as:
[TABLE]
Combining equation (B.4) and (B.5) we conclude:
[TABLE]
for all and for some constant not depending on with , which completes the proof of the concentration bound.
The upper bound on the expectation follows from this exponential tail bound using the following calculation:
[TABLE]
which completes the proof of the minimax upper bound.
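The passage from an exponential tail bound to a bound in expectation follows a generic pattern, sketched here with placeholder symbols that are not the paper's notation: $Z \ge 0$ stands for the normalized estimation error, and $t_0, C, c > 0$ are assumed constants such that $\mathbb{P}(Z \ge t) \le C e^{-ct}$ for all $t \ge t_0$.

```latex
\mathbb{E}[Z]
  = \int_0^\infty \mathbb{P}(Z \ge t)\,dt
  \le t_0 + \int_{t_0}^\infty C e^{-ct}\,dt
  = t_0 + \frac{C}{c}\,e^{-c t_0}.
```

The right-hand side is a constant that does not depend on the sample size, so the expectation of the unnormalized error is of the same order as the rate used to normalize it.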
B.3 Proof of Theorem 2.11
To obtain a lower bound on the minimax error, we use Assouad’s Lemma (Assouad [1983]), which we state below for convenience:
Lemma B.4 (Assouad’s Lemma).
Let (or ) be the set of all binary sequences of length . Let be a set of measures on some space and let the corresponding expectations be . Then:
[TABLE]
where is an estimator based on i.i.d. observations , denotes the -fold product measure of , is the Hamming distance and means . (For some discussions and applications of this lemma, see Tsybakov [2009].)
To apply this lemma in our model, define for small :
[TABLE]
We will motivate the choice of in the later part of the proof. Observe that, is the same for all and equals . For notational simplicity, define . Now, for any , define and . This establishes a 1-1 correspondence between and , with . For any define the joint distribution of as:
2. 2.
The Gaussian distribution of trivially satisfies Assumption (A2). In the following lemma we show that this construction also satisfies Assumption (A1).
From now on, we define for notational simplicity.
Lemma B.5.
The above construction of satisfies a part of Assumption 1, i.e.
[TABLE]
Proof.
Fix such that . Then,
[TABLE]
The last inequality is valid when , which happens for sufficiently small. ∎
We use the notation if and differ only in position for . So, in order to use Assouad’s lemma, we need an on when for any . Fix and and such that . Using the standard relation between the total variation norm and Hellinger distance, we have:
[TABLE]
To make the minimax lower bound non-trivial, we will choose in a way that ensures . Towards that, we need the following lemma:
Lemma B.6.
If = Ber() and = Ber() with , then where .
The proof of this Lemma can be found in Section B.12 of the supplement. For the rest of the proof, define
[TABLE]
for . Now,
[TABLE]
We next divide the domain of into two sub-parts and compute the corresponding values of , on these sub-parts.
Case 1: .
Case 2: . Note that, in this case, , if , otherwise.
Lemma B.7.
Under Case 1, where .
Proof.
First assume that, . Then,
[TABLE]
Next, consider the case that . Then, but . Hence,
[TABLE]
The case when follows in the exact same manner, by symmetry. ∎
We are now in a position to tackle as shown below.
[TABLE]
We will analyze the expectation of each summand separately. Define i.e. is a vector of dimension which we obtain by removing the first co-ordinate of , and let be defined similarly in terms of . We have:
[TABLE]
For the second part, observe that,
[TABLE]
Using this observation, we get,
[TABLE]
where is an absolute constant. Putting together B.7, B.8 and B.9, we get,
[TABLE]
Set . If we choose , then . So we have
[TABLE]
for all large , as . Now we can relate Hamming distance to distance via
[TABLE]
and use Assouad’s lemma to deduce:
[TABLE]
for some constant . Finally, let be any estimator assuming values in . Define to be the projection of on the hypercube, i.e.
[TABLE]
Then for any we have:
[TABLE]
Using this relation we can conclude that:
[TABLE]
where . To prove that the minimax rate cannot be improved upon , one can resort to a minimax construction taking . The construction and the rest of the proof follow a similar pattern as above, and are skipped for the sake of brevity.
B.4 Proof of Theorem 2.14
This proof is based on Theorem A.3. Recall that is the collection of all models with for . As mentioned previously, the model has VC dimension . Following the same line of argument as in the proof of Theorem 2.6 we can conclude:
[TABLE]
and the values of can be taken as:
[TABLE]
We know the function increases between and then decreases. As , the sequence of VC dimensions is increasing: . Next, we establish that for all large . Assume to the contrary that . Then:
[TABLE]
which is a contradiction since the LHS goes to 1 as . This immediately implies that:
[TABLE]
and
[TABLE]
This proves that
[TABLE]
Hence in our case,
[TABLE]
Now we need to choose a penalty function such that:
[TABLE]
If we choose
[TABLE]
then a permissible penalty function is given by , provided we can show that . Towards that end:
[TABLE]
This ensures that our choice of is valid. Applying Theorem A.3 along with the penalty function we obtain the following concentration bound on the excess risk:
[TABLE]
Taking :
[TABLE]
Putting this back in equation (B.10) we get:
[TABLE]
which further implies:
[TABLE]
We now argue that the remainder term as . First, observe that as . This is because:
[TABLE]
As both and diverge as , it suffices to establish , to demonstrate that . Towards that end:
[TABLE]
This completes the proof of the concentration bound on the excess risk. Using Proposition 2.4 we have:
[TABLE]
Thus, we have:
[TABLE]
and
[TABLE]
Combining equations (B.12) and (B.13) we conclude:
[TABLE]
for all , for some constant not depending on , with and .
The proof of the minimax upper bound for this estimation problem follows immediately from the exponential concentration bound on the estimation error:
[TABLE]
B.5 Proof of Corollary 2.16
In Theorem 2.14 we have established:
[TABLE]
for some constant which does not depend on . Taking we get:
[TABLE]
because as . Hence, if , we conclude from the above concentration bound:
[TABLE]
which completes the proof.
B.6 Proof of Proposition 3.1
By the definition of population loss function we have:
[TABLE]
where we define as the rank of the scalar number among the numbers in increasing order. Our claim is that for any realization of the vectors , we have:
[TABLE]
To observe this, first note that from Assumption 1.1, the ordering of the vectors is the same as . Hence the above inequality follows from applying the rearrangement inequality. The proof of the proposition also follows immediately from the inequality.
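The rearrangement inequality invoked above says that a sum of pairwise products is largest when the two sequences are sorted in the same order. The following minimal sketch checks this by brute force on a small example; the sequences `a` and `b` are arbitrary illustrative values (with `b` playing the role of ranks), not quantities from the paper.

```python
from itertools import permutations

def paired_sum(a, b):
    """Sum of a[i] * b[i] over all positions i."""
    return sum(x * y for x, y in zip(a, b))

def similarly_ordered(a, b):
    """Rearrange b so that its k-th smallest entry sits where a has its k-th smallest entry."""
    order = sorted(range(len(a)), key=lambda i: a[i])
    b_sorted = sorted(b)
    out = [None] * len(a)
    for rank, i in enumerate(order):
        out[i] = b_sorted[rank]
    return out

def brute_force_max(a, b):
    """Largest paired sum over all rearrangements of b (feasible only for short lists)."""
    return max(paired_sum(a, perm) for perm in permutations(b))

a = [0.3, -1.2, 2.5, 0.9]   # hypothetical index values
b = [4, 1, 3, 2]            # hypothetical ranks
print(paired_sum(a, similarly_ordered(a, b)), brute_force_max(a, b))
```

The two printed values should agree up to rounding, which is the content of the inequality in this small case.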
B.7 Proof of Proposition 3.8
First we show that under Assumption 3.2 we can lower bound the excess risk in terms of the probability of a suitably chosen wedge shaped region. For and for any define the region as:
[TABLE]
where for any matrix , its row is denoted by . Then we have:
[TABLE]
Defining to be:
[TABLE]
we obtain:
[TABLE]
As this inequality is true for any , optimizing the same way as in the proof of Proposition 2.4 we conclude that:
[TABLE]
This concludes the proof.
B.8 Proof of Theorem 3.9
The proof of this theorem is quite similar to the proof of Theorem 2.6. Hence we will skip some details here. As before, we work with the distance metric over the parameter space . Borrowing the notations from Theorem A.1 and using Proposition 3.8 we have:
[TABLE]
for . Hence the function satisfies:
[TABLE]
Parametrizing we get:
[TABLE]
Now consider the class of functions
[TABLE]
As is an average of functions, where each function constitutes a VC class of VC dimension of order , the collection has bounded uniform entropy integral. More precisely we have for any measure :
[TABLE]
as each function is bounded by and is the covering number of with respect to measure. The variability of the centered function can be bounded as:
[TABLE]
Finally to apply Theorem A.1 we need to obtain which satisfies condition (2) of that Theorem. Using Theorem 8.7 of Sen [2018] one can choose as:
[TABLE]
Now from Theorem A.1 we need to find for all the values of such that . From the above expression of , one can immediately conclude that:
[TABLE]
Hence we can take
[TABLE]
such that condition (3) of Theorem A.1 will be satisfied for all such that . Now we need to solve the equation
[TABLE]
to get . From the expression of we have:
[TABLE]
The above inequality will be satisfied if:
[TABLE]
Ignoring the constant , this will be satisfied if we take to be:
[TABLE]
Using this value of we conclude from Theorem A.1:
[TABLE]
for all and for some constant which does not depend on . Now Proposition 3.8 implies along with equation (B.17):
[TABLE]
for all , where
[TABLE]
and for some constant but does not depend on . This concludes the proof.
B.9 Proof of Theorem 3.12
The proof technique is essentially similar to that of Theorem 2.14, hence we will skip some details. As in the proof of the previous theorem, the following distance function will be used heavily in the proof:
[TABLE]
From Assumption 3.11 we confine ourselves to searching for the best model up to sparsity level . As in the proof of Theorem 2.14 we will use Theorem A.3 to show the model selection consistency here. Recall that is defined to be the collection of all the models with all the such that . Hence,
[TABLE]
Now is a sum of order many functions each of which has VC dimension of order (argued in Theorem 2.14). Using the same argument as in Theorem 3.9 we say that,
[TABLE]
As before, define . Using the same calculation as in the proof of Theorem 3.9 we conclude:
[TABLE]
and
[TABLE]
Hence, following a calculation similar to that in Theorem 3.9, the value of can be taken as:
[TABLE]
Now we have to take the penalty function so that it satisfies:
[TABLE]
for some constant (as mentioned in Theorem A.3), where is the same as defined in Theorem A.3. Taking a valid choice of penalty function will be:
[TABLE]
Before applying Theorem A.3 with this choice of penalty function, we need to argue that this choice of is valid, i.e.
[TABLE]
for some constant . From the definition of we have:
[TABLE]
Similar analysis as in Theorem 2.14 yields: and . Hence we have:
[TABLE]
as . Hence we can find some constant (in fact one can take for all large ) such that equation (B.19) holds. Now, using Theorem A.3 we conclude:
[TABLE]
where . Now in the RHS of equation (B.20) we can replace the infimum by its value at , the true sparsity of and get the following concentration bound on the excess risk:
[TABLE]
Using Proposition 3.8 we conclude:
[TABLE]
where
[TABLE]
via the same argument as in Theorem 3.8 and
[TABLE]
This concludes the proof of the Theorem with .
B.10 Proof of Theorem A.3
The proof is quite long and involved. Before going into details, we define some notation which will be frequently used throughout. Fix and satisfying the condition of the theorem:
1. For all , .
2. .
3. for all .
Recall that is defined to be the optimal model, i.e.:
[TABLE]
Fix . Then by definition of :
[TABLE]
which implies:
[TABLE]
The rest of the proof is organized as follows. We first show that:
[TABLE]
for any , which implies from the union bound and the fact that ,
[TABLE]
Now, using equation (B.22), we obtain with probability larger than :
[TABLE]
Multiplying both sides by 2, we get:
[TABLE]
As this is true for any , we conclude:
[TABLE]
which implies for all :
[TABLE]
Integrating with respect to we get the following upper bound on the expectation:
[TABLE]
Now we prove the bound (B.23), for which we use Theorem A.4. Our function class will be:
[TABLE]
First, we observe that the functions are uniformly bounded in terms of :
[TABLE]
Next, we bound the variability of the functions. Define . We have:
[TABLE]
Applying Theorem A.4 yields with probability larger than :
[TABLE]
Now we bound :
[TABLE]
For the next analysis, define . We first analyze as follows:
[TABLE]
Next we analyze using Lemma A.2:
[TABLE]
where we define . We can relate to in the following way:
[TABLE]
Hence we have:
[TABLE]
Using Lemma A.2 we conclude:
[TABLE]
Combining equation B.28 and B.27 we conclude that:
[TABLE]
Putting this bound in equation B.26 we have:
[TABLE]
For large enough , clearly . This completes the proof.
B.11 Discussion on Lemma 6.1
Lemma B.8.
For any fixed , define to be -angular spherical cap around , i.e.
[TABLE]
Then we have
[TABLE]
for . The last inequality follows from the assumption .
This lemma is a well-known fact in convex geometry. Note that Lemma 6.1 and Lemma B.8 are on different scales, as one of them involves the angle and the other one involves the distance. In the following Lemma we bridge this gap:
Lemma B.9.
For and , we have:
[TABLE]
Proof.
Note that where . If and then . Hence we have:
[TABLE]
which completes the proof. ∎
Finally using Lemma B.9 we get the upper bound on . The lower bound can also be found in convex geometry literature. Combining them together, we get Lemma 6.1.
B.12 Proof of Lemma B.6
Define . From the definition of Hellinger distance between two Bernoulli Random variables, we get,
[TABLE]
In the second last line we use mean value theorem:
[TABLE]
for some between [math] and . As our parameter space is , we have for any choice of . Hence, and which immediately implies and , which, in turn, validates and . Using this in equation B.30 we conclude:
[TABLE]
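The two facts used around Lemma B.6 can be checked numerically for Bernoulli distributions: total variation is controlled by the Hellinger distance, and the squared Hellinger distance is of the order of the squared parameter difference when the parameters stay away from 0 and 1. The sketch below is illustrative only. It assumes the convention $H^2 = \sum (\sqrt{P} - \sqrt{Q})^2$ without a factor 1/2, a hypothetical parameter range of [1/4, 3/4], and the constant 2 obtained from the mean value theorem on that range; the paper's own constants and range are not recoverable from the text.

```python
import math

def hellinger_sq_bernoulli(p, q):
    """Squared Hellinger distance between Ber(p) and Ber(q), using the
    convention H^2 = sum over outcomes of (sqrt(P) - sqrt(Q))^2."""
    return (math.sqrt(p) - math.sqrt(q)) ** 2 + (math.sqrt(1 - p) - math.sqrt(1 - q)) ** 2

def tv_bernoulli(p, q):
    """Total variation distance between Ber(p) and Ber(q)."""
    return abs(p - q)

# Assumed parameter range [1/4, 3/4], sampled on a coarse grid.
grid = [0.25 + 0.05 * i for i in range(11)]
worst_ratio = 0.0
for p in grid:
    for q in grid:
        if p != q:
            ratio = hellinger_sq_bernoulli(p, q) / (p - q) ** 2
            worst_ratio = max(worst_ratio, ratio)
print(worst_ratio)  # stays below 2 on this range
```

On this grid the ratio of the squared Hellinger distance to the squared parameter difference remains bounded, which is the qualitative content needed for the Hellinger bound in the proof.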
B.13 Proof of Lemma 6.5
[TABLE]
Appendix C A discussion of the model with intercept
The binary choice model in the presence of intercept can be formulated as follows:
1. with almost surely.
2. where .
The maximum score estimator can be defined as:
[TABLE]
with the population score function being . In this model we can write the function . We take our parameter space to be . For notational simplicity, define . The transition assumption (Assumption 2.2) remains unchanged under the intercept model. Assumption 2.3 can be generalized for this model as follows:
Assumption C.1.
For all sufficiently close to ,
[TABLE]
Consider the linear transformation where:
[TABLE]
with being orthogonal extensions to a basis of . Note that depends on , but this will be suppressed in the notation. The following lemma presents conditions on the distribution of under which Assumption C.1 is valid.
Lemma C.2.
Suppose there exists and a constant such that for all where and the bound is independent of and the dimension . Then
[TABLE]
holds for and for all .
As the wedge condition is only valid in a neighborhood of the true , we need to establish the consistency of the maximum score estimator in order to prove the rate of convergence results.
Lemma C.3.
Under Assumption 2.2 and Assumption C.1 we have
[TABLE]
when . Furthermore, under Assumption 2.13, the result continues to hold when .
We next argue that the rate of convergence results (Theorem 2.6 and Theorem 2.14) hold for the intercept model by slight modifications to the previous proofs.
Theorem C.4.
Under Assumption 2.2, C.1 and 2.13 we have:
[TABLE]
where
[TABLE]
for the slowly growing regime , and
[TABLE]
for the fast growing regime .
Under the intercept model, our class of classifiers is:
[TABLE]
The VC dimension of this class is (For the VC dimension is at most ). By the same arguments as in the proof of Proposition 2.4 we can show that
[TABLE]
As before, we next apply Theorem A.1. The first condition of the theorem remains valid as our parameter space admits a countable dense subset. The distance function changes to the following:
[TABLE]
The remainder of the proof remains completely unchanged as can be verified by inspection.
Remark C.5.
It is not clear whether the minimax upper bound results in Theorems 2.6 and 2.14 hold. Recall that, to prove the minimax upper bound in these theorems, we used an exponential tail bound on the probability that for every , derived via Theorem A.1, using the fact that the wedge condition Assumption 2.3 held for all . In the intercept model, the wedge condition only holds on a restricted part of the parameter space, and the exponential tail bound cannot be established for all . Nevertheless, the minimax lower bound rates obtained in Theorems 2.11 and 2.18 remain exactly the same, as we can take in the minimax constructions that arise in their proofs. Of course, the space of distributions changes, as we have introduced the intercept. We can rewrite these results as follows.
Theorem C.6.
For the slowly growing regime , we have:
[TABLE]
for some constant that does not depend on . For fixed, the lower bound is of the order . The supremum is taken over all distributions corresponding to binary response models satisfying Assumptions 2.2 and C.1 for some regression parameter (viewed as a functional of ) but with held fixed.
Theorem C.7.
For the fast growth regime , we have:
[TABLE]
for some constant not depending on . For the case fixed, the lower bound is of the order of . The supremum is taken over the same class of distributions as in Theorem C.6.
C.0.1 Which distributions satisfy Assumption C.1?
As stated in Lemma C.2 we need the joint density of to be lower bounded by some non-negative constant to establish the lower bound. Here we show that, under fairly general restrictions, any elliptically symmetric distribution and satisfies the assumption.
Lemma C.8.
Suppose the distribution of belongs to a consistent family of elliptical distributions with mean 0 i.e. the density of has the form:
[TABLE]
with being a full rank matrix. If (density generator of two dimensional marginal of ) is a decreasing function on with for all and there exist constants such that
[TABLE]
for all then satisfies the assumption of Lemma C.2.
Proof.
The density of is where . Then density of is , where is the leading block of . Now, if we confine ourselves on a ball of radius then:
[TABLE]
Hence and assumption (A2:intercept) is satisfied. ∎
Lemma C.9.
Suppose the elements of the random vector are independent and each component has a log concave density symmetric around 0 and variance 1. Then, there exist constants such that on a circle of radius . Hence, Assumption (A2:intercept) is satisfied for all such that .
Proof.
Denote the density of as . From the strong unimodality property of log concave densities, each has mode at 0. Also we have
[TABLE]
for all (see equation (2.2) of Bobkov and Chistyakov [2015]). Hence under variance = 1. Note that, as each component of has a symmetric strongly unimodal density, so does for any . Consider as defined before Lemma C.2. Let , then is also strongly unimodal with mode at [math] (recall that the density of a linear combination of random variables with log concave densities is also log concave, and any symmetric log concave density has mode at 0). As the marginals of a log-concave density are log-concave, the density of is also log-concave, i.e.
[TABLE]
where is a log-concave function on with mode at 0. Then by the Jacobian transformation:
[TABLE]
where . Some properties of ’s are immediate:
1. is concave.
2. is symmetric around 0 as is symmetric around 0.
As has a two dimensional log concave density in and , there exists an absolute constant such that . (e.g. see Ball [1988]). Next, we show that, there exists a universal constant such that for all , for all which implies on a circle . Fix . Denote by . Then, due to concavity, lies below the line joining and i.e.
[TABLE]
for . This implies:
[TABLE]
The last equation follows from the lower bound on the mode of two dimensional log concave density (see Lemma 6 of Ball [1988]). Hence . Define . As is strictly increasing on we conclude . This immediately implies for from the fact that has mode 0. Also the value of does not depend on , which implies, . This completes the proof with . ∎
C.1 Proof of Lemma C.3
First we show that as ,
[TABLE]
i.e. the class of functions is a Glivenko-Cantelli class, which is equivalent to showing (for details see Pollard [1981]):
1. There exists , an envelope of such that .
2. for all , where is the covering number of the set with respect to norm.
Clearly is an integrable envelope of . Now is a VC class of VC dimension . Hence, we have:
[TABLE]
for some universal constant and . Using this, we have:
[TABLE]
if which completes the proof.
In the previous step we have established that uniformly over . Now we need to prove converges to . Towards that we need the following Lemma:
Lemma C.10.
Given any and such that , we can find with such that
[TABLE]
We defer the proof of this lemma to the next subsection. Using the same proof as Proposition 2.4 we have:
[TABLE]
which is now true for under the assumptions of Lemma C.3. Suppose , then using Lemma C.10 we have:
[TABLE]
[TABLE]
which completes the proof for going to 0.
The same proof works for under our assumption , because all that the above proof really needs is the condition where is the VC dimension of the set of classifiers under consideration. When , under the sparsity assumption, and therefore by our assumption in this case as well.
C.2 Proof of Lemma C.10
Under the assumption that in our model, we have for any :
[TABLE]
where with . Now fix with . Define for some which will be chosen later. Suppose :
**Case 1:** Suppose . Then
[TABLE]
**Case 2:** Suppose . Then
[TABLE]
Hence . Now by triangle inequality and using the fact that . Therefore,
[TABLE]
To conclude the proof we choose such that i.e. .
C.3 Proof of Lemma C.2
From the transformation , we can write and where . We divide the proof into three cases:
**Case 1:** Suppose . The probability of the wedge-shaped region can be written as:
[TABLE]
which is the probability of the region between the straight lines: and . The intersection of these two lines is , the line meets the -axis at and the line meets the -axis at . From our assumptions we have (say). Hence, for all , indicating that the intersection points with the -axis (denoted by J, K) lie within a circle of radius around the origin.
**Case 1.1:** Suppose the point I is inside the circle of radius . The points are inside by definition.
Denote to be the midpoint of . If we denote the angle to be , then (which directly follows from the slope of the lines and from the observation that is isosceles) and . The length of the side (diameter of the circle) which implies:
[TABLE]
as . On the other hand we have following upper bound on :
[TABLE]
Combining C.2 and C.4 we have . Define to be the point where extended meets the circle and to be the point where extended meets the circle. ( may be equal to ). The triangle is inside the circle and
[TABLE]
Now as the maximum possible distance of from the origin is and is on the opposite side of with respect to the -axis. Hence, . Next,
[TABLE]
Recall that from C.3 it is easy to see . Hence, we have , which implies that:
[TABLE]
**Case 1.2:** Suppose the intersection point is outside of the circle.
Here, the length of is as is outside the circle and the maximum possible distance of from the origin is . Using this, we have:
[TABLE]
Also, from equation C.3, we obtain . Combining these bounds, we have . Let the line cut the circle at . Consider the triangle . Then the area of this triangle is:
[TABLE]
where . By the same logic as before, , and . Hence, area of . Using this, we have:
[TABLE]
**Case 2:** Suppose and . Then the lines and meet on the -axis, i.e. .
Consider the triangle where and are the intersection points of the lines with the circle. Now the maximum possible distance of from the origin is which implies . From C.5 we have . Combining these, we get:
[TABLE]
**Case 3:** Finally suppose and .
Consider the rectangle . Here . Also . Hence
[TABLE]
which establishes:
[TABLE]
Combining equations C.6, C.7, C.8 and C.9 we conclude that Assumption (A2:upper) is valid for this intercept model with .
Appendix D Another version of rate theorem
In this section we present a version of Theorem 3.2.5 of Van Der Vaart and Wellner [1996], which provides the rate of convergence of a generic M-estimator along with an exponential tail bound under appropriate conditions. This theorem can be applied instead of Theorem A.1 to establish the rate of convergence along with a finite-sample concentration bound.
**Theorem D.1.**
Suppose is a stochastic process indexed by a set and is a deterministic process which takes the form: and . Define . Assume for every in a neighborhood of :
[TABLE]
Suppose that for every and for sufficiently small , the centered process satisfies:
[TABLE]
for functions such that is decreasing for some . Let satisfy
[TABLE]
for every . If the sequence takes value in and satisfies and converges to [math] in outer probability, then . If all the above conditions are valid for all and , then we don’t need consistency and we can obtain the following finite sample concentration bound:
[TABLE]
In addition, assume (w.l.o.g. take for simplicity of notation) for all and the existence of such that:
[TABLE]
Then, the following exponential concentrations obtain, for all :
1. If and , then
[TABLE]
2. If and , then
[TABLE]
Here the constants may be different in Case 1 and Case 2, but they don’t depend on .
Proof.
For simplicity let’s assume the conditions are valid for . We establish the finite sample concentration here. Fix : Define for . Also define and without loss of generality assume . We also need the following quantities to apply Talagrand’s inequality:
[TABLE]
[TABLE]
[TABLE]
[TABLE]
Now we manipulate the last sum. For ease of understanding we divide the rest of the proof into three parts. First, assume . From the assumption of the theorem, there exists such that . As , for all , .
[TABLE]
Finally assume that . Then for all which implies . Putting this we get:
[TABLE]
Next we solve the series for the case when . We assume here as stated in the theorem. As before, let's assume . Then for all which implies . Also, as , we have for all large . Hence we have:
[TABLE]
Finally let’s assume . Then we have the assumption . Assume . Then for all which implies . Here as we have for all large . Then we have:
[TABLE]
**Remark D.2.**
Here we have used the inequality and for notational simplicity. For a more exact bound, one can use the fact that for all for all .
∎
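To see how a rate condition of the kind used in Theorem D.1 determines the rate in practice, the sketch below solves it numerically. It assumes the condition takes the classical form of Theorem 3.2.5 of Van Der Vaart and Wellner [1996], namely $r_n^2 \phi_n(1/r_n) \le \sqrt{n}$, and uses the modulus $\phi_n(\delta) = \delta^{\alpha}$ purely as an illustration; the function name `solve_rate`, the bracket `[lo, hi]` and the iteration count are hypothetical choices, not taken from the paper.

```python
import math


def solve_rate(n, phi, lo=1.0, hi=1e12, iters=200):
    """Approximate the largest r in [lo, hi] with r**2 * phi(1/r) <= sqrt(n).

    Bisection on a log scale. It relies on r -> r**2 * phi(1/r) being
    increasing, which holds when delta -> phi(delta) / delta**alpha is
    decreasing for some alpha < 2.
    """
    target = math.sqrt(n)

    def g(r):
        return r * r * phi(1.0 / r)

    if g(lo) > target:
        raise ValueError("no feasible rate in the bracket [lo, hi]")
    for _ in range(iters):
        mid = math.sqrt(lo * hi)  # geometric midpoint of the bracket
        if g(mid) <= target:
            lo = mid
        else:
            hi = mid
    return lo


# Illustration (assumed modulus): phi(delta) = delta**alpha with alpha = 1/2.
alpha = 0.5
n = 10**6
numeric = solve_rate(n, lambda delta: delta**alpha)
closed_form = n ** (1.0 / (4.0 - 2.0 * alpha))  # solves r**(2 - alpha) = sqrt(n)
print(numeric, closed_form)
```

For a power modulus the equation $r^{2-\alpha} = \sqrt{n}$ gives $r_n = n^{1/(4-2\alpha)}$ in closed form, so the numerical solver is only useful when $\phi_n$ has a less convenient shape, for example when it carries logarithmic factors or depends on the dimension.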
References
- Abrevaya and Huang [2005]. Jason Abrevaya and Jian Huang. On the bootstrap of the maximum score estimator. Econometrica, 73(4):1175–1204, 2005.
- Assouad [1983]. Patrice Assouad. Deux remarques sur l’estimation. Comptes rendus des séances de l’Académie des sciences. Série 1, Mathématique, 296(23):1021–1024, 1983.
- Bajari et al. [2008]. Patrick Bajari, Jeremy T Fox, and Stephen P Ryan. Evaluating wireless carrier consolidation using semiparametric demand estimation. Quantitative Marketing and Economics, 6(4):299, 2008.
- Ball [1988]. Keith Ball. Logarithmically concave functions and sections of convex sets in R^n. Studia Mathematica, 88(1):69–84, 1988.
- Bickel et al. [2009]. Peter J Bickel, Ya’acov Ritov, and Alexandre B Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732, 2009.
- Bobkov and Chistyakov [2015]. Sergey G Bobkov and Gennadiy P Chistyakov. On concentration functions of random variables. Journal of Theoretical Probability, 28(3):976–988, 2015.
- Bousquet [2002]. Olivier Bousquet. A Bennett concentration inequality and its application to suprema of empirical processes. Comptes Rendus Mathematique, 334(6):495–500, 2002.
- Briesch et al. [2002]. Richard A Briesch, Pradeep K Chintagunta, and Rosa L Matzkin. Semiparametric estimation of brand choice behavior. Journal of the American Statistical Association, 97(460):973–982, 2002.
