Sparse vertex discriminant analysis: Variable selection for biomedical classification applications

Alfonso Landeros; Seyoon Ko; Jack Z. Chang; Tong Tong Wu; Kenneth Lange

PMC · DOI:10.1016/j.csda.2025.108125·June 1, 2025

Sparse vertex discriminant analysis: Variable selection for biomedical classification applications

Alfonso Landeros, Seyoon Ko, Jack Z. Chang, Tong Tong Wu, Kenneth Lange

PDF

Open Access

TL;DR

This paper introduces a new method for classifying biomedical data by selecting important variables while handling complex data structures.

Contribution

The paper proposes two versions of sparse Vertex Discriminant Analysis (VDA) for class-specific variable selection in biomedical classification.

Findings

01

Sparse VDA adapts well to class-specific variable selection in simulated and real datasets.

02

The method is particularly effective for cancer classification using gene expression data.

03

Nonconvex distance-to-set penalties control the number of active variables in VDA.

Abstract

Modern biomedical datasets are often high-dimensional at multiple levels of biological organization. Practitioners must therefore grapple with data to estimate sparse or low-rank structures so as to adhere to the principle of parsimony. Further complicating matters is the presence of groups in data, each of which may have distinct associations with explanatory variables or be characterized by fundamentally different covariates. These themes in data analysis are explored in the context of classification. Vertex Discriminant Analysis (VDA) offers flexible linear and nonlinear models for classification that generalize the advantages of support vector machines to data with multiple classes. The proximal distance principle, which leverages projection and proximal operators in the design of practical algorithms, handily facilitates variable selection in VDA via nonconvex distance-to-set…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Diseases1

cancer

Figures8

Click any figure to enlarge with its caption.

A](#F3), circles and diamonds). Yet, our method is aided by increasing the correlation strength between all predictors from $[eqn]$ to $[eqn]$ . In the overdetermined regime $[eqn]$ , a high degree of correlation obscures support recovery unless the underlying signal is strong ([Figure 3A](#F3), [3B](#F3), orange markers). Comparing VDA_HomL0_ against VDA_L1_ further clarifies these results as the two methods differ only in the choice of penalty. The L1 version of VDA shrinks the fitted slope matrix $[eqn]$ so that the linear predictions based solely on the ground truth support $[eqn]$ , $[eqn

Keywords

ClassificationVariable selectionProximal algorithmsDistance majorizationℓ0-constrained optimization

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsGene expression and cancer classification · Face and Expression Recognition · AI in cancer detection

Full text

Introduction

Advances in various technologies allow scientists to interrogate biological systems on a spectrum of organizational levels. Genomic and transcriptomic data are now particularly prevalent in the biomedical and life sciences. Regardless of the level of biological organization, modern data tend to be high-dimensional. It is generally recognized that analyzing small tractable subsets of the data is insufficient for proper inference. In addition to understanding how data are collected and generated, biomedical researchers traditionally had to possess enough domain knowledge to formulate well-motivated hypotheses and guess the important factors driving a measurement of interest. Owing to the complex evolutionary and biological processes underlying biomedical data, these guesses are now harder to make. Thus, variable selection looms large on the agenda of most statisticians confronted with such data, as modern data analyses typically amount to estimating sparse and/or low-rank structures from the data. Our focus is on sparsity in classification. Imposing low-rank structure is generally less scientifically informative than sparsity, which directly exposes the active predictors explaining a scientific phenomenon.

A few interesting examples of classification tasks in biomedical sciences include identifying gene expression patterns in microarrays of various human cancers (Dettling and Bühlmann, 2002; Guyon et al., 2002; Dettling, 2004; Weigelt et al., 2010), tumor prediction from imaging and histopathology data (Fuchs and Buhmann, 2011; Campanella et al., 2019), and disease diagnosis using electronic medical records (Xiao et al., 2018; Wang et al., 2019; Damotte et al., 2019). Our enumeration of applications and cited works is far from exhaustive. In each of the named applications, it is of interest to identify those predictors driving classification as well as to fit a robust classifier. Fortunately, biomedical scientists often have prior knowledge about which and how many predictors are linked to a response variable. For example, mutations in human genes such as the tumor suppressors BRCA1 and BRCA2 are associated with increased risk of breast cancer in women (Miki et al., 1994; Wooster et al., 1995). A tumor’s gene expression profile may also contain information about potential therapeutic targets, such as the ERBB2 gene coding for receptor tyrosine-protein kinase erbB-2 (HER2 receptor) that is overexpressed in several classes of cancers, including subtypes of breast, ovarian, gastric, and lung cancers (Slamon et al., 1989; Konecny et al., 2003; Yan et al., 2015; Oh and Bang, 2020). Life science data are not uniquely high dimensional, but they serve as compelling motivation for investigating variable selection (Bühlmann et al., 2014).

Classification is undergirded by a rich literature and a variety of practical methods. Support Vector Machines (SVM) stand out in the universe of classification algorithms for their simplicity, interpretability, speed, and theoretical guarantees (Smola and Schölkopf, 1998; Vapnik, 1999). Although the value of SVM has been bolstered by fundamental advances in statistical theory and computer hardware, it is also challenged by the growing size of modern datasets. The Achilles heel of SVMs is their practical limitation to binary classification. Heuristics, such as one-versus-rest and one-versus-one, are common in software implementations (Fan et al., 2008; Chang and Lin, 2011). Given data containing multiple classes, these strategies rely on multiple binary SVMs to discriminate between classes. Regardless of the heuristic used, fitting multiple classifiers separately increases training time, obscures statistical interpretation, and violates the principle of parsimony. Moreover, it may be difficult to ascertain which heuristic is most appropriate for problems where one must balance training time against prediction quality. For example, one-versus-rest may result in a combined classifier with higher error rates than one-versus-one (Allwein et al., 2000) unless care is taken to carefully tune hyperparameters (Rifkin and Klautau, 2004). The interested reader may consult Lauer and Guermeur (2011) for a succinct summary of multiclass SVMs. A true multiclass SVM was developed by van den Burg and Groenen (2016). We will have more to say about this advance later.

In contrast, more traditional approaches in the literature successfully address multiclass data via a single model. Discriminant analysis is notable for its flexibility in both dimension reduction and classification. In particular, several theoretical and algorithmic extensions to sparse Linear Discriminant Analysis (LDA) were developed in the last decade; see for example Witten and Tibshirani (2011), Clemmensen et al. (2011), and Gaynanova et al. (2016). Sparse LDA in its many guises has the added benefit of associating predictors to classes. The class of $[eqn]$ -regularized SVMs also has this capability under the one-versus-rest heuristic. In principle, addressing multiple classes directly and selecting predictors with strong associations to classes should improve prediction quality.

Vertex Discriminant Analysis (VDA) squarely addresses multiclass data (Lange and Wu, 2008), is compatible with a variety of penalties for variable selection (Wu and Lange, 2010), and can be extended to nonlinear classification (Wu and Wu, 2012). VDA is closely related to SVMs in the multiclass setting. Both GenSVM, the multiclass SVM of (van den Burg and Groenen, 2016), and VDA use vertex encodings based on $[eqn]$ -simplices to represent $[eqn]$ classes, apply decision rules based on proximity to vertices, and thus avoid ambiguous predictions compared to heuristic SVM methods (van den Burg and Groenen, 2016). The primary difference between VDA and GenSVM, as classifiers, is in how each relates a linear prediction $[eqn]$ to the vertex set $[eqn]$ . VDA assigns each instance $[eqn]$ with a vertex label $[eqn]$ and computes distances between $[eqn]$ and $[eqn]$ , whereas GenSVM projects $[eqn]$ onto vectors $[eqn]$ normal to hyperplanes separating each pair of classes $[eqn]$ and $[eqn]$ . An $[eqn]$ -penalized version of GenSVM might be able to select features via shrinkage, but it is unclear how one can extend the method to class-specific variable selection. We demonstrate how to adapt VDA so that it chooses a vertex encoding and leverages distance penalization (Chi et al., 2014) for this purpose.

Our main contributions are as follows:

We improve previous VDA algorithms by deriving a smooth, quadratic surrogate function. The Majorization-Minimization (MM) principle guarantees the descent property at each iteration and justifies the use of our local quadratic model. Minimization of the surrogate reduces to fast least squares.For variable selection, we dispense with convex shrinkage-based penalties and instead impose explicit nonconvex $[eqn]$ constraints on the VDA loss. Distance majorization allows us to fold sparsity constraints into the loss as a convenient distance penalty. This action yields a greedy algorithm for variable selection. In particular, we illustrate the flexibility of our approach by showing how structured sparsity constraints, such as in class-specific variable selection, are readily incorporated in model fitting.We provide empirical evidence of reliable support recovery in simulated and benchmark datasets. Our comparisons to previous VDA algorithms facilitate comparisons to several other classification and variable selection schemes. We also apply our method to a large real dataset on cancer gene expression.

The remainder of this text is organized as follows. Section 2 presents a smooth loss model for VDA that is readily approximated by a quadratic surrogate in the MM framework, reducing model fitting to solving a sequence of least squares problems. We then introduce distance penalization as a means to enforce constraints, particularly nonconvex $[eqn]$ -type constraints. Section 3 comments on implementation, algorithmic convergence, cross validation, and other practical concerns. Section 4 describes and summarizes our numerical experiments on both simulated and real-world data. We offer a few concluding remarks on sparse VDA and other $[eqn]$ -type estimators in Section 5.

Proposed methods

As a prelude to introducing our generic method, we first review VDA with 𝜖-insensitive pseudo-distances and derive a quadratic surrogate that both simplifies the associated empirical loss and converts minimization to iterative least squares. Next, we present a distance penalized version of VDA by introducing sparsity sets as a mechanism for variable selection. The section concludes with a brief remark on distance penalties applied to nonlinear VDA.

Here we clarify the notation used throughout this article. We restrict our attention to problems formulated over Euclidean space $[eqn]$ . Vectors are written in lowercase boldface, for instance $[eqn]$ , and matrices are written in uppercase boldface, for instance $[eqn]$ . The notation $[eqn]$ indicates the Euclidean norm on the ambient space. In some cases we decorate a norm with a subscript, for example the to indicate the number Frobenius norm $[eqn]$ , to avoid ambiguity. Through abuse of notation, we may write $[eqn]$ of nonzero components in a vector $[eqn]$ . The superscript $[eqn]$ indicates a vector or matrix transpose. The lowercase letter $[eqn]$ denotes the number of selected parameters; equivalently, the lowercase letter $[eqn]$ is reserved for the overall sparsity level. The maximum number of parameters $[eqn]$ is understood from context. We reserve the bold symbols $[eqn]$ for denoting standard basis vectors in $[eqn]$ . In simulations we decorate objects representing the ground truth with a subscript asterisk, $[eqn]$ . Finally, the letters $[eqn]$ and $[eqn]$ are reserved for inner iterations and outer iterations, respectively. For example, $[eqn]$ is an estimate at iteration $[eqn]$ to the problem $[eqn]$ parameterized by a penalty coefficient $[eqn]$ , whereas $[eqn]$ denotes one of its solutions, either exact or approximate, at level $[eqn]$ and so on.

Majorizing ϵ-insensitive empirical risk

2.1.

Given a dataset with $[eqn]$ samples, $[eqn]$ predictors, and $[eqn]$ categories, the goal in VDA is to minimize the empirical risk

[eqn]

Here $[eqn]$ is the feature vector for sample $[eqn]$ , $[eqn]$ is the sample’s label encoded as a vertex of a standard simplex in $[eqn]$ , and the entries of $[eqn]$ are regression coefficients to be estimated. When $[eqn]$ , the simplex is a line segment, an equilateral triangle, or a regular tetrahedron, respectively. Note that our encoding of classes in vertex space has an extra dimension compared to the more parsimonious parameterization used in previous VDA models (Lange and Wu, 2008; Wu and Lange, 2010; Wu and Wu, 2012). For example, when $[eqn]$ the vertices of an equilateral triangle in $[eqn]$ can be projected to $[eqn]$ to reduce the number of free parameters. We deliberately choose the overparameterized version as it will be important later in developing class-specific variable selection. Specifically, we restrict vertex encodings to standard basis vectors $[eqn]$ .

In any case, the predicted value for sample $[eqn]$ ’s class vertex $[eqn]$ is $[eqn]$ . We collect the rows $[eqn]$ into a feature matrix $[eqn]$ and the rows $[eqn]$ into a class matrix $[eqn]$ . To fit $[eqn]$ to $[eqn]$ , we minimize the squared 𝜖-insensitive risk function

[eqn]

The residual $[eqn]$ for sample $[eqn]$ incurs no loss when $[eqn]$ lies within a Euclidean $[eqn]$ -ball around its assigned vertex $[eqn]$ . Otherwise, the residual contributes the reduced loss $[eqn]$ to the risk. If $[eqn]$ denotes a collection of $[eqn]$ class vertices and $[eqn]$ an estimated coefficient matrix, then an unassigned feature vector $[eqn]$ is assigned to a class through the map

[eqn]

singling out the closest vertex. Inclusion of an intercept is beneficial when class numbers are unbalanced, and easily is achieved by adding an intercept row to $[eqn]$ and a corresponding constant column $[eqn]$ to $[eqn]$ .

The squared $[eqn]$ -insensitive risk, $[eqn]$ , is a differentiable alternative to its un-squared counterpart used in previous formulations of VDA (Wu and Lange, 2010). To leverage second-order information, we construct a quadratic surrogate in piecewise fashion based on the majorization (Lange, 2016)

[eqn]

This complicated formula is an example of distance majorization (Chi et al., 2014). Indeed, $[eqn]$ is best understood as the distance from $[eqn]$ to $[eqn]$ , the ball of radius $[eqn]$ centered at the origin. Easy calculations yield the projections $[eqn]$ when $[eqn]$ and $[eqn]$ when $[eqn]$ . In applying this majorization term-by-term to our VDA model (1), we set $[eqn]$ and $[eqn]$ based on the residual of sample $[eqn]$ . Thus, each term in the surrogate for our empirical risk model equals $[eqn]$ . This leads to the quadratic surrogate

[eqn]

where $[eqn]$ predicts $[eqn]$ and $[eqn]$ is replaced by $[eqn]$ with rows

[eqn]

[eqn]

In summary, our surrogate can be interpreted as an ordinary sum of squares with shifted responses.

Distance penalties, sparse projection, and variable selection

2.2.

Having derived a quadratic surrogate for the empirical risk (1), we now develop our approach to variable selection. Following previous work on proximal distance algorithms (Chi et al., 2014; Xu et al., 2017; Keys et al., 2019), we penalize $[eqn]$ by the squared Euclidean distance of $[eqn]$ from a constraint set $[eqn]$ . This creates the objective

[eqn]

for $[eqn]$ large. If $[eqn]$ denotes the Euclidean projection of $[eqn]$ onto the closed set $[eqn]$ , then we once again invoke the distance majorization $[eqn]$ , which can be combined with the surrogate (2) to yield

[eqn]

Here we understand $[eqn]$ as $[eqn]$ from context. Minimizing the penalized objective (3) via the surrogate (4) seeks a compromise between the empirical risk and the distance between estimated coefficients and their projection over the constraint set $[eqn]$ . The strength of the proximal distance framework is now apparent as it allows one to incorporate various constraints provided a projection onto the constraint set $[eqn]$ , $[eqn]$ , is relatively inexpensive to evaluate. The sublevel sets $[eqn]$ , $[eqn]$ , and $[eqn]$ corresponding to Euclidean, lasso, and sparsity penalties are natural candidates for variable selection. Let us elaborate on the interesting case of sparsity constraints.

In classification with $[eqn]$ features, one may attempt to select important variables by setting rows of the slope matrix $[eqn]$ to zero so that feature $[eqn]$ does not contribute to prediction; that is, $[eqn]$ . With this structure in mind, the structured sparsity set $[eqn]$ consists of those slope matrices with at most $[eqn]$ nonzero rows

[eqn]

Projection onto $[eqn]$ is achieved by ranking features according to their norms $[eqn]$ and then identifying the top $[eqn]$ features via a partial sort, which requires $[eqn]$ comparisons once given the row norms. Omitted rows are then set to $[eqn]$ . When an intercept $[eqn]$ is included in the model, the intercept is usually omitted in the constraint set. The intercept row is thus ignored in projection. In summary, projection onto a sparsity set amounts to computing order statistics on the row norms, $[eqn]$ , and thresholding rows to select the highest ranking features.

Implicit in the preceding discussion is the assumption that all selected features are equally important in discriminating classes. Group penalties, such as the group lasso, come to mind in relaxing this assumption. We consider grouping coefficients into columns of the slope matrix $[eqn]$ , each of which is associated to particular class. The sparsity set

[eqn]

restricts each column of a slope matrix $[eqn]$ to have at most $[eqn]$ nonzero components. In the context of VDA, the structure implied by $[eqn]$ is interpreted as selecting features associated with a class. One can further generalize by imposing a different model size restriction on each column. The logical extreme is the set

[eqn]

which simply selects $[eqn]$ components of $[eqn]$ without regards to row or column groupings of coefficients. Note that when $[eqn]$ or $[eqn]$ the hyperparameter $[eqn]$ is naturally limited to values in $[eqn]$ whereas the case $[eqn]$ has $[eqn]$ take values in $[eqn]$ .

To conclude this section, we note that the special cases of sparsity constraint sets $[eqn]$ , $[eqn]$ , and $[eqn]$ reflect assumptions about the number of important features and class heterogeneity. The ideas presented here can easily be extended to Euclidean or lasso penalties on rows or columns via distance penalties. A notable benefit of the distance penalty approach is that it avoids differentiability issues; the penalty $[eqn]$ is not differentiable whereas $[eqn]$ , the distance to the centered $[eqn]$ ball of radius $[eqn]$ , is smooth. Our numerical experiments in later sections consider comparisons to lasso-type penalties.

Extension to nonlinear VDA

2.3.

Linear VDA for feature selection easily generalizes to nonlinear VDA based on reproducing kernels. Construction of decision boundaries in sparse nonlinear VDA operates by selecting representative samples called support vectors in the SVM literature (Wu and Wu, 2012). Here we focus on defining and solving the optimization problem generated by nonlinear VDA. The goal is to estimate a classification vector $[eqn]$ by minimizing the objective $[eqn]$ . Given a reproducing kernel $[eqn]$ , $[eqn]$ assumes the functional form

[eqn]

in predicting the vertex components from a feature vector $[eqn]$ . Model parameters $[eqn]$ , including the intercepts $[eqn]$ , can be estimated by minimizing by criterion (3) with $[eqn]$ substituted for $[eqn]$ . Nonlinear prediction is achieved through the introduction of the reproducing kernel. The underlying model is still linear in its parameters. Explicitly, the loss and surrogate functions we use in the nonlinear case are

[eqn]

[eqn]

Sparsity on the rows of $[eqn]$ , induced by $[eqn]$ , select at most $[eqn]$ instances that contribute to a decision boundary. Alternatively, the heterogeneous group penalty $[eqn]$ allows VDA to select avatars specific to each group.

Algorithms

Let us briefly recall the MM principle (Lange, 2016) applied to the penalized objective (3). Given an anchor point $[eqn]$ , the surrogate (4) majorizes (3) so that $[eqn]$ holds for every $[eqn]$ in the essential domain of the penalized objective. At each iteration we select the unique stationary point $[eqn]$ minimizing the surrogate (4) around the anchor point $[eqn]$ . These observations combined with the tangency condition reveal the chain of inequalities

[eqn]

Thus, for fixed $[eqn]$ , minimizing surrogate (4) guarantees the descent property and drives the penalized loss (3) downhill. Alternatively, it is enough to decrease $[eqn]$ in cases where exact minimization is expensive and still preserve the logic of the inequalities above. The next two sections identify specific iteration schemes based on these principles. Additional details appear in Section A of the Supplementary Materials.

Direct minimization

3.1.

The directional derivative of the surrogate (4) in the direction $[eqn]$ is

[eqn]

At a stationary point $[eqn]$ vanishes for all $[eqn]$ . It follows that

[eqn]

and that the $[eqn]$ update is

[eqn]

To fit $[eqn]$ across a spectrum of values for $[eqn]$ and $[eqn]$ , we extract the thin singular value decomposition (SVD) $[eqn]$ and invoke the Woodbury matrix identity

[eqn]

This maneuver reduces the required matrix inversion to inverting a single diagonal matrix that depends on (a) the singular values $[eqn]$ and (b) the values of $[eqn]$ and $[eqn]$ . This inverse only needs updating when $[eqn]$ changes. Although the required thin SVD costs $[eqn]$ operations, where $[eqn]$ , it is done only once. Thus, minimization of the penalized objective (3) via exact minimization of surrogate (4) is limited by the time and space complexities of one thin SVD and several matrix-matrix multiplication operations.

Steepest descent

3.2.

One can take advantage of the quadratic form of (4) to derive a step size for gradient descent. The Taylor expansion

[eqn]

is exact. In the particular case $[eqn]$ , elementary calculus leads to the steepest descent update

[eqn]

where $[eqn]$ is defined by formula (10).

Convergence criteria

3.3.

Minimizing the penalized loss (3) with $[eqn]$ large is generally slow but is required to achieve optimality (Beltrami, 1970). We surmount the poor convergence characteristics of the penalty method by solving a sequence of subproblems (3) parameterized by an annealing schedule $[eqn]$ . The idea is to choose an annealing schedule with $[eqn]$ small (for example, $[eqn]$ ) and gradually increasing $[eqn]$ to a large maximum value (say 10^6^ to 10^8^). Unfortunately, the quality of the solution to the inner problem of minimizing the objective (3) for fixed $[eqn]$ affects the quality of the outer solution where $[eqn]$ reaches its maximum. It is therefore crucial to assess the quality of the inner solutions. These issues merit a brief discussion of convergence criteria.

Our previous work relied on the condition

[eqn]

where $[eqn]$ is a control parameter that sets a minimum level of progress per inner iteration on a relative scale (Keys et al., 2019; Landeros et al., 2022). In retrospect, the gradient condition

[eqn]

is safer because convergence can be quite slow for large $[eqn]$ even under Nesterov acceleration. The tangency condition of the MM principle conveniently allows one to compute the gradient $[eqn]$ through the equivalent expression $[eqn]$ . Controlling the distance term across outer iterations $[eqn]$ yields an approximate solution $[eqn]$ . Thus, we impose conditions on the distance penalty

[eqn]

where $[eqn]$ quantifies distance to the sparsity set. Here the control parameter $[eqn]$ defines a satisfactory level of closeness to $[eqn]$ , and the dual criteria involving $[eqn]$ avoid excessive computing if progress slows dramatically. Our choices allow one to gradually guide solution estimates towards the constraint set while avoiding slow convergence due to large values of $[eqn]$ . It is worth emphasizing that convergence is declared when both the gradient condition and the closeness condition are satisfied, regardless of whether $[eqn]$ has hit its maximum.

Proximal distance iteration

3.4.

Algorithm 1 summarizes the general minimization procedure in proximal distance iteration, including an early exit condition on the annealing path and Nesterov acceleration. Warm starts are implicit in Algorithm 1 as one moves from a $[eqn]$ -penalized subproblem to the next subproblem. The final projection step is justified when an approximate solution $[eqn]$ is close to a sparsity set $[eqn]$ in Euclidean distance; that is, when feature selection finally stabilizes.

Let us now briefly address the theoretical convergence properties of Algorithm 1. The sparsity constraints described in Section 2.2 are defined by closed but nonconvex sets. In principle, projection operators onto nonconvex can be multivalued. For example, projection of the point (5, 2, 2, 3) onto the sparsity set $[eqn]$ is nonunique due to coordinate ties. Fortunately, the theoretical results from Keys et al. (2019) buttress Algorithm 1 under Zangwill’s global convergence theorem when the penalty coefficient $[eqn]$ is fixed (Luenberger, 1984). First, $[eqn]$ is closed for any choice of $[eqn]$ and nonempty. Second, the $[eqn]$ -insensitive loss $[eqn]$ is coercive whenever the predictor matrix $[eqn]$ has full rank. This implies the penalized loss $[eqn]$ is also coercive. Finally, observe that the surrogate generated by distance majorization is related to the possibly multivalued proximal mapping (Parikh, 2014) by

[eqn]

Because set projections preserve closedness, it suffices to show that the projection operator $[eqn]$ is single-valued almost every-where to satisfy the hypothesis of Proposition 8 from Keys et al. (2019). For fixed $[eqn]$ and $[eqn]$ this guarantees that all limit points of the proximal distance iterates are stationary points, $[eqn]$ . In fact, convexity of the loss $[eqn]$ guarantees that every surrogate $[eqn]$ is $[eqn]$ -strongly convex. The stronger conclusion of Proposition 11 from Keys et al. (2019) applies and allays concerns regarding selection of projected points when $[eqn]$ is set-valued.

Tuning hyperparameters and probing stability

3.5.

Before discussing implementation specifics, we note that the choice of sparsity set, either $[eqn]$ or $[eqn]$ as defined in (5) and (6), functions as a hyperparameter determining the set of selected variables. Changing the size of the target feature set $[eqn]$ fundamentally changes the nature of the optimization problem. In contrast, perturbing the penalty constant in a lasso penalized model merely shifts or perturbs parameter estimates along a solution path (Hastie et al., 2001). Thus, model size is only indirectly determined by the penalty constants.

Because a causal or optimal feature set is rarely known in advance, the effectiveness of our sparse VDA algorithms hinges on generating candidate feature sets and selecting optimal models. Cross validation is a practical way to evaluate the classification error of candidate models. We use a nested procedure as advocated by Cawley and Talbot (2010) which gives a realistic, albeit imperfect, assessment of model performance. For a given dataset $[eqn]$ , we implement cross validation by partitioning the samples into two subsets, one for $[eqn]$ -fold cross validation (the tuning phase) and one for evaluating out-of-sample prediction (the test phase). The former subset is further split into a training set, used to fit models under different hyperparameter settings, and a validation set, used to identify an optimal set of hyperparameters. Samples in the latter testing set never appear in the cross validation procedure. In the tuning phase, we propose a pair $[eqn]$ , where $[eqn]$ is a kernel scale parameter in the nonlinear case, by sampling from a grid and fit a VDA classifier. Note that in the linear case we simply set $[eqn]$ . Each VDA classifier is then evaluated on a validation set so that a straightforward grid optimization can select the optimal pair based on the $[eqn]$ -fold cross validation scores. Given an optimal pair $[eqn]$ , we then sample the model size hyperparameter $[eqn]$ from a grid and fit a sparse VDA classifier. Once again, we select an optimal model size $[eqn]$ via grid optimization on classification accuracy. Ties in grid optimization should favor smaller values of $[eqn]$ ; that is, we select the more parsimonious model. Crucially, the same training and validation sets are used for tuning the two sets of hyperparameters. Fig. 1 summarizes our entire cross validation pipeline.

Repeated cross validation can be used to address stability in hyperparameters, the selected model size $[eqn]$ , model parameter estimates, and the subset of selected features. Although the amount of computation involved seems onerous, our experience suggests that fast convergence of our VDA algorithms, even in the path following phase, is enough to make our proposed approach viable on a broad range of datasets.

Numerical experiments

In this section we demonstrate the feature selection capabilities of our VDA algorithms leveraging the proximal distance principle. Our first examples are simulation studies demonstrating how VDA reliably recovers sparse models without inflating classification errors. Next, we apply sparse VDA to classical microarray expression data (Dettling and Bühlmann, 2002) and compare the results to VDA subject to shrinkage-based penalties as described by Wu and Lange (2010). The remainder of this section benchmarks linear and nonlinear sparse VDA for datasets from the UCI Machine Learning Repository (Dua and Graff, 2019) and Elements of Statistical Learning (Hastie et al., 2001). We highlight how class-specific variable selection improves classification accuracy and produces tighter interval estimates of the true number of relevant features, particularly in fitting sparse VDA to RNA expression profiles of tumors in The Cancer Genome Atlas (TCGA) data. Table 1 summarizes the real datasets used for our benchmarks. Examples drawn from the UCI Machine Learning Repository are not high-dimensional but well-known, well-understood, and thus serve as sanity checks. High-dimensional examples are drawn exclusively from cancer datasets. Additional details on the real and simulated data are available in Section B of the Supplementary Materials.

Our empirical studies compare sparsity-based and shrinkage-based approaches to variable selection. Specifically, we test VDA with the homogeneous, $[eqn]$ , and heterogeneous, $[eqn]$ , group variable selection constraints (5) and (6), respectively and compare with their convex analogues

[eqn]

[eqn]

and

[eqn]

Although $[eqn]$ differs significantly from $[eqn]$ , both constraint sets assume a given feature plays a similar role across groups. We refer to VDA with constraints (5) and (6) as VDA_HomL0_ and VDA_HetL0_, respectively. Here the abbreviations “Hom” and “Het” emphasize whether the model assumes homogeneous or heterogeneous classes. Similarly, VDA with constraints (13), (14), and (15) are dubbed VDA_L1_, VDA_HomL1_, and VDA_HetL1_, respectively. Lastly, we note that nonlinear VDA uses a Gaussian kernel (Radial Basis Function Kernel) in all our examples.

Simulation studies of sparse recovery

4.1.

Let us first investigate the numerical performance of sparse VDA methods, VDA_HomL0_ and VDA_HetL0_, using simulated data. In both cases, we consider $[eqn]$ predictors with multivariate normal distribution, $[eqn]$ , such that the covariance matrix $[eqn]$ follows a Toeplitz structure, $[eqn]$ . Of the 1000 predictors, only a small subset $[eqn]$ of size $[eqn]$ factors into establishing the ground truth for class labels. Specifically, we randomly sample a slope matrix $[eqn]$ with exactly $[eqn]$ nonzero rows, each corresponding to an informative variable, and assign classes via the decision rule

[eqn]

where we remind the reader that $[eqn]$ is the standard simplex in $[eqn]$ with standard basis vectors, $[eqn]$ , as vertices. The vertex label for each instance $[eqn]$ is denoted $[eqn]$ and is arranged as the $[eqn]$ row in a vertex label matrix $[eqn]$ . Unfortunately, $[eqn]$ is not necessarily the best fitting slope matrix since each prediction $[eqn]$ has expectation $[eqn]$ . We remedy this defect by computing a sparse shift matrix $[eqn]$ which satisfies the linear system

[eqn]

over the support set $[eqn]$ and update $[eqn]$ as $[eqn]$ . This modification shifts the expectation of $[eqn]$ towards the associated label $[eqn]$ . The noise matrix $[eqn]$ with independent and identically distributed entries $[eqn]$ allows us to control the extent to which $[eqn]$ and its support $[eqn]$ reflect the ground truth. Our simulations compute $[eqn]$ via its relationship to the Signal-to-Noise Ratio (SNR)

[eqn]

in which $[eqn]$ is column 𝑗 of the vertex label matrix $[eqn]$ .

Our simulation studies vary the number of samples $[eqn]$ , the number of classes $[eqn]$ , the Toeplitz matrix parameter $[eqn]$ , and the $[eqn]$ . These choices help us investigate the performance of VDA-based classifiers across a diverse set of scenarios, which consider underdetermined problems, correlations between predictive and noise variables, and over-lapping classes. To illustrate the predictive and variable selection capabilities of VDA with sparsity set constraints, our benchmarks report the following metrics

the true positive rate based on the ground truth $[eqn]$ established by our simulation model,the ratio of true positives to predicted positives (positive predictive value),out-of-sample classification errors based on a test set with 1000 samples, andtiming results that reflect total time spent to identify a support and fit a final model.

Under the assumption that a sparse support exists, our aim is to show that sparse linear VDA models reliably identify a small set of informative predictors while accurately predicting classes. By considering underdetermined and overdetermined regimes under different SNR and correlation levels, our results suggest a degree of consistency in recovering sparse supports in interesting and challenging scenarios pertinent to biomedical science applications.

Homogeneous classes

4.1.1.

We first consider structured sparsity in a setting in which distinct classes are predicted by the same set of features. In this setting, we say that classes are homogeneous in the sense that they share a common support set $[eqn]$ such that the linear prediction $[eqn]$ reflects the ground truth under the VDA model. We compare VDA with the sparse, nonconvex HomL0 penalty and shrinkage-based, convex L1 penalty against Multi-Group Discriminant Analysis (MGSDA) (Gaynanova et al., 2016) and a L1-regularized SVM (L1R-SVM) as implemented in LIBLINEAR (Fan et al., 2008). Notably, sparsity in MGSDA relies on shrinkage due to a Euclidean norm penalty rather than a lasso. For the VDA models, we consider a predictor $[eqn]$ to be selected if its corresponding row $[eqn]$ in the fitted slope matrix $[eqn]$ has nonzero norm. MGSDA and L1R-SVM have analogous selection criteria, as established in relevant literature and documentation. The reader should note that our VDA methods are implemented in the Julia programming language, whereas as MGSDA is provided as an R package and LIBLINEAR is a C library accessed via a Julia wrapper.

Table 2 reports timing results and classification error rates for simulated classification problems under the homogeneous class assumption. The VDA timing results suggest an approximately linear relationship with the number of samples. This is expected behavior since matrix multiplication by $[eqn]$ is among the most frequently occurring operations in our algorithms. The convex version of sparse VDA, VDA_L1_, is often the fastest method overall. Results for MGSDA mostly align with the VDA models except in the cases with $[eqn]$ and $[eqn]$ which we attribute to a larger number of iterations taken to fit the multiclass data. L1R-SVM appears as the slowest method due to it fitting multiple binary SVMs; this is an embarrassingly parallel problem and thus is easily improved. Readers should not draw strong inferences from these relative rankings as the results may easily change were one to implement every method in a common programming language and a common software design. Yet, the speed advantage of a single classifier (VDA and MGSDA) over a multi-classifier approach (SVM) is reflected in Table 2 as is well-established in literature. The apparent linear scaling of VDA with n will break down in the $[eqn]$ regime as the required singular value decomposition of $[eqn]$ will come to dominate the cost of our MM algorithms, specifically as the storage costs for $[eqn]$ and its decomposition overwhelm memory resources on a computer. A similar claim holds for 𝑝 and the $[eqn]$ regime. Hence, the current implementation of sparse VDA is limited by the cost of extracting a thin SVD.

Error rates in Table 2 are crucial in establishing validity of our simulation study. In the absence of theory, we require at a minimum that error rates decrease towards 0 as the number of samples n and the SNR increase. For example, the median classification error for MGSDA decreases from 57% to 0% as SNR increases from 0.1 to 10 in the setting with $[eqn]$ , $[eqn]$ , and $[eqn]$ . The trend also holds moving from $[eqn]$ to $[eqn]$ across each scenario. Increasing SNR in the underdetermined regime $[eqn]$ has apparently less influence on classification accuracy. This is expected because the underdetermined regime represents a challenging scenario in which classifiers are prone to overfitting. Surprisingly, the advantage of sparse VDA reemerges if one considers SNR and $[eqn]$ simultaneously. With $[eqn]$ and $[eqn]$ fixed, VDA_HomL0_ improves median error rates from 66% to 0% as both SNR and $[eqn]$ increase whereas error rates remain mostly unchanged for VDA_L1_, MGSDA, and L1R-SVM. In fact, increasing $[eqn]$ alone appears to improve classification error rates. This result can be understood by considering that, while noise variables become strongly correlated with informative predictors, some informative predictors also become strongly correlated with one another.

Fig. 2 and Fig. 3 aid us in validating the previous claim. Fig. 2 reports the true positive rate with respect to the ground truth sparse support. The true positive rate for all classifiers improves as both n and SNR increase as one might anticipate. This sanity check is reflected in Fig. 2 by noting the progressions along the x-axis (left to right), from circles to triangles, and from diamonds to stars in each subplot. The Euclidean and lasso penalties have an edge over our HomL0 method in terms of identifying true positives (see blue circles and triangles). However, Fig. 3 clarifies the advantage of the $[eqn]$ approach. Examining the positive predictive value of each classifier aids us in distinguishing between a method that may incorporate noise variables to improve prediction over those that recover the ground truth support. VDA with sparsity constraints, VDA_HomL0_, is appreciably more sensitive and specific than competing methods in identifying the ground truth sparse support. For example, the underdetermined regime $[eqn]$ is challenging for VDA_HomL0_ when the SNR equals 0.1 (Figure 3A, circles and diamonds). Yet, our method is aided by increasing the correlation strength between all predictors from $[eqn]$ to $[eqn]$ . In the overdetermined regime $[eqn]$ , a high degree of correlation obscures support recovery unless the underlying signal is strong (Figure 3A, 3B, orange markers). Comparing VDA_HomL0_ against VDA_L1_ further clarifies these results as the two methods differ only in the choice of penalty. The L1 version of VDA shrinks the fitted slope matrix $[eqn]$ so that the linear predictions based solely on the ground truth support $[eqn]$ , $[eqn]$ , are closer to the origin rather than the underlying simplex vertices. Thus, spurious predictors enter into the model’s predictions to compensate. In contrast, VDA_HomL0_ avoids shrinkage and does not incorporate spurious predictors through the same mechanism. Cross-referencing Fig. 3 against Table 2 suggests that our sparse VDA classifiers are both parsimonious and predictive. The difference in performance is stark in comparing error rates in the underdetermined regime with correlated predictors, a setting that is relevant to genomic applications. Notably, the shrinkage-based methods maintain an advantage in prediction accuracy when the signal-to-noise ratio is low (SNR = 0.1).

Heterogeneous classes

4.1.2.

Next, we exploit the flexibility of distance penalties to investigate class-specific variable selection. Our simulations now consider heterogeneous classes in the sense that each class has a possibly distinct set of predictive variables. Given a size $[eqn]$ of true, informative variables we randomly assign $[eqn]$ variables to each class so that $[eqn]$ is close to vertex $[eqn]$ whenever instance $[eqn]$ belongs to class $[eqn]$ . The simulation model is identical to that of the homogeneous setting except for the assignment of informative variables to each class. We compare four VDA models with distance penalties based on HomL0 (Equation (5)), HetL0 (Equation (6)), L1 (Equation (13)), and HetL1 (Equation (15)). Table 3 reports timing results and classification error rates which mirror that of Table 2. The shrinkage-based penalties, L1 and HetL1, are superior to our sparsity-based penatlies in the low SNR setting. However, the shrinkage-based VDA methods generally select additional variables outside the ground truth support as reflected in Fig. 4 and Fig. 5. In particular, Fig. 5 illustrates that aggregate positive predictive value of our proposed class-specific method, VDA_HetL0_, generally outperforms the structured and unstructured shrinkage methods VDA_HetL1_ and VDA_L1_, respectively. The HomL0 penalty explicitly assumes a homogeneous class structure and is thus prone to selecting extraneous predictors within each class.

Comparison with previous VDA methods

4.2.

Our next set of experiments consider applications to cancer microarray data, where it is common for the number of features to dominate the number of samples, $[eqn]$ . Here Table 4 reports results for distance-penalized linear VDA classifiers across 6 datasets for different cancers previously analyzed by Dettling and Bühlmann (2002). The leukemia, prostate, and colon datasets involve two classes while the rest involve multiple classes. Our error rates are similar to previous results for VDA_LE_ (Wu and Lange, 2010), which uses a hybrid penalty combining lasso and Euclidean penalties, and hierarchical clustering (Dettling and Bühlmann, 2002). Although our cross validation errors are not directly comparable to the best leave-one-out errors noted in Table 7 of Dettling and Bühlmann (2002), we succeed in selecting models with fewer active genes than reported in Table 9. The left-skewed distributions in Fig. 6 support this contention in spite of the wide ranges reported for brain and prostate. While the selected subset size distribution for the sparsity-based methods is wider compared to VDA_LE_, we see that class-specific variable selection yields tighter intervals and improves classification accuracy. VDA_LE_ outperforms our VDA models on the colon and prostate examples in part due to its loss function, which is robust to outliers by virtue of not squaring errors, and its penalty, which interpolates between lasso and Euclidean penalties.

Linear classifier benchmarks

4.3.

We now turn our attention to fitting linear classifiers to various datasets from the UCI Machine Learning Repository (Dua and Graff, 2019). In each example, we take care to fit VDA classifiers across a spectrum of possible model sizes $[eqn]$ that cover about 50% of the space on a logarithmic scale. We note that, except for the TCGA-HiSeq and HAR datasets, generating 50 cross validation replicates requires a few minutes of computing. Table 5 records our results, which includes comparisons to L1-regularized SVM using the One-Versus-Rest (OVR) approach to multiclass data. As a reminder, an SVM using OVR requires c SVMs to classify data on c classes. It is known that SVM-OVR is sensitive to hyperparameter tuning (Rifkin and Klautau, 2004) so we consider our classification results comparable to VDA. A merit of our sparse VDA algorithms is the fact that it deals with all class simultaneously in a single model and is thus easy to fit without parallel processing beyond BLAS and more interpretable.

A few examples are noteworthy. Interestingly, most of the smaller datasets such as iris can be fitted with fewer features. In contrast, the letters example consistently fits with all 16 features, which were selected by experts to aid in letter recognition. The features in the splice dataset are 60 sites on a DNA molecule occupied by a T, C, G, or A nucleotide, represented via one-hot coding per site. This gives a total of 240 features. Our sparsity-based VDA classifiers successfully identify sequences as intron-exon, exon-intron, or nonsplice junctions using only 10 to 35 sites. In the optdigits example, it is conceivable that features from the dataset, which summarize white pixel counts of 32 × 32 images along 4×4 blocks, are in fact redundant. This explains why our results indicate that roughly 55 to 62 out of 64 pixel values are needed to correctly discriminate most of the digits in the 1797 test samples. In the higher dimensional examples HAR and TCGA-HiSeq, we note that the class-specific, shrinkage-based method HetL1 struggles to eliminate features compared to our sparsity-based methods, HomL0 and HetL0. This is most pronounced in the RNA expression data where one expects only a small subset of genes to be relevant in each cancer and thus suggests that the sparsity-based VDA classifiers have better control over false positives compared to shrinkage-based methods. Note that we report more selected variables for HetL0 (516 genes) and HetL1 (6449 genes) compared to HomL0 (326 genes) because the former two select distinct, potentially non-overlapping sets for each class. Dividing the values for HetL0 and HetL1 by the number of classes $[eqn]$ suggests the shrinkage-based penalty is in fact selecting more variables across classes.

Lastly, we are pleased that sparse VDA performs well on the TCGA-HiSeq dataset. This dataset contains gene expression patterns in 801 tumors measured by RNA sequencing technology. Tumors are labeled as one of five types of cancers: breast invasive carcinoma (BRCA), kidney renal clear cell carcinoma (KIRC), colorectal adenocarcinoma (COAD), lung adenocarcinoma (LUAD), and prostate adenocarcinoma (PRAD). In this setting, it is expected that only a subset of genes serves to discriminate between cancer types. Our VDA algorithms reliably select anywhere between 57 to 2336 genes out of the entire set of 20264. To better understand how class-specific variable selection aids VDA in classification, we examine the selected gene subsets across 10 cross validation replicates. For a given subset within a class, we rank each gene by how frequently it is replicated in cross validation. As Fig. 7 illustrates, roughly 25 genes are selected consistently within each cancer class. It is likely that the less replicated genes are subject to some degree of correlation and are therefore harder to recover. Moreover, it is well-established that expression or nonexpression of a particular gene may have distinct consequences in different tissues and disease states. Thus, our interpretation of the selected subset is that the selected genes are merely useful in discriminating between the 5 cancer types using gene expression patterns and are not necessarily linked to causal mechanisms. The sign and magnitude of each gene’s slope coefficient may aid practitioners in generating interesting hypotheses for follow-up studies. Incidentally, many of the selected genes are designated as “cancer enhanced” or “cancer enriched” by The Human Protein Atlas (Uhlen et al., 2017). For example:

The transcription factor HAND2, which was selected in both breast (BRCA), colon (COAD), and lung (LUAD) cancers, is enriched in breast fibroblasts but not necessarily enriched in either cancer type.The transcription factor CDX1, selected in the colon adenocarcinoma cancer class (COAD), is expressed in the intestine and enriched in colorectal cancer.The gene product of NKX2-1, selected in the lung adenocarcinoma cancer class (LUAD), is expressed in lung tissue, for example alveolar cells, as well as thyroid gland tissue. It is enirched in both lung and thyroid cancers and specifically designated a cancer-related gene.In prostate adenocarcinoma (PRAD), the family of KLK genes KLK2, KLK3, and KLK4 are enriched in both the prostate and prostate cancer. They are designated as cancer-related genes as it is highly expressed in prostate tumors, but it is not a prognostic marker.

Nonlinear classifier benchmarks

4.4.

Next, we consider applications of sparse VDA to selecting the avatars in nonlinear, kernel-based VDA. Sparse models have fewer avatars. Table 6 summarizes our results. Nonlinear VDA is most successful in estimating a flexible decision boundary with few avatars in the simulated datasets clouds, circles, spiral, and spiral-hard datasets. The reported testing error interval for circles compares favorably to error rates previously reported by Wu and Wu (2012), even though it is larger than the irreducible error (20%) used in generating the data. The testing subset for dataset vowel is, by design, more difficult than the training data so we naturally expect a relatively higher classification error. Our improvement in median prediction error (5%) over the linear VDA classifiers in Table 5 (52%) on the vowel example is consistent with previous findings (Wu and Wu, 2012) for nonlinear VDA (VDA_K_). In each example, our sparse VDA classifiers select a few instances to serve as avatars driving a nonlinear decision boundary. Unlike the linear case, the use of class-specific avatar selection does not appear to improve nonlinear VDA prediction accuracy, as we attribute the improvement in the vowel dataset to tuning the kernel in nonlinear VDA.

Molecular subtypes in breast cancer

4.5.

Molecular characterization of various cancers developed over the last two decades to better understand biological processes underlying disease and identify potential therapeutic targets (Denkert and Loibl, 2022). Subtyping of breast cancer is particularly notable as it evolved from classic immunohistochemistry markers to prognostic tools incorporating genetic signatures to predict recurrence risk. Modern efforts seek enhance molecular subtypes by linking gene expression profiles, immune cell infiltration, and other data to response outcomes (Denkert and Loibl, 2022; Wolf et al., 2022). Two important aspects are to address the existence of multiple classes and to identify small, stable subsets of features for downstream analyses robust to various batch effects (Szymiczek et al., 2021).

In this example we focus on predicting PAM50 molecular subtypes (Parker et al., 2009) for $[eqn]$ breast cancer tumors in TCGA using $[eqn]$ genes quantified via RNA sequencing. The dataset, TCGA-BRCA, recently received updates to normalize data and to include various histologic annotations which are of interest in understanding rarer forms of the disease (Thennavan et al., 2021). We focus on the molecular subtypes, Basal-like, Luminal A, Luminal B, HER2, and Normal-like, which in turn are not rigid classes but instead exhibit their own heterogeneity. Application of our sparse VDA methods to this example illustrates the utility of incorporating sparsity constraints into VDA for the purpose of extracting small gene subsets in the presence of ambiguity. Table 7 reports that our group-specific variable selection, HetL0, identifies a small, stable set of genes across 10 cross validation replicates. In comparison, the HetL1 VDA struggles to select sparse sets of genes in spite of its slight edge over L1R-SVM in terms of accuracy. It should also be noted that the dataset has gross class imbalance due to the ill-defined Normal-like class, which is represented by a mere 40 samples. Class imbalance likely factors into the stability of selection. Table 7 also indicates that class-heterogeneous sparsity recovers a far smaller, stable set compared to its homogeneous counterpart. Both methods are considerably more conservative compared to HetL1 VDA at the expense of some classification accuracy. It is unclear in this example whether either of the three VDA approaches have an obvious advantage. Indeed, the results for L1R-SVM suggest that careful tuning may allow the HetL1 VDA achieve a sparser subset via shrinkage. Interested readers may consult the appendix for notes on preprocessing using TCGABiolinks (Colaprico et al., 2016; Mounir et al., 2019) and a brief summary of overlap with the PAM50 gene set are available in Appendix D.

Ancestry-informative markers in human populations

4.6.

Our final example applies sparse VDA to a subset of data from the 1000 Genomes Project (1kGP) (Genomes Project Consortium, 2015) for the purpose of extracting a sparse set of features predictive of human ancestry. The source data is based on the 2012 Omni Platform genotypes (pairs of alleles at various loci along DNA) restricted to unrelated human individuals (1,718 people) from 26 populations. Each included individual has a genotyping success rate of at least 95%. Data are further restricted to DNA sites with minor allele frequencies of at least 1%, resulting in 1,854,622 single nucleotide polymorphisms (SNPs) distributed across 22 autosomes to serve as features. The 1kGP represents an international effort to catalogue and analyze data on human variation for the purpose of understanding human genetic histories. In biomedical applications, adjusting for genetic ancestries is paramount as population stratification is known to induce false associations between SNPs, here treated as predictors, and medically-relevant traits, here treated as response variables (Li, 1969; Knowler et al., 1988; Marchini et al., 2004). Due to the size of the human genome, data on human variation at the DNA level are universally underdetermined. It therefore of interest to identify ancestry-informative markers (AIMs) (Shriver et al., 2003; Brown and Pasaniuc, 2014), small subsets of SNPs with strong correlations to ancestral groups, for the purpose of improving statistical estimation of admixture coefficients. A classical approach to ancestry adjustment involves extracting a few principal components of the SNP data to serve as predictors in an analysis. While successful, this approach is criticized for its weak interpretability and potential inefficienies in correcting bias (Lawson et al., 2020). Modern data mining techniques make it possible to extract AIMs to satisfy parsimony and interpretability requirements.

Our application of VDA to the 1kGP data focuses exclusively on Chromosome 1 in $[eqn]$ unrelated individuals and $[eqn]$ SNPs. Further, our analysis considers groupings of the 26 populations into 5 superpopulations: African, admixed American, East Asian, European, and South Asian. These superpopulations are not our own designations; they are available in the original data. Our demonstration is restricted to less than 10% of the original features. This is due to the SNP data being encoded as matrices with entries in {0, 1, 2}, understood as unsigned integers, for memory efficiency purposes. Linear Algebra Package (LAPACK) distributions do not provide subroutines for computing SVDs of such matrices, which are required by the present version of sparse VDA. Restricting our example to Chromosome 1 allows us to cast the data in double-precision floating point. Nevertheless, our results make a compelling case for the variable selection potential of sparsity constraints on data like the 1kGP; see for example a recent hard clustering approach due to Ko et al. (2023). Table 8 reports classification accuracy for HomL0, HetL0, and HetL1. The class-specific HetL0 VDA, which selects different subsets of variables for each class, has a slight edge in classification accuracy. However, as in Table 7, the size of the most stable subset of genes across cross validation replicates is considerably smaller for HetL0 compared to HomL0. This is likely an artifact due to our grid search heuristic which allows the features to vary across classes but imposes equal subset sizes in each group. Thus, while a greedy search based on explicit sparsity constraints may recover sparse models compared to shrinkage-based methods, this example demonstrates that the heuristic may introduce false positives in group variable selection.

Discussion and concluding remarks

The novelty of the current version of VDA stems from the application of 𝜖-insensitive loss, a new majorization, distance penalties, and projection onto sparsity sets. Sequential minimization via the surrogate function (4) reduces to least squares with shifted responses for fitting VDA classifiers.

Our work has its limitations. While our empirical results are satisfying, we have not characterized the statistical properties of our estimators. We lack conditions implying consistency of sparse VDA. Statistical theory is also needed to understand stability, or lack thereof, of our greedy variable selection procedure. Yet, Fig. 6 does suggest that VDA is strongly biased towards selecting sparse models on datasets with a small number of informative features. Unfortunately, cross validation replicates often yield candidate models with different numbers of active features. Dense models can be recovered, as illustrated by the prostate cancer example summarized in Fig. 6. In practice only one or just a few driving features may be selected out of a cluster of highly correlated features. This tendency may be beneficial in finding causal features in biological models. Finally, our code implementation relies heavily on BLAS calls to exploit parallelism in linear algebra operations. Additional parallel structure is potentially neglected.

We defer reporting timing benchmarks to Section C of the Supplementary Materials. Briefly, the most time consuming dataset to fit was TCGA-HiSeq, which required about 7 minutes per replicate on average. We emphasize that this feat involves solving thousands of nonconvex optimization problems as our algorithms (i) follow an annealing path to enforce sparsity via the strength of distance penalties, and (ii) search for an optimal model size. As sparse VDA moves along its solution path, it is only at the smallest model sizes, where classification is often nearly impossible, that our algorithms dramatically slow down. In addition, we note that the initial tuning phase is usually quite fast in practice. Our experience suggests that fitting sparse linear VDA models is much faster than fitting nonlinear VDA models. Whereas our proposed methods lack statistical guarantees, they are supported by theoretical guarantees on algorithmic convergence from the proximal disatance algorithms literature.

Repeated cross validation is not the only viable technique for evaluating candidate models. For instance, the method of stability selection described by Meinshausen and Bühlmann (2010) is attractive for its theoretical finite-sample guarantees in controlling family-wise errors. The percentile lasso is another attractive device for stabilizing model selection (Roberts and Nowak, 2014). The percentile lasso attempts to control variability by using the median of selected hyperparameter values. Both methods mentioned here, unfortunately, do not necessarily scale to truly high dimensional problems in which the number of covariates far exceeds the available number of samples. We fully recognize that our own sparse VDA classifiers are similarly limited in this regime because (a) the onerous demands of grid optimization to tune the optimal model size $[eqn]$ are prohibitive without some prior information, and (b) the update step in our algorithms assumes the design matrix $[eqn]$ is dense. Pioneering work on distance-to-set priors (Presman and Xu, 2023) may offer insights into incorporating Bayesian inference into proximal distance methods, thus addressing the former concern. Screening techniques that capitalize on recent advances in variable selection (Fan and Lv, 2008; Fan et al., 2009) can assist in reliably reducing the number of covariates in high dimensional datasets (Xie et al., 2020).

Finally, we draw the readers attention to how mixed integer programming can significantly improve the computational tractability of best subset selection in sparse regression (Bertsimas et al., 2016; Bertsimas and King, 2017; Bertsimas and Parys, 2020). Hastie et al. (2020) documented the beneficial effects of mixed integer programming in $[eqn]$ -constrained least squares over penalized least squares, including robust nonconvex penalties such as SCAD (Fan and Li, 2001) and MCP (Zhang, 2010). Bertsimas et al. (2021) recently applied their discrete optimization techniques to sparse binary classification. At the time of our writing, the software provided by Bertsimas et al. (2021) only supports binary classification. Thus, we omit comparisons to this body of literature. However, our own experiments echo an important observation reported by Hastie et al. (2020) with respect to support recovery; namely, shrinkage-based estimators tend to perform relatively well in the low SNR regime while $[eqn]$ -based methods often outperform the lasso in the high SNR regime.

In conclusion, we have demonstrated both the prediction accuracy and the feature selection capabilities of our sparse VDA algorithms across a range of simulated and real datasets. Our numerical examples confirm the promise of sparsity-based variable selection in VDA via distance penalties. Combining our estimation procedures with cross validation overcomes the need to specify a particular subset of variables. Replicating cross validation further improves the reliability of variable selection with sparse VDA and provides a straightforward mechanism for honest reporting. In the case of kernel VDA, our proposed methods allow for the simultaneous selection of the support vectors defining the decision boundaries separating multiple classes. We show that sparse VDA can recover a relatively stable set of features even in the underdetermined regime. In addition, we demonstrate the ease of designing penalties for class-specific variable selection and provide empirical evidence that our penalties can improve prediction and interpretation in classification. We hope to address gaps in theory and practice in future publications. In the meantime we hope readers will agree that sparse VDA clustering is ripe for further applications.

Supplementary Material

supple

Bibliography76

The reference list from the paper itself. Each links out to its DOI / PubMed record.

11000 Genomes Project Consortium, 2015. A global reference for human genetic variation. Nature 526, 68–74. 10.1038/nature 15393.26432245 PMC 4750478 · doi ↗ · pubmed ↗
2Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A, Boldrick JC, Sabet H, Tran T, Yu X, Powell JI, Yang L, Marti GE, Moore T, Hudson J, Lu L, Lewis DB, Tibshirani R, Sherlock G, Chan WC, Greiner TC, Weisenburger DD, Armitage JO, Warnke R, Levy R, Wilson W, Grever MR, Byrd JC, Botstein D, Brown PO, Staudt LM, 2000. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, 503–511. 10.1038/35000501.10676951 · doi ↗ · pubmed ↗
3Allwein EL, Schapire RE, Singer Y, 2000. Reducing multiclass to binary: a unifying approach for margin classifiers. J. Mach. Learn. Res 1, 113–141.
4Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, Levine AJ, 1999. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci 96, 6745–6750. 10.1073/pnas.96.12.6745.10359783 PMC 21986 · doi ↗ · pubmed ↗
5Beltrami EJ, 1970. An Algorithmic Approach to Nonlinear Analysis and Optimization. Academic Press.
6Bertsimas D, King A, 2017. Logistic regression: from art to science. Stat. Sci 32, 367–384. 10.1214/16-STS 602. · doi ↗
7Bertsimas D, King A, Mazumder R, 2016. Best subset selection via a modern optimization lens. Ann. Stat 44, 813–852. 10.1214/15-AOS 1388. · doi ↗
8Bertsimas D, Parys BV, 2020. Sparse high-dimensional regression: exact scalable algorithms and phase transitions. Ann. Stat 48, 300–323. 10.1214/18-AOS 1804. · doi ↗