Primal path algorithm for compositional data analysis

Jong-June Jeon; Yongdai Kim; Sungho Won; Hosik Choi

arXiv:1812.08954·cs.LG·December 24, 2018

Primal path algorithm for compositional data analysis

Jong-June Jeon, Yongdai Kim, Sungho Won, Hosik Choi

PDF

Open Access

TL;DR

This paper introduces an efficient solution path algorithm for $l_1$ regularized regression and classification models tailored for compositional data, addressing computational challenges in high-dimensional settings.

Contribution

It develops a novel, faster algorithm for compositional data analysis that handles linear constraints more efficiently than existing methods.

Findings

01

Algorithm significantly reduces computation time.

02

Effective in high-dimensional microbiome data analysis.

03

Extended to classification tasks with compositional predictors.

Abstract

Compositional data have two unique characteristics compared to typical multivariate data: the observed values are nonnegative and their summand is exactly one. To reflect these characteristics, a specific regularized regression model with linear constraints is commonly used. However, linear constraints incur additional computational time, which becomes severe in high-dimensional cases. As such, we propose an efficient solution path algorithm for a $l_{1}$ regularized regression with compositional data. The algorithm is then extended to a classification model with compositional predictors. We also compare its computational speed with that of previously developed algorithms and apply the proposed algorithm to analyze human gut microbiome data.

Tables3

Table 1. Table 1: Average computation time and standard errors (between parentheses)

	$n = 50$			$n = 100$
$p$	200	500	1000	200	500	1000
comlasso	0.23 (0.01)	0.22 (0.00)	0.22 (0.00)	0.52 (0.01)	0.54 (0.01)	0.55 (0.01)
genlasso	0.27 (0.01)	3.25 (0.06)	19.23 (0.33)	0.43 (0.01)	4.18 (0.07)	30.44 (0.34)

Table 2. Table 2: Average computation time and standard errors when p = 1000 𝑝 1000 p=1000

	$n = 50$			$n = 100$
$K$	20	50	100	20	50	100
comlasso	0.26 (0.01)	0.23 (0.00)	0.24 (0.01)	0.67 (0.02)	0.83 (0.02)	0.90 (0.02)
genlasso	21.93 (0.48)	20.57 (0.36)	18.84 (0.36)	35.16 (0.53)	36.75 (0.67)	32.49 (0.55)

Table 3. Table 3: Genera lists using three methods: Selection probability for comlassoA and comlassoB and p-value for the marginal Wilcoxon test. Between parentheses, the letters indicate the phylum level as follows: A stands for “Actinobacteria;” B for “Bacteroidetes;” F for “Firmicutes;” and P for “Proteobacteria.”

	comlassoA	comlassoB	Wilcoxon test
No.	Genera (sel. Prob.)	Genera (sel. Prob.)	Genera (p-value)
1	F. Lactobacillus (0.744)	F. Lactobacillus (0.645)	F. Faecalibacterium (0.001)
2	A. Bifidobacterium (0.468)	F. Phascolarctobacterium (0.357)	P. Sutterella (0.003)
3	P. Sutterella (0.418)	F. Faecalibacterium (0.334)	F. Oscillospira (0.004)
4	B. Parabacteroides (0.402)	F. Christensenella (0.324)	F. Dehalobacterium (0.006)
5	P. Desulfovibrio (0.360)	F. Roseburia (0.314)	P. Oxalobacter (0.006)
6	F. Enterococcus (0.341)	A. Eggerthella (0.297)	P. Desulfovibrio (0.008)
7	F. Turicibacter (0.328)	A. Bifidobacterium (0.274)	B. Butyricimonas (0.022)
8	A. Eggerthella (0.290)	B. Prevotella (0.260)	F. Christensenella (0.022)
9	A. Collinsella (0.270)	F. Clostridium (0.250)	B. Parabacteroides (0.053)
10	F. Roseburia (0.264)	F. Enterococcus (0.240)	F. Lactobacillus (0.053)

Equations67

y

y

\displaystyle\hat{}\mbox{\boldmath{$\beta$}}(\lambda)

\displaystyle\hat{}\mbox{\boldmath{$\beta$}}(\lambda)

\displaystyle\hat{}\mbox{\boldmath{$\beta$}}(\lambda)

\displaystyle\hat{}\mbox{\boldmath{$\beta$}}(\lambda)

\displaystyle L(\mbox{\boldmath{$\beta$}})+\lambda\|\mbox{\boldmath{$\beta$}}\|_{1}+\mbox{\boldmath{$\mu$}}^{\top}D^{\top}\mbox{\boldmath{$\beta$}},

\displaystyle L(\mbox{\boldmath{$\beta$}})+\lambda\|\mbox{\boldmath{$\beta$}}\|_{1}+\mbox{\boldmath{$\mu$}}^{\top}D^{\top}\mbox{\boldmath{$\beta$}},

\displaystyle D=\left(\begin{array}[]{cccc}{\bf d}_{1}&{\bf 0}_{p_{1}}&\cdots&{\bf 0}_{p_{1}}\\ {\bf 0}_{p_{2}}&{\bf d}_{2}&\cdots&{\bf 0}_{p_{2}}\\ \vdots&\vdots&\vdots&\vdots\\ {\bf 0}_{p_{K}}&{\bf 0}_{p_{K}}&\cdots&{\bf d}_{K}\end{array}\right),

\displaystyle D=\left(\begin{array}[]{cccc}{\bf d}_{1}&{\bf 0}_{p_{1}}&\cdots&{\bf 0}_{p_{1}}\\ {\bf 0}_{p_{2}}&{\bf d}_{2}&\cdots&{\bf 0}_{p_{2}}\\ \vdots&\vdots&\vdots&\vdots\\ {\bf 0}_{p_{K}}&{\bf 0}_{p_{K}}&\cdots&{\bf d}_{K}\end{array}\right),

\displaystyle\nabla_{{\cal A}}L(\mbox{\boldmath{$\beta$}})+(D\mbox{\boldmath{$\mu$}})_{{\cal A}}+\lambda{\rm sign}(\mbox{\boldmath{$\beta$}}_{{\cal A}})

\displaystyle\nabla_{{\cal A}}L(\mbox{\boldmath{$\beta$}})+(D\mbox{\boldmath{$\mu$}})_{{\cal A}}+\lambda{\rm sign}(\mbox{\boldmath{$\beta$}}_{{\cal A}})

\displaystyle|\nabla_{{\cal A}^{c}}L(\mbox{\boldmath{$\beta$}})+(D\mbox{\boldmath{$\mu$}})_{{\cal A}^{c}}|

\displaystyle D^{\top}\mbox{\boldmath{$\beta$}}

\displaystyle{\cal E}_{\lambda}(\mbox{\boldmath{$\beta$}})=\{\mbox{\boldmath{$\mu$}}\in\mathbb{R}^{K}:|\nabla L(\mbox{\boldmath{$\beta$}})+D\mbox{\boldmath{$\mu$}}|\preceq\lambda{\bf 1}_{p}\}.

\displaystyle{\cal E}_{\lambda}(\mbox{\boldmath{$\beta$}})=\{\mbox{\boldmath{$\mu$}}\in\mathbb{R}^{K}:|\nabla L(\mbox{\boldmath{$\beta$}})+D\mbox{\boldmath{$\mu$}}|\preceq\lambda{\bf 1}_{p}\}.

j^{'} : j^{'} \in G_{k}, d_{j^{'}} \neq = 0 max (- \frac{λ}{∣ d _{j^{'}} ∣} - \frac{\nabla _{j^{'}} L ( 0 _{p} )}{d _{j^{'}}}) \leq j : j \in G_{k}, d_{j} \neq = 0 min (\frac{λ}{∣ d _{j} ∣} - \frac{\nabla _{j} L ( 0 _{p} )}{d _{j}}),

j^{'} : j^{'} \in G_{k}, d_{j^{'}} \neq = 0 max (- \frac{λ}{∣ d _{j^{'}} ∣} - \frac{\nabla _{j^{'}} L ( 0 _{p} )}{d _{j^{'}}}) \leq j : j \in G_{k}, d_{j} \neq = 0 min (\frac{λ}{∣ d _{j} ∣} - \frac{\nabla _{j} L ( 0 _{p} )}{d _{j}}),

λ_{m a x} = 1 \leq k \leq K max (j, j^{'}) : j, j^{'} \in G_{k}, d_{j}, d_{j^{'}} \neq = 0 max (\frac{\nabla _{j^{'}} L ( 0 _{p} )}{d _{j^{'}}} - \frac{\nabla _{j} L ( 0 _{p} )}{d _{j}}) (\frac{1}{∣ d _{j^{'}} ∣} + \frac{1}{∣ d _{j} ∣})^{- 1} .

λ_{m a x} = 1 \leq k \leq K max (j, j^{'}) : j, j^{'} \in G_{k}, d_{j}, d_{j^{'}} \neq = 0 max (\frac{\nabla _{j^{'}} L ( 0 _{p} )}{d _{j^{'}}} - \frac{\nabla _{j} L ( 0 _{p} )}{d _{j}}) (\frac{1}{∣ d _{j^{'}} ∣} + \frac{1}{∣ d _{j} ∣})^{- 1} .

\displaystyle-\sum_{i=1}^{n}\left(2a(r_{i}(\lambda))r_{i}(\lambda)+b(r_{i}(\lambda))\right){\bf x}_{i{\cal A}}+(D\mbox{\boldmath{$\mu$}}(\lambda))_{{\cal A}}+\lambda\mbox{sign}(\mbox{\boldmath{$\beta$}}_{{\cal A}}(\lambda))

\displaystyle-\sum_{i=1}^{n}\left(2a(r_{i}(\lambda))r_{i}(\lambda)+b(r_{i}(\lambda))\right){\bf x}_{i{\cal A}}+(D\mbox{\boldmath{$\mu$}}(\lambda))_{{\cal A}}+\lambda\mbox{sign}(\mbox{\boldmath{$\beta$}}_{{\cal A}}(\lambda))

\displaystyle D^{\top}\mbox{\boldmath{$\beta$}}(\lambda)

\displaystyle\mbox{\boldmath{$\beta$}}_{{\cal A}}(\lambda-\delta)

\displaystyle\mbox{\boldmath{$\beta$}}_{{\cal A}}(\lambda-\delta)

\displaystyle\mbox{\boldmath{$\beta$}}_{{\cal A}^{c}}(\lambda-\delta)

\displaystyle\mbox{\boldmath{$\mu$}}_{{\cal K}}(\lambda-\delta)

\displaystyle\left(\begin{array}[]{c c}H(\lambda)&D_{{\cal A}}\\ (D^{\top})_{{\cal A}}&O\end{array}\right)\left(\begin{array}[]{c}{\bf b}(\lambda)\\ {\bf m}(\lambda)\end{array}\right)=\left(\begin{array}[]{c}\mbox{sign}(\mbox{\boldmath{$\beta$}}_{{\cal A}}(\lambda))\\ {\bf 0}_{|\mathcal{K}|}\end{array}\right),

\displaystyle\left(\begin{array}[]{c c}H(\lambda)&D_{{\cal A}}\\ (D^{\top})_{{\cal A}}&O\end{array}\right)\left(\begin{array}[]{c}{\bf b}(\lambda)\\ {\bf m}(\lambda)\end{array}\right)=\left(\begin{array}[]{c}\mbox{sign}(\mbox{\boldmath{$\beta$}}_{{\cal A}}(\lambda))\\ {\bf 0}_{|\mathcal{K}|}\end{array}\right),

\displaystyle\nabla_{{\cal A}}L(\mbox{\boldmath{$\beta$}}(\lambda-\delta))+(D_{{\cal K}}\mbox{\boldmath{$\mu$}}_{{\cal K}}(\lambda-\delta))_{{\cal A}}+(\lambda-\delta){\rm sign}(\mbox{\boldmath{$\beta$}}_{{\cal A}}(\lambda-\delta))

\displaystyle\nabla_{{\cal A}}L(\mbox{\boldmath{$\beta$}}(\lambda-\delta))+(D_{{\cal K}}\mbox{\boldmath{$\mu$}}_{{\cal K}}(\lambda-\delta))_{{\cal A}}+(\lambda-\delta){\rm sign}(\mbox{\boldmath{$\beta$}}_{{\cal A}}(\lambda-\delta))

\displaystyle\min_{\beta}L(\mbox{\boldmath{$\beta$}})+(\lambda-\delta)\|\mbox{\boldmath{$\beta$}}\|_{1}

\displaystyle\min_{\beta}L(\mbox{\boldmath{$\beta$}})+(\lambda-\delta)\|\mbox{\boldmath{$\beta$}}\|_{1}

\displaystyle\mbox{subject to }D^{\top}\mbox{\boldmath{$\beta$}}=0,

|\nabla_{{\cal A}^{c}}L(\mbox{\boldmath{$\beta$}}(\lambda-\delta))+(D\mbox{\boldmath{$\mu$}}(\lambda-\delta))_{{\cal A}^{c}}|\preceq(\lambda-\delta)\bf{1}_{|{\cal A}^{c}|}.

|\nabla_{{\cal A}^{c}}L(\mbox{\boldmath{$\beta$}}(\lambda-\delta))+(D\mbox{\boldmath{$\mu$}}(\lambda-\delta))_{{\cal A}^{c}}|\preceq(\lambda-\delta)\bf{1}_{|{\cal A}^{c}|}.

a (r_{i} (λ - δ)) \neq = a (r_{i} (λ)), \mbox or b (r_{i} (λ - δ)) \neq = b (r_{i} (λ)) \mbox f or so m e I,

a (r_{i} (λ - δ)) \neq = a (r_{i} (λ)), \mbox or b (r_{i} (λ - δ)) \neq = b (r_{i} (λ)) \mbox f or so m e I,

|\nabla_{j}L(\mbox{\boldmath{$\beta$}}(\lambda-\delta))+(D\mbox{\boldmath{$\mu$}}(\lambda-\delta))_{j}|>(\lambda-\delta)

|\nabla_{j}L(\mbox{\boldmath{$\beta$}}(\lambda-\delta))+(D\mbox{\boldmath{$\mu$}}(\lambda-\delta))_{j}|>(\lambda-\delta)

\displaystyle\sum_{i=1}^{n}\left(2a(r_{i}(\lambda)){\bf x}_{i{\cal A}}{\bf x}_{i{\cal A}}^{\top}\mbox{\boldmath{$\beta$}}(\lambda)\right)+b(r_{i}(\lambda))y_{i}{\bf x}_{i{\cal A}})+(D\mbox{\boldmath{$\mu$}}(\lambda))_{{\cal A}}+\lambda\mbox{sign}(\mbox{\boldmath{$\beta$}}(\lambda)_{{\cal A}})

\displaystyle\sum_{i=1}^{n}\left(2a(r_{i}(\lambda)){\bf x}_{i{\cal A}}{\bf x}_{i{\cal A}}^{\top}\mbox{\boldmath{$\beta$}}(\lambda)\right)+b(r_{i}(\lambda))y_{i}{\bf x}_{i{\cal A}})+(D\mbox{\boldmath{$\mu$}}(\lambda))_{{\cal A}}+\lambda\mbox{sign}(\mbox{\boldmath{$\beta$}}(\lambda)_{{\cal A}})

\displaystyle d_{j}\mu_{k(j)}(\lambda-\delta)=-\nabla_{j}L(\mbox{\boldmath{$\beta$}}(\lambda-\delta))-\mbox{sign}(\beta_{j}(\lambda-\delta))(\lambda-\delta),

\displaystyle d_{j}\mu_{k(j)}(\lambda-\delta)=-\nabla_{j}L(\mbox{\boldmath{$\beta$}}(\lambda-\delta))-\mbox{sign}(\beta_{j}(\lambda-\delta))(\lambda-\delta),

\displaystyle-(\lambda-\delta)-\nabla_{j}L(\mbox{\boldmath{$\beta$}}(\lambda-\delta))\leq d_{j}\mu_{k(j)}(\lambda-\delta)\leq(\lambda-\delta)-\nabla_{j}L(\mbox{\boldmath{$\beta$}}(\lambda-\delta)),

\displaystyle-(\lambda-\delta)-\nabla_{j}L(\mbox{\boldmath{$\beta$}}(\lambda-\delta))\leq d_{j}\mu_{k(j)}(\lambda-\delta)\leq(\lambda-\delta)-\nabla_{j}L(\mbox{\boldmath{$\beta$}}(\lambda-\delta)),

δ \geq 0 max δ

δ \geq 0 max δ

\displaystyle\mbox{subject to }\max_{j\in{\cal G}_{k}}\frac{\nabla_{j}L(\mbox{\boldmath{$\beta$}}(\lambda-\delta))}{d_{j}}-\frac{(\lambda-\delta)}{|d_{j}|}

\displaystyle~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}\leq\min_{j^{\prime}\in{\cal G}_{k}}\frac{\nabla_{j^{\prime}}L(\mbox{\boldmath{$\beta$}}(\lambda-\delta))}{d_{j^{\prime}}}+\frac{(\lambda-\delta)}{|d_{j^{\prime}}|},

\displaystyle\frac{\nabla_{j}L(\mbox{\boldmath{$\beta$}}(\lambda-\delta^{*}))}{d_{j}}-\frac{(\lambda-\delta^{*})}{|d_{j}|}=\frac{\nabla_{j^{\prime}}L(\mbox{\boldmath{$\beta$}}(\lambda-\delta^{*}))}{d_{j^{\prime}}}+\frac{(\lambda-\delta^{*})}{|d_{j^{\prime}}|},

\displaystyle\frac{\nabla_{j}L(\mbox{\boldmath{$\beta$}}(\lambda-\delta^{*}))}{d_{j}}-\frac{(\lambda-\delta^{*})}{|d_{j}|}=\frac{\nabla_{j^{\prime}}L(\mbox{\boldmath{$\beta$}}(\lambda-\delta^{*}))}{d_{j^{\prime}}}+\frac{(\lambda-\delta^{*})}{|d_{j^{\prime}}|},

\displaystyle\nabla_{j}L(\mbox{\boldmath{$\beta$}}(\lambda-\delta^{*}))+d_{j}\mu_{k}(\lambda-\delta^{*})-\mbox{sign}(d_{j})(\lambda-\delta^{*})=0

\displaystyle\nabla_{j}L(\mbox{\boldmath{$\beta$}}(\lambda-\delta^{*}))+d_{j}\mu_{k}(\lambda-\delta^{*})-\mbox{sign}(d_{j})(\lambda-\delta^{*})=0

\displaystyle\nabla_{j^{\prime}}L(\mbox{\boldmath{$\beta$}}(\lambda-\delta^{*}))+d_{j^{\prime}}\mu_{k}(\lambda-\delta^{*})+\mbox{sign}(d_{j^{\prime}})(\lambda-\delta^{*})=0,

\displaystyle\nabla_{j}L(\mbox{\boldmath{$\beta$}}(\lambda-\delta^{*}))+(D\mbox{\boldmath{$\mu$}}(\lambda-\delta^{*}))_{j}+(\lambda-\delta^{*})=0,

\displaystyle\nabla_{j}L(\mbox{\boldmath{$\beta$}}(\lambda-\delta^{*}))+(D\mbox{\boldmath{$\mu$}}(\lambda-\delta^{*}))_{j}+(\lambda-\delta^{*})=0,

\displaystyle\nabla_{j}L(\mbox{\boldmath{$\beta$}}(\lambda-\delta^{*}))+(D\mbox{\boldmath{$\mu$}}(\lambda-\delta^{*}))_{j}-(\lambda-\delta^{*})=0.

\displaystyle\nabla_{j}L(\mbox{\boldmath{$\beta$}}(\lambda-\delta^{*}))+(D\mbox{\boldmath{$\mu$}}(\lambda-\delta^{*}))_{j}-(\lambda-\delta^{*})=0.

\displaystyle t\hat{\mu}+(1-t)\tilde{\mu}=X_{j}^{\top}(Y-X\hat{}\mbox{\boldmath{$\beta$}})-\lambda\mbox{sign}(\hat{\beta}_{j}),

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsGeochemistry and Geologic Mapping · Hydrocarbon exploration and reservoir analysis · Mineral Processing and Grinding

MethodsSPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings

Full text

Primal path algorithm for compositional data analysis

Jong-June Jeon

Department of Statistics & Natural Science Research Institute University of Seoul,

Yongdai Kim

Department of Statistics, Seoul National University,

Sungho Won

Department of Public Health Science, Seoul National University,

and Hosik Choi

Department of Applied Statistics, Kyonggi University

Abstract

Compositional data have two unique characteristics compared to typical multivariate data: the observed values are nonnegative and their summand is exactly one. To reflect these characteristics, a specific regularized regression model with linear constraints is commonly used. However, linear constraints incur additional computational time, which becomes severe in high-dimensional cases. As such, we propose an efficient solution path algorithm for a $l_{1}$ regularized regression with compositional data. The algorithm is then extended to a classification model with compositional predictors. We also compare its computational speed with that of previously developed algorithms and apply the proposed algorithm to analyze human gut microbiome data.

Keywords: penalized regression; constraint; solution path algorithm; microbiome

1 Introduction

In modern regression analysis, it is frequently observed that regression predictors consist of the proportions or relative ratios of certain values rather than absolute values. For example, in analyzing air pollution data, the percentages of chemicals in the air are considered relevant predictors to identify the source of a pollutant (Lee et al.,, 2007). These types of proportional data, typically called compositional data, are widely used in geoscience (Buccianti et al.,, 2006), microbiology (Montassier et al.,, 2016), and nutritional biochemistry (Leite,, 2016). By the definition of compositional data, all compositional predictors lie on the simplex and are thus linearly dependent.

Aitchison and Bacon-shone, (1984) proposed a regression model for compositional data as follows. Let ${\bf y}=(y_{1},\cdots,y_{n})^{\top}$ be a real valued response vector and $U=({\bf u}_{1},\cdots,{\bf u}_{n})^{\top}$ the predictors, where ${\bf u}_{i}$ is an element on the $p$ -dimensional standard simplex. Instead of modeling $\bf y$ on $U$ directly, Aitchison and Bacon-shone, (1984) introduced a specific transformation of $U$ , in which ${\bf x}_{i}=\log({\bf u}_{i})$ by the logarithm of each component of the vector, and proposed the following regression model with a constraint:

[TABLE]

where $X=({\bf x}_{1},\cdots,{\bf x}_{n})^{\top}$ , $\mbox{\boldmath{$ \beta $}}=(\beta_{1},\cdots,\beta_{p})^{\top}$ , ${\bf 1}=(1,\cdots,1)^{\top}\in\mathbb{R}^{p}$ , and $\epsilon$ is a $n$ -dimensional random vector with mean zero and finite variances. Subsequently, Lin et al., (2014) adopted (1) as a regression model with compositional data and proposed the $l_{1}$ regularized estimator of the regression coefficients, given by

[TABLE]

However, the standard optimization algorithm for the $l_{1}$ regularized estimator is not directly applicable to solving (1) due to constraint ${\bf{1}^{\top}}\mbox{\boldmath{$ \beta $}}=0.$ For example, the coordinate-wise algorithm (Friedman et al.,, 2010), one of the most popular algorithms for $l_{1}$ regularization, cannot be used because the $l_{1}$ penalty function is non-separable (Tseng and Yun,, 2009). Alternatively, quadratic programming (Brodie et al.,, 2009; Bondell and Reich,, 2009) or the alternating direction method of multipliers (ADMM) algorithm can be used (Lin et al.,, 2014; Fang et al.,, 2015). However, all these algorithms solve problem (1) for a fixed $\lambda$ and do not provide a solution path.

Various solution path algorithms for the $l_{1}$ regularized estimator with linear constraints have been developed as to provide a solution path by Tibshirani and Taylor, (2011), Zhou and Lange, (2013), and Zhou and Wu, (2014). However, these algorithms have their own problems when applied to high-dimensional data. For example, the generalized lasso (genlasso) algorithm of Tibshirani and Taylor, (2011) does not generally provide a solution path when $p>n$ and uses the additional $l_{2}$ penalty to solve this problem. However, adding the $l_{2}$ penalty not only increases computational complexity but also makes the corresponding solution path inaccurate (see Section 3 for details on this problem). The algorithms of Zhou and Lange, (2013), Zhou and Wu, (2014), and Gaines et al., (2018) solve general problems, including (1). However, a twice-differentiable loss function is required to solve the problem, which precludes further extensions such as a Huberized regression. Moreover, all aforementioned algorithms are not optimal for solving (1), which has specific constraints.

The aim of this paper is thus to propose an algorithm to construct a solution path $\{\hat{}\mbox{\boldmath{$ \beta $}}(\lambda):\lambda\geq 0\}$ , in which

[TABLE]

Where $L:\mathbb{R}^{n}\times\mathbb{R}^{n}\mapsto\mathbb{R}$ , $\mbox{\boldmath{$ \beta $}}=(\mbox{\boldmath{$ \beta $}}_{1}^{\top},\cdots,\mbox{\boldmath{$ \beta $}}_{K}^{\top})^{\top}\in\mathbb{R}^{p}$ with $\mbox{\boldmath{$ \beta $}}_{k}\in\mathbb{R}^{p_{k}}$ , and ${\bf{d}}_{k}\in\mathbb{R}^{p_{k}}$ . We let $L({\bf y},X\mbox{\boldmath{$ \beta $}})=\sum_{i=1}^{n}l(y_{i},{\bf x}_{i}^{\top}\mbox{\boldmath{$ \beta $}})$ and consider $l:\mathbb{R}\times\mathbb{R}\mapsto\mathbb{R}$ as the class of piecewise quadratic loss functions (Rosset and Zhu,, 2007) including quadratic, Huberized, and quadratic hinge loss. In the regression model, $y_{i}\in\mathbb{R}$ , and in the classification model, $y_{i}\in\{-1,1\}$ . Additionally, we introduce constant vectors ${\bf{d}}_{k}$ for $k=1,\cdots,K$ to consider the more general $l_{1}$ constraint in the problem. For example, multiple compositional predictors can be included in the model. We call the proposed algorithm the composite-lasso (comlasso).

The main contribution of this paper is clearly describing the event of violation for the Karush-Kuhn-Tucker (KKT) conditions corresponding to the linear constraint in (1). We monitor the event and employ the monitoring result to determine the solution path. This enables us to seek the exact initial solution referred to in Gaines et al., (2018) when $p>n$ and also provides an efficient way to find the exact activation set in the solution path. To the best of our knowledge, this is the first paper to provide an exact description of the violation. Therefore, comlasso has the following advantages compared to other algorithms: it is computationally more efficient, especially for high-dimensional data, requires no modification by adding the $l_{2}$ penalty, and works with various loss functions other than square loss. In this paper, the intercept of the model is not considered for convenience. However, the proposed algorithm can be easily modified to include the intercept.

The remainder of the paper is organized as follows. Section 2 explains the KKT conditions for problem (1) and describes the solution path generation rule. Section 3 provides the results of the simulation study and the microbiome data analysis. Section 4 presents concluding remarks.

Finally, we define the notations used in this paper. For a positive integer $k,$ ${\bf 1}_{k}$ and ${\bf 0}_{k}$ denote the $k$ -dimensional vector whose elements are $1$ and [math], respectively. For an index set $\mathcal{A}\subset\{1,\cdots,p\}$ , $v_{\mathcal{A}}$ denotes the subvector of $v$ , whose elements are chosen by index set $\mathcal{A}$ . Similarly $M_{\mathcal{A}}$ denotes the submatrix of $M$ , whose columns are drawn according to index set ${\cal A}$ . When ${\cal A}$ is singleton, we denote the index set by its element (e.g., $v_{j}$ and $M_{j}$ for ${\cal A}=\{j\}$ ). Let ${\cal G}_{k}$ for $k=1,\cdots,K$ be the consecutive partition of $\{1,\cdots,p\}$ , with $|{\cal G}_{k}|=p_{k}$ , where $|{\cal G}|$ is the cardinality of set ${\cal G}$ . Let $k(j)\in\{1,\cdots,K\}$ be the membership of $j\in\{1,\cdots,p\}$ , that is, $j\in{\cal G}_{k(j)}$ . We denote $L({{\bf y}},X\mbox{\boldmath{$ \beta $}})$ simply by $L(\mbox{\boldmath{$ \beta $}})$ . Finally, $\preceq$ is the elementwise inequality in the vector.

2 Primal path algorithm for comlasso

The almost quadratic loss function $l:\mathbb{R}\times\mathbb{R}\mapsto\mathbb{R}$ is defined by $l(y,{\bf x}^{\top}\mbox{\boldmath{$ \beta $}})=a(r)r^{2}+b(r)r+c(r)$ , where $a(r),b(r),$ , and $c(r)$ are piecewise constants depending on $r$ , which is a function of $y$ and ${\bf x}^{\top}\mbox{\boldmath{$ \beta $}}$ . In the regression model, $r$ is the residual, $y-{\bf x}^{\top}\mbox{\boldmath{$ \beta $}}$ , and in the classification model, $r$ is the margin, $y{\bf x}^{\top}\mbox{\boldmath{$ \beta $}}.$ Rosset and Zhu, (2007) proposed an algorithm for obtaining the piecewise linear solution path of the $l_{1}$ regularized optimization problem with the use of the above almost quadratic loss function. Essentially, piecewise linearity is derived from the condition that the second-order derivative of the loss function is locally constant. See examples 1 and 2 in the Appendix, where lists of popular almost quadratic functions are illustrated. Additionally, comlasso employs the property of the loss function and finds a solution set of (1) satisfying piecewise linearity. Therefore, comlasso is mainly based on conventional solution path algorithms. However, it monitors the residuals or margins between ${\bf y}$ and $X\mbox{\boldmath{$ \beta $}}$ on $\mbox{\boldmath{$ \beta $}}\in\mbox{null}(D^{\top})$ and exactly computes the dual parameters corresponding to the equality constraint $D^{\top}\mbox{\boldmath{$ \beta $}}={\bf 0}_{K}$ . This is the key step to implement the comlasso algorithm in a high-dimensional model.

The comlasso algorithm consists of four steps: determining the first active coefficients with maximum $\lambda$ , identifying the direction of solution path for the active coefficients, computing the step size of the solution path by monitoring the violation of KKT conditions, and updating the active coefficients and dual parameters. The last three steps are repeated until the algorithm cannot update the set of active coefficients anymore.

2.1 KKT conditions and initialization of comlasso

The Lagrangian of primal problem (1) is given by

[TABLE]

where $\mbox{\boldmath{$ \beta $}}=(\beta_{1},\cdots,\beta_{p})^{\top}$ , $\mbox{\boldmath{$ \mu $}}=(\mu_{1},\cdots,\mu_{K})^{\top}\in\mathbb{R}^{K},$

[TABLE]

and $\mu$ is the dual parameter corresponding to the equality constraint. The KKT conditions are as follows: there exists a vector $(\mbox{\boldmath{$ \beta $}},\mbox{\boldmath{$ \mu $}})$ , so that

[TABLE]

where ${\cal A}=\{j:\beta_{j}\neq 0\}$ , $\nabla L(\mbox{\boldmath{$ \beta $}})$ is the gradient vector of $L(\mbox{\boldmath{$ \beta $}})$ and $\nabla_{{\cal A}}L(\mbox{\boldmath{$ \beta $}})$ denotes $(\nabla L(\mbox{\boldmath{$ \beta $}}))_{{\cal A}}$ . Eqs. (6) and (8) are the stationarity and primal feasibility conditions, respectively. Eq. (7) is a stationarity condition with the existence of dual parameter $\mu$ , which plays a key role in comlasso. Note that, if it exists, the dual parameter $\mu$ is not necessarily unique.

Specifically, to describe the existence of $\mu$ , we introduce the dual feasible set of $\mu$ for a given $\lambda$ and $\beta$ :

[TABLE]

Particularly, let $\mbox{\boldmath{$ \beta $}}={\bf 0}_{p}$ ; then, ${\cal E}_{\lambda}({\bf 0}_{p})$ is the feasible set of $\mu$ corresponding to the initial solution of comlasso. If we let $\lambda_{\max}=\inf\{\lambda\in\mathbb{R}:{\cal E}_{\lambda}({\bf 0}_{p})\neq\phi\}$ , $\mbox{\boldmath{$ \beta $}}={\bf 0}_{p}$ is a trivial optimal solution of (1) for all $\lambda\geq\lambda_{\max}$ . However, for $\lambda<\lambda_{\max},$ $\mbox{\boldmath{$ \beta $}}={\bf 0}_{p}$ is not the solution of (1) anymore, because the corresponding $\mu$ does not exist. Therefore, comlasso sets $\lambda=\lambda_{\max}$ as the initializing regularization parameter.

Obviously, ${\cal E}_{\lambda}({\bf 0}_{p})$ is not empty if and only if

[TABLE]

for all $1\leq k\leq K$ , where ${\bf{d}}=({\bf d}_{1}^{\top},\cdots,{\bf d}_{K}^{\top})^{\top}=(d_{1},\cdots,d_{p})^{\top}$ . Note that ${\bf d}_{k}=(d_{j}:j\in{\cal G}_{k})$ and ${\cal G}_{k}$ is as per the notation described above. Therefore, we have

[TABLE]

If there exists the unique triplet $(k,j,j^{\prime})$ satisfying Eq. (9), the $\mu_{k}$ satisfying (8) is uniquely given by $\lambda_{\max}/d_{j^{\prime}}-\nabla_{j^{\prime}}L({\bf 0}_{p})/d_{j^{\prime}}$ or $-\lambda_{\max}/{d_{j}}-\nabla_{j}L({\bf 0}_{p})/{d_{j}}$ . Furthermore, the sign of $\beta_{j}$ and $\beta_{j^{\prime}}$ is assigned by $\mbox{sign}(d_{j})$ and $-\mbox{sign}(d_{j^{\prime}})$ , which will be explained in Section 3.3. The dual feasible set $\mathcal{E}_{\lambda}(\mbox{\boldmath{$ \beta $}})$ will be used to re-check the violation of the KKT conditions.

Remark. Gaines et al., (2018) proposed a method to find an initialized solution under the constrained lasso, which includes problem (1) as a special case. However, it is not clearly shown whether one or more coefficients can be simultaneously activated on the null space of $D^{\top}$ . In problem (1), two or more coefficients should be simultaneously activated in the initialization of the solution path algorithm. Additionally, all the signs of activated coefficients cannot be equal. When there are the multiple triplets in (9), comlasso chooses only one of them and activates the corresponding coefficients.

2.2 Finding the direction of the solution path

Rosset and Zhu, (2007) proposed a toolbox for the linear solution path. We explain the piecewise linearity of the solution path based on this toolbox, considering the regression problem for convenience. To emphasize the dependence on $\lambda$ , we denote the solution of (4) by $\mbox{\boldmath{$ \beta $}}(\lambda)=(\beta_{1}(\lambda),\cdots,\beta_{p}(\lambda))^{\top}$ , and let the active variable and group variable index set be ${\cal A}=\{j:\beta_{j}(\lambda)\neq 0\}$ and $\mathcal{K}=\{k:\|\mbox{\boldmath{$ \beta $}}_{{\cal G}_{k}}(\lambda)\|_{1}\neq 0\}$ , respectively. By the KKT conditions in (6) and (8), we have the following two equations:

[TABLE]

where $r_{i}(\lambda)=(y_{i}-{\bf x}_{i{\cal A}}^{\top}\mbox{\boldmath{$ \beta $}}_{{\cal A}}(\lambda))$ . We set a candidate of solutions $\mbox{\boldmath{$ \beta $}}(\lambda-\delta)$ with $\delta>0$ as

[TABLE]

where ${\bf b}(\lambda)$ and ${\bf m}(\lambda)$ are the solutions of linear equation

[TABLE]

$H(\lambda)=\sum_{i=1}^{n}2a(r_{i}(\lambda)){\bf x}_{i{\cal A}}{\bf x}_{i{\cal A}}^{\top}$ , and $O$ is the $|\mathcal{K}|\times|\mathcal{K}|$ zero valued matrix. $({\bf b}(\lambda)^{\top},{\bf m}(\lambda)^{\top})^{\top}$ uniquely exists if $H(\lambda)$ is positive and definite and $D_{{\cal A}}$ is full rank. It is easily verified that $\mbox{\boldmath{$ \beta $}}(\lambda-\delta)$ and $\mbox{\boldmath{$ \mu $}}_{{\cal K}}(\lambda-\delta)$ always satisfy

[TABLE]

and $D^{\top}\mbox{\boldmath{$ \beta $}}(\lambda-\delta)={\bf 0}_{p}$ , if $\mbox{sign}(\mbox{\boldmath{$ \beta $}}(\lambda))=\mbox{sign}(\mbox{\boldmath{$ \beta $}}(\lambda-\delta))$ , $a(r_{i}(\lambda-\delta))=a(r_{i}(\lambda))$ , and $b(r_{i}(\lambda-\delta))=b(r_{i}(\lambda))$ for all $i$ . These two equations imply that $\mbox{\boldmath{$ \beta $}}(\lambda-\delta)$ and $\mbox{\boldmath{$ \mu $}}(\lambda-\delta)$ are respectively the primal and dual solutions of

[TABLE]

as far as there exists $\mbox{\boldmath{$ \mu $}}_{{\cal K}^{c}}(\lambda-\delta)$ so that

[TABLE]

Particularly, the existence of $\mbox{\boldmath{$ \mu $}}_{{\cal K}^{c}}(\lambda-\delta)$ is verified by the dual feasible set $\mathcal{E}_{\lambda-\delta}(\mbox{\boldmath{$ \beta $}}(\lambda-\delta))$ defined in Section 2.1. Therefore, for $\mbox{\boldmath{$ \beta $}}(\lambda-\delta)$ and $\mbox{\boldmath{$ \mu $}}(\lambda-\delta)$ with $\delta\geq 0$ , we can categorize the violations of the KKT conditions and updating rules as follows:

Rules for updating the activated primal and dual solutions

(C1)

termination condition: if $\lambda-\delta=0$ , then comlasso stops;

(C2)

sign condition: if $\mbox{\boldmath{$ \beta $}}(\lambda-\delta)\neq\mbox{\boldmath{$ \beta $}}(\lambda)$ , then update ${\cal A}$ ;

(C3)

condition of piecewise constant in loss function: if

[TABLE]

then update $H(\lambda-\delta)$ ;

(C4)

dual feasibility condition: if $\mathcal{E}_{\lambda-\delta}(\mbox{\boldmath{$ \beta $}}(\lambda-\delta))=\phi$ , then update ${\cal A}$ and ${\cal K}$ ;

(C5)

stationarity condition: if there exists $\mbox{\boldmath{$ \mu $}}(\lambda-\delta)$ , but

[TABLE]

for some $j\in{\cal A}^{c}$ , then update ${\cal A}$ .

In applying the updating rules, the signs of the active coefficients are determined by the inequality of each condition. The determination of signs will be explained in the following subsection. If ${\cal A}$ , ${\cal K}$ , and $\mbox{sign}(\mbox{\boldmath{$ \beta $}}_{{\cal A}})$ are obtained, then comlasso computes the direction of the solution path by solving (16) again and produces the solution path by increasing $\delta$ before one of the violations occurs. There is also the need to find the minimum $\delta>0$ that leads one of the violation events. We call the minimum $\delta$ the step size of the solution path.

Remark. In the classification problem, (6) in the KKT conditions is expressed as

[TABLE]

Additionally, $H(\lambda)$ is same as that in the regression case. Therefore, equation (16) is also applied to find $({\bf b}(\lambda)^{\top},{\bf m}(\lambda)^{\top})^{\top}$ in the classification problem.

2.3 Computation of the step size

Let $\mbox{\boldmath{$ \beta $}}(\lambda)$ and $\mbox{\boldmath{$ \mu $}}(\lambda)$ be the primal and dual solutions of (1), respectively, and $(\bf{b}(\lambda)^{\top},\bf{m}(\lambda)^{\top})^{\top}$ the solution of (16). It is easy to verify whether the first two conditions, (C1) and (C2), are still valid for $\delta\geq 0$ . The violation of condition (C3) is explained by Rosset and Zhu, (2007) in detail, and we call this violation event “hitting the knots.” As such, we explain the dual feasibility (C4) and stationarity conditions (C5) to determine the step size and sign of the new activated coefficients.

First, we assume that step size is determined by the violation in the dual feasibility condition (C4). The dual feasibility condition is written as

[TABLE]

for $j\in{\cal A}$ , and

[TABLE]

for all $j\in{\cal A}^{c}$ . Here, $\nabla_{j}L(\mbox{\boldmath{$ \beta $}}(\lambda-\delta))=\nabla_{j}L(\mbox{\boldmath{$ \beta $}}(\lambda))+\sum_{i=1}^{n}\left(2a(r_{i}(\lambda))({\bf x}_{i})_{j}{\bf x}_{i{\cal A}}^{\top}{\bf m}(\lambda)\right)\delta,$ , being a linear function of $\delta$ . Therefore, the computation of step size is reduced to the problem of finding the maximum $\delta$ with ${\cal E}_{\lambda-\delta}(\mbox{\boldmath{$ \beta $}}(\lambda-\delta))\neq\phi$ , which can be solved by the following linear programming problem:

[TABLE]

for all $k\in{\cal K}^{c}$ . It is clear that the feasible set of problem (18) is not empty. In fact, solving (18) requires at most $\sum_{k=1}^{K}|{\cal G}_{k}|(|{\cal G}_{k}|-1)/2$ intersections of intervals.

Let $\delta^{*}$ be the solution of problem (18); then, there exist $k\in{\cal K}^{c}$ and $(j,j^{\prime})\in{\cal G}_{k}$ so that

[TABLE]

which is equal to $-\mu_{k}(\lambda-\delta^{*})$ . The violation of the dual feasibility condition means that $\mbox{\boldmath{$ \beta $}}(\lambda-\delta)+\delta{\bf m}(\lambda-\delta)$ for $\delta>\delta^{*}$ is not the solution of (17) and

[TABLE]

for $k$ and $(j,j^{\prime})$ in (19). The two equations above imply that $k$ and $(j,j^{\prime})$ should be included in ${\cal K}$ and ${\cal A}$ to find a new direction of the solution path, respectively, and that the signs of the new activated coefficients are given by $\mbox{sign}(\beta_{j})=-\mbox{sign}(d_{j})$ and $\mbox{sign}(\beta_{j^{\prime}})=\mbox{sign}(d_{j^{\prime}})$ .

Next, we assume that step size is determined by the violation in the stationarity condition (C5) and let step size be $\delta^{*}$ again. In this case, $\mbox{\boldmath{$ \mu $}}(\lambda-\delta^{*})$ exists but, for some $j\in{\cal A}^{c}$ and $k(j)\in{\cal K}$ ,

[TABLE]

or

[TABLE]

To find a new direction, we should include index $j$ in ${\cal A}$ and let $\mbox{sign}(\beta_{j}(\lambda-\delta^{*}))=1$ for the former case and $\mbox{sign}(\beta_{j}(\lambda-\delta^{*}))=-1$ for the latter.

2.4 Uniqueness of the solution path

As previously mentioned, the comlasso algorithm consists of the initialization, determination of the signed active set, and computation of step size. We summarize comlasso as follows:

comlasso algorithm

•

Initialization

–

Compute $\lambda_{\max}$ from (9) and let $\lambda=\lambda_{\max}$ and $\mbox{\boldmath{$ \beta $}}(\lambda)={\bf 0}_{p}$ ;

–

Initialize ${\cal A}$ and ${\cal K}$ and compute the corresponding dual parameter $\mu_{k}$ for $k\in{\cal K}$ ;

–

Assign the sign of the active coefficients $\beta_{j}(\lambda)$ for $j\in{\cal A}$ .

•

Repeat

–

Compute $\bf{b}(\lambda)$ and ${\bf m}(\lambda)$ by solving (16);

–

Compute $\delta^{*}=\min(\delta_{1},\delta_{2},\delta_{3},\delta_{4},\delta_{5})$ based on the updating rules in Section 2.2:

(a)

$\delta_{1}=\lambda$ ; 2. (b)

$\delta_{2}=\sup\{\delta\geq 0:\mbox{sign}(\beta_{j}(\lambda-\delta))=\mbox{sign}(\beta_{j}(\lambda))\mbox{ for all }j\in{\cal A}\}$ ; 3. (c)

$\delta_{3}$ is the smallest $\delta$ of “hitting the knot;” 4. (d)

$\delta_{4}$ is computed by (18); 5. (e)

$\delta_{5}=\{\delta\geq 0:|\nabla_{j}L(\mbox{\boldmath{$ \beta $}}(\lambda-\delta))+(D\mbox{\boldmath{$ \mu $}}(\lambda-\delta))_{j}|\leq(\lambda-\delta)\mbox{ for all }j\in{\cal A}^{c},k(j)\in{\cal K}\}.$

–

Let $\mbox{\boldmath{$ \beta $}}_{{\cal A}}(\lambda-\delta^{*})=\mbox{\boldmath{$ \beta $}}_{{\cal A}}(\lambda)+\delta^{*}{\bf b}(\lambda)$ and $\mbox{\boldmath{$ \mu $}}_{{\cal K}}(\lambda-\delta^{*})=\mbox{\boldmath{$ \mu $}}_{{\cal K}}(\lambda)+\delta^{*}{\bf m}(\lambda)$ and update $\lambda\leftarrow\lambda-\delta^{*}$ , ${\cal A}$ , ${\cal K}$ , and the sign of $\mbox{\boldmath{$ \beta $}}_{{\cal A}}(\lambda)$ ;

–

If $\delta^{*}=0$ , then stop.

As a special case of (1), we consider the problem (1). Let $j$ and $j^{\prime}\in\{1,\cdots,p\}$ be the index satisfying $X_{j}^{\top}{\bf y}\leq X_{l}^{\top}{\bf y}$ for all $l\neq j$ and $X_{j^{\prime}}^{\top}{\bf y}\geq X_{l}^{\top}{\bf y}$ for all $l\neq j^{\prime}$ . Then, the initializing regularization parameter is $\lambda_{\max}=(X_{j^{\prime}}-X_{j})^{\top}{\bf y}/2$ and the dual parameter associated with the equality constraint is uniquely given by $\mu=(X_{j}+X_{j^{\prime}})^{\top}{\bf y}/2$ . The following theorem shows that the solution path in problem (1) is unique and, thus, comlasso pursues the unique solution path.

Theorem 1.

If any $n\times n$ , the submatrix of $X$ has full rank; for $\lambda>0$ , there is the unique solution of the problem (1) and comlasso provides the solution path for $\lambda>0$ .

Proof.

We fix $\lambda>0$ and denote the dual solution of (1) by $\hat{\mu}\in\mathbb{R}.$ First, we show that $\hat{\mu}$ is unique. Let $\hat{}\mbox{\boldmath{$ \beta $}}$ and $\tilde{}\mbox{\boldmath{$ \beta $}}$ be the solution of (1). Additionally, we set $\tilde{\mu}$ as another dual solution and let ${\cal A}_{\hat{\mu}}=\{j:|X_{j}^{\top}(y-X\hat{}\mbox{\boldmath{$ \beta $}})+\hat{\mu}|=\lambda\}$ and ${\cal A}_{\tilde{\mu}}=\{l:|X_{l}^{\top}(y-X\tilde{}\mbox{\boldmath{$ \beta $}})+\tilde{\mu}|=\lambda\}$ . From the strict convexity of the objective function, $X\hat{}\mbox{\boldmath{$ \beta $}}=X\tilde{}\mbox{\boldmath{$ \beta $}}$ and $X^{\top}(y-X\hat{}\mbox{\boldmath{$ \beta $}})=X^{\top}(y-X\tilde{}\mbox{\boldmath{$ \beta $}})$ (see Lemma 1 in Tibshirani et al., (2013)). The dual function is always concave, so that $t\tilde{\mu}+(1-t)\hat{\mu}$ for all $t\in[0,1]$ also represent the dual solution. Assuming that ${\cal A}_{\hat{\mu}}\cap{\cal A}_{\tilde{\mu}}$ is not empty,

[TABLE]

for $j\in{\cal A}_{\hat{\mu}}\cap{\cal A}_{\tilde{\mu}},$ which is constant. Therefore, ${\cal A}_{\hat{\mu}}\cap{\cal A}_{\tilde{\mu}}$ must be empty if $\hat{\mu}\neq\tilde{\mu}$ . However, we know that $t\hat{}\mbox{\boldmath{$ \beta $}}+(1-t)\tilde{}\mbox{\boldmath{$ \beta $}}$ for $t\in(0,1)$ is also the primal solution of (1), which is a contradiction that the intersection of the active sets of primal solutions $\hat{}\mbox{\boldmath{$ \beta $}}$ and $t\hat{}\mbox{\boldmath{$ \beta $}}+(1-t)\tilde{}\mbox{\boldmath{$ \beta $}}$ must be empty. That is, $\hat{\mu}$ is unique if ${\cal A}_{\hat{\mu}}$ is not empty.

From the uniqueness of the dual solution, the corresponding $\hat{}\mbox{\boldmath{$ \beta $}}$ is unique because any $n\times n$ submatrix of $X$ has full rank (see Theorem 5 in Osborne et al., (2000)). This result implies that comlasso pursues the unique solution of (1) for $\lambda>0$ .

$\square$

3 Numerical examples

3.1 Toy example analysis

The comlasso algorithm can be applied to the compositional regression model estimation with adaptive lasso penalty (Zou,, 2006) through simple reparametrization. We applied the adaptive lasso to sand data analysis (Aitchison,, 1986) as a toy example of compositional regression. The sand data consist of 39 sediment samples, collected at different water depths. The sediment components are sand, silt, and clay, measured in percentages for each sample. Water depth is the response variable and the others are predictors in the compositional regression. The weights for adaptive penalty are the inverses of non-regularized estimates. We compare the solution paths and the regression coefficients of the lasso and adaptive lasso estimates chosen by Bayesian information criteria (BIC). We set the degree of freedom in BIC as the number of active coefficients subtracted from 1.

Figure 1 shows the solution paths of lasso (left) and adaptive lasso (right). The vertical line denotes the selected model using BIC. The non-regularized estimates corresponding to the predictor, clay, are relatively small. A large penalization on the predictor under adaptive lasso leads different solution paths from that of lasso.

3.2 Speed comparison

Comlasso adopts the idea of LARs Efron et al., (2004) and the piecewise linear toolbox (Rosset and Zhu,, 2007) that pursue the solution path under lasso. Compared with LARs, the computational cost of the comlasso algorithm is not significantly more expensive. Only $|{\cal K}|$ more Lagrangian variables are used in the linear system to find the solution path directions and a maximum of $\sum_{k\notin{\cal K}}p_{k}(p_{k}-1)/2$ additional arithmetic computations are required to obtain the step size. Compared with genlasso (Tibshirani and Taylor,, 2011), comlasso has a computational advantage in a high-dimensional regression model. In fact, genlasso was originally developed to solve the regression problem with the full ranked design matrix, so that it employs an additional $l_{2}$ penalty when $p>n$ . The $l_{2}$ penalty function is incorporated into the design matrix for the full rank design, which leads to a drastic increase in computational cost, especially when $p$ is significantly larger than $n$ (see Section 7 in Tibshirani and Taylor, (2011)). However, comlasso does not require the full ranked design matrix, nor suffer from such a computation problem.

We consider an optimization problem with quadratic loss and perform two simulations to compare the computational speeds of genlasso and comlasso. The simulation setting is similar to Lin et al., (2014). Here, ${\bf x}_{i}\in\mathbb{R}^{p}$ for $i=1,\cdots,n$ , are generated from the logarithm of the logistic normal distribution with $(\mbox{\boldmath{$ \eta $}},\Sigma)$ , where $\mbox{\boldmath{$ \eta $}}_{{\cal G}_{k}}=\log(p_{k}/2){\bf 1}_{p_{k}}$ for $k=1,\cdots,5$ , $\mbox{\boldmath{$ \eta $}}_{{\cal G}_{k}}={\bf 0}_{p_{k}}$ for $6\leq k\leq K$ , and $\Sigma$ is the block diagonal covariance matrix, whose diagonal block is given by $\Sigma_{k}$ for $k=1,\cdots,K$ with $(\Sigma_{k})_{jl}=0.5^{|j-l|}$ for $1\leq j,l\leq p_{k}$ . The regression parameters are given by $\mbox{\boldmath{$ \beta $}}_{{\cal G}_{1}}=(1,-0.8,0.6,0,0,-1.5,-0.5,1.2,0\cdots,0)^{\top}\in\mathbb{R}^{p_{1}}$ and $\mbox{\boldmath{$ \beta $}}_{{\cal G}_{k}}={\bf 0}_{p_{k}}$ for $k=2,\cdots,K$ , and $\epsilon_{i}$ s are independently generated from $N(0,0.5^{2}).$

For a fair comparison of computational time, we control for the maximum number of kinks in the solution path. Due to the use of the $l_{2}$ penalty function in the high-dimensional cases, genlasso produces less sparse solutions with more kinks, even when the degree of freedom achieves maximum in the primal problem. To prevent redundant computations for genlasso, we bind the maximum number of kinks for which comlasso achieves the maximum degree of freedom in the primal problem. Additionally, we do not include the computation time of reparametrization required for genlasso. We use R for the comlasso algorithm (see https://github.com/jenjong/ComLasso) and implement genlasso and comlasso using a PC with 3.8 Ghz CPU and 16 GB memory under a Win 10 operating system. Computation time is measured by 20 repetitions.

In the first simulation, we let $K=1$ and vary $p$ and $n$ . Table 1 presents the average computation times and standard errors for each $p$ and $n$ . Comlasso is efficient compared to genlasso, especially when $p$ is large. In addition, comlasso is less affected by $p$ than genlasso. This may be because the main computational burden in comlasso arises in the matrix inversion to determine direction.

In the second simulation, we fix $p=1000$ and vary $n$ and $K$ . We let $p_{k}=p/K$ for $k=1,\cdots,K$ . Table 2 summarizes the simulation results. Comlasso still performs well in the high-dimensional case. As $K$ increases, more computations are required for checking the violation of the KKT conditions (7), so that comlasso becomes slightly slower to produce the solution path. Genlasso seems to be faster as $K$ increases, that is, the dimension of $\mbox{null}(D^{\top})$ increases for a fixed $n$ .

Remark. (comparison of sparsity) Since genlasso (Tibshirani and Taylor,, 2011) uses the $l_{2}$ penalty function, its solution is less sparse than the optimal solution. The sparse solution can be obtained by proper thresholding and most calibrated solutions are close to those of comlasso, at least in our experiments. However, there are no studies for the determination of the threshold level in this case.

3.3 Real data analysis

We apply the comlasso algorithm to estimate a classification model that predicts the incidence of bloodstream infections (BSI) in patients. We use the microbiome dataset (Montassier et al.,, 2016), which consists of 11 BSI patients and 17 non-BSI patients and amounts to 3837 operational taxonomic units (OTUs) of microbiomes, that is, pragmatic proxies for microbial “species,” identified by DNA sequencing. The 3837 OTUs belong to one of 111 genera, which in turn belong to 6 phyla. In the analysis, we exclude the two phyla and the corresponding genera and OTUs because they only have a single OTU. As a result, the 3833 OTUs belonging to the 107 genera and 4 phyla are used for further analysis.

We aggregate the amounts of OTUs in each genus and use them as covariates. That is, the data consist of one binary response and 107 genus-level amounts of OTUs. From this dataset, we compose two compositional datasets. The first one is to take the proportions of the 107 genus-level amounts of the OTUs (i.e. $K=1$ and $p=107$ ).

The second dataset considers the proportions between genera belonging to the same phylum. There are four types of phyla (Actinobacteria, Bacteroidetes, Firmicutes, and Proteobacteria), Which include 20, 10, 62, and 15 genera, respectively, and let the index set of genera denoting their phylum levels be ${\cal G}_{k}\subset\{1,\cdots,p\}$ for $k=1,\cdots,4$ . Therefore, the dimensions of the compositional predictors are $K=4$ , and $(p_{1},p_{2},p_{3},p_{4})=(20,10,62,15)$ . The averages of genus-level aggregated OTUs in each phylum are $45.14$ , $118.71$ , $50.11$ , and $3.28,$ , respectively, being quite different from each other. Hence, the proportions separately calculated for each phylum in the second dataset are quite different from the proportions among all genera in the first dataset. Note that the compositional regression model for the second dataset is a special case of the compositional regression model for the first dataset with additional constraints on the regression coefficients of each phylum. Therefore, using the proportions in each phylum as compositional predictors can reflect the information about the phylogenetic tree into the classification model.

We use the quadratic hinge loss (see the Appendix) to measure the predictive performance of the estimated models by leave-one-out cross validation. Additionally, we apply the stability selection procedure (Meinshausen and Bühlmann,, 2010) to assess the importance of predictors.

Figure 2 presents the leave-one-out errors of the three models, comlassoA (for the first dataset), comlassoB (for the second dataset,) and the non-regularized model (Montassier et al.,, 2016), where the model is fitted with the selected predictors by the Wilcoxon test in the first dataset. The figure illustrates that comlassoA and comlassoB show lower cross validation errors than the non-regularized model and that comlassoB tends to outperform comlassoA. Additionally, the choice of $\gamma$ in the quadratic hinge loss does not affect the results significantly.

Figure 3 draws the entire solution paths for each phylum of comlassoA and comlassoB for $\gamma=-0.2$ . The horizontal axis denotes the value of $\|\mbox{\boldmath{$ \beta $}}_{{\cal G}_{k}}(\lambda)\|_{1}/\|\mbox{\boldmath{$ \beta $}}_{{\cal G}_{k}}(\lambda_{\max})\|_{1}$ . Figure 3 also shows substantial differences among the estimated coefficients of the genera in the Proteobacteria phylum group. Specifically, for a large $\lambda$ , that is, for a small $\|\mbox{\boldmath{$ \beta $}}(\lambda)\|_{1}/\|\mbox{\boldmath{$ \beta $}}(\lambda_{max})\|_{1}$ , comlassoB gives more sparse coefficients at the phylum level than comlassoA. As previously pointed out, the average number of OTUs in each genus in the Proteobacteria group is significantly lower than the others. This means that the relative ratios seriously depend on the set where ratios are computed. Therefore, the coefficients corresponding to the Proteobacteria group are activated in an earlier step in comlassoA than comlassoB. This can be explained by a normalization of composite predictors, which reflects the structure of the phylum as prior knowledge of features.

Finally, to evaluate the importance of predictors we use the stability selection procedure (Meinshausen and Bühlmann,, 2010). Specifically, we use 1000 subsamples for half of the samples and fit the randomized lasso estimates with weakness parameter $0.5$ for each subsample. The regularization parameter of the randomized lasso is chosen by five-fold cross validation. Comlasso is applied to implement the randomized lasso algorithm. Table 1 summarizes the 10 genera with the highest selection probabilities in comlassoA and comlassoB, as well as the top-10 genera with the smallest p-values in the Wilcoxon test. Table 1 also shows that Lactobacillus was selected as the most important genus for both comlassoA and comlassoB. The Wilcoxon test also indicates that Lactobacillus is significant in the top-10 rank in terms of $p$ -value. Multiple literatures showed Lactobacillus can cause serious infections of the bloodstream (Cannon et al.,, 2005; Salminen et al.,, 2002).

Table 3 indicates that the multiple genus in the Proteobacteria phylum are selected by comlassoA and Wilcoxon test. However, comlassoB does not select any genus included in the phylum within the top-10 ranked predictors. As previously discussed, this selection pattern of comlassoB is partly owed to a normalization in compositional data. Considering predictive performance, we may conclude that it is worth investigating additional biological evidence based on the results of comlassoB under larger samples.

4 Concluding remarks

We developed a solution path algorithm for an $l_{1}$ regularized regression model with a compositional predictor in the high-dimensional case. Comlasso generalizes the equiangular direction of the LARs on a specific linear space and directly produces the entire solution path under the primal problem. Compared with genlasso, comlasso has the advantage of not only numerical precision but also computational efficiency in the high-dimensional regression model. Moreover, comlasso is easy to be extended to the regularized regression problem with almost quadratic loss, such as the Huberized loss and expectile loss functions. Because comlasso does not require continuously twice differentiability of the loss function, it can provide a new, efficient method for optimizing (1), where the solution path algorithm of Gaines et al., (2018) is not covered.

Additionally, we elaborated a way of monitoring the feasibility of the dual parameters corresponding to the equality constraints. In our case, verifying the condition requires only finite intersections of half intervals. Although we assume a special structure of the considered equality constraint, the proposed method to monitor dual feasibility can be extended to a general linear constraint, in which step size is computed by linear programming.

As an application of comlasso, we used the $l_{2}$ hinge loss for the classification problem and analyzed the microbiome dataset. We also considered two different types of constraints and compared the predictive performances of the considered models. The numerical results indicate that the regularized models show better performance than the non-regularized one. Additionally, we implemented stability selection to evaluate the importance of predictors.

Appendix A Appendix

A.1 Examples

Example 1.

Loss function in regression

•

Quadratic loss is given by $l(y,{\bf x}^{\top}\mbox{\boldmath{$ \beta $}})=r^{2}/2,$ and, in this case, $a(r)=1/2$ and $b(r)=c(r)=0$ .

•

The asymmetric $l_{2}$ loss with $h\in(0,1)$ is given by

[TABLE]

Where the corresponding constants are piecewise defined as $a(r)=(1-h),b(r)=c(r)=0$ for $r\leq 0,$ and $a(r)=h,b(r)=c(r)=0$ for $r>0$ (Aigner et al.,, 1976).

•

The Huberized loss function with a fixed knot $h$ is given by

[TABLE]

where the corresponding constants are piecewise defined as $a(r)=1/2,b(r)=c(r)=0$ for $|r|\leq h,$ and $a(r)=0,b(r)=h{\rm sign}(r),c(r)=-h^{2}/2$ for $|r|>h$ (Rosset and Zhu,, 2007).

Example 2.

Loss function in classification

•

The squared hinge loss function is given by $l(y,{\bf x}^{\top}\mbox{\boldmath{$ \beta $}})=\big{(}\max(0,1-r)\big{)}^{2}/2$ , in which $a(r)=b(r)=c(r)=0$ for $r>1$ and $a(r)=1/2,b(r)=-1,c(r)=1/2$ for $r<1$ (Lee and Lin,, 2013).

•

The Huberized loss function with a fixed knot $h(\leq 1)$ is given by

[TABLE]

in which $a(r)=0$ , $b(r)=0$ , $c(r)=0$ for $r\geq 1,$ and $a(r)=1/2$ , $b(r)=-1$ , $c(r)=0$ for $h\leq r\leq 1,$ $a(r)=0$ , $b(r)=h-1$ , $c(r)=0$ for $r\leq h$ (Rosset and Zhu,, 2007).

Bibliography26

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Aigner et al., (1976) Aigner, D. J., Amemiya, T., and Poirier, D. J. (1976). On the estimation of production frontiers: maximum likelihood estimation of the parameters of a discontinuous density function. International Economic Review , pages 377–396.
2Aitchison, (1986) Aitchison, J. (1986). The Statistical Analysis of Compositional Data . Chapman & Hall, Ltd., London, UK.
3Aitchison and Bacon-shone, (1984) Aitchison, J. and Bacon-shone, J. (1984). Log contrast models for experiments with mixtures. Biometrika , 71(2):323–330.
4Bondell and Reich, (2009) Bondell, H. D. and Reich, B. J. (2009). Simultaneous factor selection and collapsing levels in anova. Biometrics , 65(1):169–177.
5Brodie et al., (2009) Brodie, J., Daubechies, I., De Mol, C., Giannone, D., and Loris, I. (2009). Sparse and stable markowitz portfolios. Proceedings of the National Academy of Sciences , 106(30):12267–12272.
6Buccianti et al., (2006) Buccianti, A., Mateu-Figueras, G., and Pawlowsky-Glahn, V. (2006). Compositional data analysis in the geosciences: from theory to practice. Geological Society of London.
7Cannon et al., (2005) Cannon, J. P., Lee, T. A., Bolanos, J. T., and Danziger, L. H. (2005). Pathogenic relevance of lactobacillus: a retrospective review of over 200 cases. European Journal of Clinical Microbiology and Infectious Diseases , 24(1):31–40.
8Efron et al., (2004) Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least angle regression. Ann. Statist. , 32(2):407–499.

TL;DR

Contribution

Findings

Abstract

Peer Reviews

Videos

Taxonomy

Primal path algorithm for compositional data analysis

Abstract

1 Introduction

2 Primal path algorithm for comlasso

2.1 KKT conditions and initialization of comlasso

2.2 Finding the direction of the solution path

2.3 Computation of the step size

2.4 Uniqueness of the solution path

Theorem 1**.**

Proof.

3 Numerical examples

3.1 Toy example analysis

3.2 Speed comparison

3.3 Real data analysis

4 Concluding remarks

Appendix A Appendix

A.1 Examples

Example 1**.**

Example 2**.**

Theorem 1.

Example 1.

Example 2.