Sparse estimation for case-control studies with multiple subtypes of cases

Nadim Ballout; Cedric Garcia; Vivian Viallon

arXiv:1901.01583·stat.ME·January 23, 2019

TL;DR

This paper compares methods for analyzing case-control studies with multiple subtypes, proposing adaptations of stratified logistic regression and data shared lasso, and demonstrating their effectiveness through simulations and real data.

Contribution

It introduces a novel approach combining stratified conditional logistic regression with data shared lasso for subtype analysis in case-control studies.

Findings

1. Data shared lasso improves estimation when subtypes are homogeneous.

2. $L_{1}$ -penalized estimation under the symmetric multinomial logistic formulation coincides with a data shared lasso version of the reference-category formulation (with $\tau_{k}=1$ for all $k$ ).

3. Proper modeling of subtype homogeneity enhances analysis accuracy.

Equations

$${\rm I}\kern-1.19995pt{\rm P}\big(Y^{(k)}_{\ell,1}=1\mid Y^{(k)}_{\ell,1}+Y^{(k)}_{\ell,2}=1,{\bf x}^{(k)}_{\ell,1},{\bf x}^{(k)}_{\ell,2}\big)=\frac{1}{1+\exp\{{\bm{\delta}}_{k}^{*T}({\bf x}^{(k)}_{\ell,1}-{\bf x}^{(k)}_{\ell,2})\}}.$$

$$L^{(cond)}_{k}({\bm{\delta}}_{k})=-[\log\{{\bm{1}}_{m_{k}}+\exp({\bm{\Delta}}^{(k)}{\bm{\delta}}_{k})\}]^{T}{\bm{1}}_{m_{k}}$$

$$\bar{\bm{\Delta}}=\left(\begin{array}{ccc}{\bm{\Delta}}^{(1)}&\ldots&{\bf 0}_{m_{1},p}\\ \vdots&\ddots&\vdots\\ {\bf 0}_{m_{K-1},p}&\ldots&{\bm{\Delta}}^{(K-1)}\end{array}\right).$$

$$-[\log\{{\bm{1}}_{m_{k}}+\exp({\bm{\Delta}}^{(k)}{\bm{\delta}}_{k})\}]^{T}{\bm{1}}_{m_{k}}-\lambda\|{\bm{\delta}}_{k}\|_{1}$$

$$-[\log\{{\bm{1}}_{m}+\exp(\bar{\bm{\Delta}}{\bm{\delta}})\}]^{T}{\bm{1}}_{m}-\lambda\|{\bm{\delta}}\|_{1}$$

$$\delta^{*}_{k,j}=\mu^{*}_{j}+\gamma^{*}_{k,j}.$$

$$\sum_{k\in[K-1]}L^{(cond)}_{k}(\bm{\mu}+{\bm{\gamma}}_{k})-\lambda\Big(\|\bm{\mu}\|_{1}+\sum_{k=1}^{K-1}\tau_{k}\|{\bm{\gamma}}_{k}\|_{1}\Big).$$

$$\mu_{j}=\mathop{\rm argmin}_{m}\Big\{|m|+\sum_{k\in[K-1]}|\delta_{k,j}-m|\Big\}={\rm median}(\delta_{1,j},\ldots,\delta_{K-1,j},0).$$

$$\sum_{k\in[K-1]}L^{(cond)}_{k}(\bm{\mu}+{\bm{\gamma}}_{k})-\lambda\Big(\|\bm{\mu}\|_{1}+\sum_{k=1}^{K-1}\tau_{k}\|{\bm{\gamma}}_{k}\|_{1}\Big)=-[\log\{{\bm{1}}_{m}+\exp(\tilde{\bm{\Delta}}_{{\bm{\tau}}}{\bm{\Gamma}})\}]^{T}{\bm{1}}_{m}-\lambda\|{\bm{\Gamma}}\|_{1}$$

$$\tilde{\bm{\Delta}}_{{\bm{\tau}}}=\left(\begin{array}{cccc}{\bm{\Delta}}^{(1)}&\frac{{\bm{\Delta}}^{(1)}}{\tau_{1}}&\ldots&{\bf 0}_{m_{1},p}\\ \vdots&\vdots&\ddots&\vdots\\ {\bm{\Delta}}^{(K-1)}&{\bf 0}_{m_{K-1},p}&\ldots&\frac{{\bm{\Delta}}^{(K-1)}}{\tau_{K-1}}\end{array}\right).$$

$${\rm I}\kern-1.19995pt{\rm P}(Y=k\mid{\bf x}={\bf x}_{0})=\frac{\exp({\bf x}_{0}^{T}{\bm{\beta}}^{*}_{k})}{\sum_{\ell=1}^{K}\exp({\bf x}_{0}^{T}{\bm{\beta}}^{*}_{\ell})},$$

$${\rm I}\kern-1.19995pt{\rm P}(Y=k\mid{\bf x}={\bf x}_{0})=\frac{\exp({\bf x}_{0}^{T}{\bm{\delta}}^{*}_{k})^{{\rm 1}\kern-2.79999pt{\rm I}(k\neq K)}}{1+\sum_{\ell=1}^{K-1}\exp({\bf x}_{0}^{T}{\bm{\delta}}^{*}_{\ell})}.$$

$$p_{\ell}({\bf x}_{0};{\bf u}_{1},\ldots,{\bf u}_{K})=\frac{\exp({\bf x}_{0}^{T}{\bf u}_{\ell})}{\sum_{k=1}^{K}\exp({\bf x}_{0}^{T}{\bf u}_{k})}.$$

$$\max_{{\bm{\delta}}_{1},\ldots,{\bm{\delta}}_{K-1}}\Big[L({\bm{\delta}}_{1},\ldots,{\bm{\delta}}_{K-1})-\lambda\sum_{k=1}^{K-1}\|{\bm{\delta}}_{k}\|_{1}\Big].$$

$${\mathcal{L}}({\bm{\beta}}_{1},\ldots,{\bm{\beta}}_{K})=\frac{1}{n}\sum_{i=1}^{n}\log\{p_{y_{i}}({\bf x}_{i};{\bm{\beta}}_{1},\ldots,{\bm{\beta}}_{K})\}.$$

$${\mathcal{L}}({\bm{\beta}}_{1},\ldots,{\bm{\beta}}_{K})-\lambda\sum_{k=1}^{K}\|{\bm{\beta}}_{k}\|_{1}$$

$${\rm median}(\beta_{1,j},\ldots,\beta_{K,j})=0,\quad\text{for all }j\in[p].$$

$$L(\bm{\mu}+{\bm{\gamma}}_{1},\ldots,\bm{\mu}+{\bm{\gamma}}_{K-1})-\lambda\Big(\|\bm{\mu}\|_{1}+\sum_{k=1}^{K-1}\|{\bm{\gamma}}_{k}\|_{1}\Big).$$

$${\mathcal{L}}({\bm{\beta}}_{1},\ldots,{\bm{\beta}}_{K})-\lambda\sum_{k=1}^{K}\|{\bm{\beta}}_{k}\|_{1}\geq{\mathcal{L}}({\bm{\beta}}_{1},\ldots,{\bm{\beta}}_{K})-\lambda\sum_{k=1}^{K}\|{\bm{\beta}}_{k}-\bm{\nu}\|_{1}.$$

$$\begin{array}{rcl}{\mathcal{L}}({\bm{\beta}}_{1},\ldots,{\bm{\beta}}_{K})-\lambda\sum_{k=1}^{K}\|{\bm{\beta}}_{k}\|_{1}&=&{\mathcal{L}}({\bm{\gamma}}_{1},\ldots,{\bm{\gamma}}_{K-1},-\bm{\mu})-\lambda\Big(\|-\bm{\mu}\|_{1}+\sum_{k=1}^{K-1}\|{\bm{\gamma}}_{k}\|_{1}\Big)\\ &=&{\mathcal{L}}(\bm{\mu}+{\bm{\gamma}}_{1},\ldots,\bm{\mu}+{\bm{\gamma}}_{K-1},{\bf 0}_{p})-\lambda\Big(\|\bm{\mu}\|_{1}+\sum_{k=1}^{K-1}\|{\bm{\gamma}}_{k}\|_{1}\Big)\\ &=&L(\bm{\mu}+{\bm{\gamma}}_{1},\ldots,\bm{\mu}+{\bm{\gamma}}_{K-1})-\lambda\Big(\|\bm{\mu}\|_{1}+\sum_{k=1}^{K-1}\|{\bm{\gamma}}_{k}\|_{1}\Big),\end{array}$$

$${\cal Y}=\left(\begin{array}{c}{\cal Y}^{(1)}\\ \vdots\\ {\cal Y}^{(K-1)}\end{array}\right)\quad{\rm and}\quad{\bm{\mathcal{X}}}=\left(\begin{array}{ccc}{\bf X}&\ldots&{\bf 0}_{n,p}\\ \vdots&\ddots&\vdots\\ {\bf 0}_{n,p}&\ldots&{\bf X}\end{array}\right),$$

Taxonomy

Topics: Statistical Methods and Inference · Advanced Causal Inference Techniques · Statistical Methods and Bayesian Inference

Full text

Sparse estimation for case-control studies with multiple subtypes of cases. Nadim Ballout, Cedric Garcia and Vivian Viallon (Corresponding Author: [email protected])

Abstract

The analysis of case-control studies with several subtypes of cases is increasingly common, e.g. in cancer epidemiology. For matched designs, we show that a natural strategy is based on a stratified conditional logistic regression model. Then, to account for the potential homogeneity among the subtypes of cases, we adapt the ideas of data shared lasso, which has been recently proposed for the estimation of regression models in a stratified setting. For unmatched designs, we compare two standard methods based on L1-norm penalized multinomial logistic regression. We describe formal connections between these two approaches, from which practical guidance can be derived. We show that one of these approaches, which is based on a symmetric formulation of the multinomial logistic regression model, actually reduces to a data shared lasso version of the other. Consequently, the relative performance of the two approaches critically depends on the level of homogeneity that exists among the subtypes of cases: more precisely, when homogeneity is moderate to high, the non-symmetric formulation with controls as the reference is not recommended. Empirical results obtained from synthetic data are presented, which confirm the benefit of properly accounting for potential homogeneity under both matched and unmatched designs. We also present preliminary results from the analysis of a case-control study nested within the EPIC cohort, where the objective is to identify metabolites associated with the occurrence of subtypes of breast cancer.

1 Introduction

The rise of -omics and other high-dimensional data (image, reimbursement claims, etc.) in medical science gives researchers access to numerous features that may predict outcomes of interest, like cancer development. However, this relatively cheap source of information comes at a price: the curse of dimensionality makes multivariate modeling of such data impossible without further assumptions. In other words, some prior information has to be properly accounted for to reduce dimensionality and accurately estimate high-dimensional multivariate models. Under parametric regression models, one common prior information, or assumption, is sparsity of the parameter vector. The use of $L_{1}$ -norm regularized approaches has been shown to yield optimal estimates when the true vector is sparse, under technical assumptions on the design matrix [Wainwright, 2009, Bach et al., 2010, Bickel et al., 2009]. As a result, $L_{1}$ penalized logistic models [McCullagh and Nelder, 1989, Park and Hastie, 2007, Wu et al., 2009] are now standard tools when studying risk factors of a disease in a high-dimensional setting.

However, for many diseases that were primarily considered as one single disease (breast cancer, colorectal cancer), several subtypes have now been recognized. They can either be histological, as for breast cancer, or anatomical, as for colorectal cancer. Even if commonalities may exist among these subtypes, they have their own specificities regarding both prognosis and etiology. For example, the cancer epidemiology community is now increasingly concerned with the identification of subtype specific risk factors for various cancer sites. This is the case in our motivating example presented in Section 5, which deals with the identification of metabolites associated with breast cancer subtypes, based on a matched case-control study nested in the EPIC (European Prospective Investigation into Cancer and nutrition) cohort study.

For unmatched case-control studies with multiple subtypes of cases, a natural extension of the binary logistic regression model is the multinomial logistic regression model [McCullagh and Nelder, 1989, Begg and Gray, 1984]. If $K-1$ denotes the number of subtypes for a given integer $K>1$ , inference under this model consists in estimating $K-1$ parameter vectors of size $p$ , where $p$ denotes the number of covariates (which may include interactions as well as an intercept term). On the other hand, for matched case-control studies with $K-1$ subtypes, the total sample can be decomposed into $K-1$ sub-samples, one for each subtype. Assuming for simplicity a 1:1 matching design, each sub-sample is made of pairs composed by one case of one particular subtype and one matched control. Then, each sub-sample can be analyzed separately, e.g. by applying a sparse conditional logistic regression model [Avalos et al., 2015]. Again, the overall analysis boils down to the estimation of $K-1$ parameter vectors of size $p$ .

Because commonalities exist between the subtypes of cases, some level of homogeneity is expected among those $K-1$ parameter vectors, in both the matched and unmatched settings. Properly accounting for this homogeneity is key to reduce the dimensionality and improve estimation efficiency. In the matched setting, $K-1$ sparse conditional logistic regression models have to be estimated on $K-1$ sub-groups, where these sub-groups are defined according to the subtype of the case of each pair (see Section 2 for more details). Then, inference falls into the framework of stratified regression modeling for which data shared lasso has recently been developed as a way to account for the expected homogeneity among the $K-1$ parameter vectors to be estimated [Gross and Tibshirani, 2016, Ollier and Viallon, 2017]. Under linear models, data shared lasso has been shown to enjoy good theoretical and empirical properties [Ollier and Viallon, 2017]. In addition, data shared lasso is easy to implement because it can be rewritten as a standard lasso after a simple transformation of the original data.

In this article, we will show how the ideas of data shared lasso can be applied to analyze case-control studies when multiple subtypes of cases are present. In Section 2, we start with the matched design and we show how data shared lasso can be used to estimate stratified sparse conditional logistic models. Section 3 is devoted to the unmatched setting, under which sparse multinomial logistic regression models are natural, as mentioned above. Actually, two formulations of sparse multinomial logistic regression models have been proposed in the literature. The first one, which we will refer to as the standard one, relies on the selection of a reference category and the estimation of $K-1$ parameter vectors [Begg and Gray, 1984]. Alternatively, a more symmetric formulation of the model can be adopted, where no reference category has to be selected and $K$ parameter vectors are to be estimated [Friedman et al., 2010]. Unpenalized estimation is impossible under this over-parametrized model due to a clear lack of identifiability. However, $L_{1}$ -penalized estimation can be performed, as implemented in the popular glmnet R package. To the best of our knowledge, no clear guidance exists in the literature on how to choose between the two formulations of sparse multinomial logistic regression models. We will formally establish that the $L_{1}$ -penalized strategies associated with the two formulations differ in the way they account for potential homogeneity among the parameter vectors to be estimated. More precisely, we show in Section 3 that $L_{1}$ -penalized estimates derived under the symmetric formulation coincide with the estimates derived under the standard formulation when using a data shared lasso penalty. In Section 4, we present results from a simulation study, which illustrate the interest of data shared lasso estimates when homogeneity exists among the parameter vectors to be estimated, under both the matched and unmatched settings. Section 5 is devoted to our illustrative example. Finally, concluding remarks are provided in Section 6.

2 Matched case-control studies with multiple subtypes of cases and stratified conditional logistic models

Conditional logistic regression is a standard tool for the analysis of matched case-control studies when a single type of cases is considered [Pearce, 2016, Rothman et al., 2008]. Here, we show how the ideas of data shared lasso can be applied to handle the situation where $K-1$ subtypes of cases are present, for some given integer $K>1$ .

2.1 Setting

Consider a matched case-control study where information about subtype is available for each case. We denote the number of subtypes by $K-1$ , for some given integer $K>1$ , and we will use the notation $[I]=\{1,\ldots,I\}$ for any integer $I\geq 1$ . For simplicity, we further assume a 1:1 matched case-control design where matching is based on some variables ${\bf W}\in{\rm I}\kern-1.4pt{\rm R}^{q}$ . Denoting by $n\geq 1$ the total sample size, the sample then consists of $m=n/2$ pairs of individuals. In this matched setting with $K-1$ subtypes of cases, the total sample can be divided into $K-1$ sub-samples. For any $k\in[K-1]$ , the $k$ -th sub-sample ${\cal M}_{k}$ is made of the $m_{k}$ pairs, each composed of one case of subtype $k$ and its matched control. These sub-samples naturally define sub-groups, or strata, in the total sample. We should however stress that these strata differ from the “usual” strata defined in the context of conditional logistic regression models, the latter corresponding to case-control pairs. In addition, for future use, we introduce the categorical variable $S$ which takes values in $[K-1]$ and indicates the sub-sample to which an observation belongs. In other words, $S=k$ for all observations in ${\cal M}_{k}$ .

Let us first focus on the $k$ -th stratum ${\cal M}_{k}$ , which is made of cases of subtype $k$ and their matched controls. For any matched pair $\ell$ of observations belonging to ${\cal M}_{k}$ , denote by ${\bf x}^{(k)}_{\ell,i}\in{\rm I}\kern-1.4pt{\rm R}^{p}$ , for some $p\geq 1$ , and $Y^{(k)}_{\ell,i}\in\{0,1\}$ , the two vectors of covariates and the two disease status indicators for the two observations $i\in\{1,2\}$ of the pair. Then, the association between risk factors and subtype $k$ of the disease can be studied by applying a conditional logistic regression model, restricted to observations in stratum ${\cal M}_{k}$ . Assume without loss of generality that data are arranged in such a way that the observation indexed $i=1$ is the case in each pair $\ell$ , that is $Y^{(k)}_{\ell,1}=1$ for all pairs $\ell$ . Then, as usual under the conditional logistic regression model, we assume the existence of a vector ${\bm{\delta}}_{k}^{*}\in{\rm I}\kern-1.4pt{\rm R}^{p}$ such that the probability that the case is the one observed in pair $\ell$ , given that a case is observed in pair $\ell$ , writes [Greenland, 2000]

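$${\rm I}\kern-1.19995pt{\rm P}\big(Y^{(k)}_{\ell,1}=1\mid Y^{(k)}_{\ell,1}+Y^{(k)}_{\ell,2}=1,{\bf x}^{(k)}_{\ell,1},{\bf x}^{(k)}_{\ell,2}\big)=\frac{1}{1+\exp\{{\bm{\delta}}_{k}^{*T}({\bf x}^{(k)}_{\ell,1}-{\bf x}^{(k)}_{\ell,2})\}}.\tag{1}$$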

Vector ${\bm{\delta}}_{k}^{*}$ can then be estimated by maximizing the log conditional likelihood ${\bm{\delta}}_{k}\longrightarrow L^{(cond)}_{k}({\bm{\delta}}_{k})$ restricted to pairs in ${\cal M}_{k}$ , which is defined as

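$$L^{(cond)}_{k}({\bm{\delta}}_{k})=-[\log\{{\bm{1}}_{m_{k}}+\exp({\bm{\Delta}}^{(k)}{\bm{\delta}}_{k})\}]^{T}{\bm{1}}_{m_{k}},\tag{2}$$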

where ${\bm{1}}_{m_{k}}=(1,\ldots,1)^{T}\in{\rm I}\kern-1.4pt{\rm R}^{m_{k}}$ and ${\bm{\Delta}}^{(k)}$ is the $m_{k}\times p$ matrix, whose $\ell$ -th row corresponds to $({\bf x}^{(k)}_{\ell,1}-{\bf x}^{(k)}_{\ell,2})$ , for $\ell\in[m_{k}]$ .

Equivalently, estimation of each ${\bm{\delta}}^{*}_{k}$ can be performed simultaneously, though still independently, by maximizing the criterion $\sum_{k\in[K-1]}L^{(cond)}_{k}({\bm{\delta}}_{k})=-[\log\{{\bm{1}}_{m}+\exp(\bar{\bm{\Delta}}{\bm{\delta}})\}]^{T}{\bm{1}}_{m}$ over ${\bm{\delta}}=({\bm{\delta}}^{T}_{1},\ldots,{\bm{\delta}}^{T}_{K-1})^{T}\in{\rm I}\kern-1.4pt{\rm R}^{(K-1)p}$ , with

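$$\bar{\bm{\Delta}}=\left(\begin{array}{ccc}{\bm{\Delta}}^{(1)}&\ldots&{\bf 0}_{m_{1},p}\\ \vdots&\ddots&\vdots\\ {\bf 0}_{m_{K-1},p}&\ldots&{\bm{\Delta}}^{(K-1)}\end{array}\right).$$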
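The identity between the sum of the per-stratum criteria and the stacked criterion $-[\log\{{\bm{1}}_{m}+\exp(\bar{\bm{\Delta}}{\bm{\delta}})\}]^{T}{\bm{1}}_{m}$ can be checked numerically. The following is a minimal sketch, assuming NumPy and randomly generated difference matrices; the function names `cond_loglik` and `block_diag` and the toy dimensions are illustrative choices, not part of the original paper, and the criterion is coded exactly as it is written in the text.

```python
import numpy as np

def cond_loglik(Delta, delta):
    """Criterion as written in the text: -[log{1 + exp(Delta @ delta)}]^T 1."""
    # np.logaddexp(0, z) evaluates log(1 + exp(z)) without overflow.
    return -np.sum(np.logaddexp(0.0, Delta @ delta))

def block_diag(blocks):
    """Place the per-stratum difference matrices on the diagonal of one matrix."""
    rows = sum(b.shape[0] for b in blocks)
    cols = sum(b.shape[1] for b in blocks)
    out = np.zeros((rows, cols))
    r = c = 0
    for b in blocks:
        out[r:r + b.shape[0], c:c + b.shape[1]] = b
        r += b.shape[0]
        c += b.shape[1]
    return out

rng = np.random.default_rng(0)
p, sizes = 3, [4, 5]  # hypothetical: K - 1 = 2 strata with m_1 = 4 and m_2 = 5 pairs
Deltas = [rng.normal(size=(m, p)) for m in sizes]
deltas = [rng.normal(size=p) for _ in sizes]

# Sum of the per-stratum criteria versus the single stacked criterion.
separate = sum(cond_loglik(D, d) for D, d in zip(Deltas, deltas))
stacked = cond_loglik(block_diag(Deltas), np.concatenate(deltas))
print(np.isclose(separate, stacked))
```

Because the block-diagonal matrix never mixes rows of one stratum with coefficients of another, the stacked criterion separates into independent pieces, which is why the simultaneous estimation is still performed independently per subtype.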

2.2 Standard $L_{1}$ norm penalized estimation

Several packages have been developed to maximize a penalized version of criterion (2): for instance, cLogitLasso is available within the R software [Avalos et al., 2015]; the cLogitL1 package [Reid and Tibshirani, 2014] can also be used, although it is no longer maintained on CRAN. For appropriate values of the regularization parameter $\lambda$ , they can be used to maximize the following criterion over ${\bm{\delta}}_{k}\in{\rm I}\kern-1.4pt{\rm R}^{p}$ ,

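$$-[\log\{{\bm{1}}_{m_{k}}+\exp({\bm{\Delta}}^{(k)}{\bm{\delta}}_{k})\}]^{T}{\bm{1}}_{m_{k}}-\lambda\|{\bm{\delta}}_{k}\|_{1}\tag{3}$$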

to get a sparse estimate of ${\bm{\delta}}_{k}^{*}$ . They can also be used to maximize the “overall” criterion

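$$-[\log\{{\bm{1}}_{m}+\exp(\bar{\bm{\Delta}}{\bm{\delta}})\}]^{T}{\bm{1}}_{m}-\lambda\|{\bm{\delta}}\|_{1}$$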

over ${\bm{\delta}}\in{\rm I}\kern-1.4pt{\rm R}^{(K-1)p}$ to get a sparse estimate of ${\bm{\delta}}^{*}=({\bm{\delta}}_{1}^{*},\ldots,{\bm{\delta}}_{K-1}^{*})$ . These two strategies are strictly equivalent and return identical estimates. In particular, under both strategies, the estimation is performed independently on each stratum, that is, independently for each subtype. This is likely sub-optimal when subtypes have commonalities. Indeed, these commonalities are expected to translate into some homogeneity among vectors ${\bm{\delta}}^{*}_{1},\ldots,{\bm{\delta}}^{*}_{K-1}$ , which may lead to improved estimation efficiency if properly accounted for.

2.3 Data shared lasso

Data shared lasso was independently proposed by Gross and Tibshirani [2016] and Ollier and Viallon [2017] in the context of stratified regression models to account for the expected homogeneities among the parameter vectors to be estimated. The approach relies on the following over-parametrized decomposition for each parameter $\delta^{*}_{k,j}$ , for $k\in[K-1]$ and $j\in[p]$ ,

$$\delta^{*}_{k,j}=\mu^{*}_{j}+\gamma^{*}_{k,j}.\tag{4}$$

Here $\mu^{*}_{j}$ can be seen as the “global” parameter for covariate $j$ and is common to all subtypes, while $\gamma^{*}_{k,j}$ captures the variation of the parameter for subtype $k$ around this global parameter. Even if decomposition (4) is over-parametrized, estimates of $\mu^{*}_{j}$ and $\gamma^{*}_{k,j}$ for $k\in[K-1]$ and $j\in[p]$ can be derived by maximizing the following criterion over $\bm{\mu}=(\mu_{1},\ldots,\mu_{p})$ and the ${\bm{\gamma}}_{k}$ ’s, with ${\bm{\gamma}}_{k}=(\gamma_{k,1},\ldots,\gamma_{k,p})$ ,

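$$\sum_{k\in[K-1]}L^{(cond)}_{k}(\bm{\mu}+{\bm{\gamma}}_{k})-\lambda\Big(\|\bm{\mu}\|_{1}+\sum_{k=1}^{K-1}\tau_{k}\|{\bm{\gamma}}_{k}\|_{1}\Big).\tag{5}$$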

The $L_{1}$ -norm penalty $\|\bm{\mu}\|_{1}$ encourages sparsity of the vector of global parameters, while $\|{\bm{\gamma}}_{k}\|_{1}$ encourages homogeneity among vectors $\widehat{\bm{\delta}}_{k}$ defined as $\widehat{\bm{\delta}}_{k}=\widehat{\bm{\mu}}+\widehat{\bm{\gamma}}_{k}$ , for $k\in[K-1]$ . For appropriate values of the regularization parameters $\lambda$ and $\tau_{k}$ , data shared lasso allows the estimation of parameters $\widehat{\mu}_{j}$ under one of the infinitely many decompositions of the form (4). Any particular choice for $\tau_{k}$ leads to a particular “definition” of the estimated global parameter $\widehat{\mu}_{j}$ for covariate $j$ . Given this particular definition, data shared lasso returns estimates $\widehat{\bm{\delta}}_{1},\ldots,\widehat{\bm{\delta}}_{K-1}$ that are typically close to $\widehat{\bm{\mu}}=(\widehat{\mu}_{1},\ldots,\widehat{\mu}_{p})$ in the $L_{1}$ -norm sense. For instance, if $\tau_{k}=1$ for all $k\in[K-1]$ , it was shown that [Gross and Tibshirani, 2016, Ollier and Viallon, 2017]

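$$\widehat{\mu}_{j}=\mathop{\rm argmin}_{m}\Big\{|m|+\sum_{k\in[K-1]}|\widehat{\delta}_{k,j}-m|\Big\}={\rm median}(\widehat{\delta}_{1,j},\ldots,\widehat{\delta}_{K-1,j},0).$$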
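For $\tau_{k}=1$ , the global parameter for covariate $j$ minimizes $|m|+\sum_{k\in[K-1]}|\delta_{k,j}-m|$ over $m$ , and a minimizer of such a sum of absolute deviations is a median of the values $\delta_{1,j},\ldots,\delta_{K-1,j}$ augmented with $0$ . The sketch below checks this on made-up numbers by comparing the median with a brute-force grid search; the values in `deltas_j` and the grid are illustrative assumptions only.

```python
import numpy as np

def l1_objective(m, deltas):
    """|m| + sum_k |delta_k - m|, the quantity minimized by the global parameter."""
    return abs(m) + np.sum(np.abs(deltas - m))

# Hypothetical stratum-specific estimates for one covariate j (K - 1 = 4 subtypes).
deltas_j = np.array([0.8, 1.1, 0.9, -0.2])

# Median of the estimates augmented with 0.
m_med = np.median(np.append(deltas_j, 0.0))

# Brute-force search over a fine grid for comparison.
grid = np.linspace(-2.0, 2.0, 4001)
best_on_grid = min(l1_objective(m, deltas_j) for m in grid)

print(m_med, l1_objective(m_med, deltas_j), best_on_grid)
```

When the augmented list has an even number of entries the minimizer is not unique: every point between the two middle values attains the minimum, and `np.median` returns their midpoint.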

Moreover, several more standard approaches turn out to be special cases of data shared lasso. If $\tau_{k}=\infty$ for all $k$ , then $\widehat{\bm{\delta}}_{k}=\widehat{\bm{\mu}}$ for all $k\in[K-1]$ and data shared lasso reduces to the approach that consists in pooling all strata together; we will refer to this strategy as “Pooled”. “Pooled” overlooks the subtype specificities and generally leads to biased estimates of vectors ${\bm{\delta}}^{*}_{k}$ . On the other hand, for large enough values of $\lambda$ , we have $\widehat{\bm{\mu}}={\bf 0}$ and for appropriate values of parameters $\tau_{k}$ , data shared lasso reduces to estimating each vector ${\bm{\delta}}^{*}_{k}$ independently just as in (3) above; we will refer to this strategy as “Indep”. “Indep” overlooks the commonalities among the subtypes, hence typically leads to estimates with unnecessarily high variance. Finally, setting $\tau_{r}=\infty$ for one particular $r\in[K-1]$ corresponds to working under the constraint $\widehat{\bm{\delta}}_{r}=\widehat{\bm{\mu}}$ . In this case, data shared lasso reduces to another standard approach which consists in first selecting subtype $r$ as a reference, and then including interaction terms between each covariate and the indicator variables ${\rm 1}\kern-2.79999pt{\rm I}(S=k)$ for $k\neq r$ ; we will refer to this strategy as “Ref”. Note that, for any particular choice $r$ , the model complexity $C(r)$ is naturally defined as the number of non-zero parameters to be estimated, that is $C(r)=\|{\bm{\delta}}^{*}_{r}\|_{0}+\sum_{k\neq r}\|{\bm{\delta}}^{*}_{k}-{\bm{\delta}}^{*}_{r}\|_{0}$ , with $\|\cdot\|_{0}$ standing for the $L_{0}$ pseudo-norm. Consequently, the model complexity and estimation efficiency of “Ref” critically depend on the arbitrary choice of the reference stratum, that is the reference subtype in our case; see [Ollier and Viallon, 2017] for more details. Data shared lasso by-passes this arbitrary choice and, under stratified linear regression models, was shown to perform nearly as well as the oracular (and inapplicable) version of “Ref” based on an optimal and covariate-specific choice for the reference stratum.
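To illustrate how the complexity $C(r)$ of “Ref” depends on the reference stratum, the sketch below evaluates it for every possible $r$ on a small made-up set of stratum-specific parameter vectors (stored as the rows of `deltas`). The numbers are purely illustrative assumptions, and indices are zero-based in the code.

```python
import numpy as np

def complexity(deltas, r):
    """C(r): non-zeros of the reference vector plus non-zeros of each difference to it."""
    ref = deltas[r]
    diffs = sum(np.count_nonzero(d - ref) for k, d in enumerate(deltas) if k != r)
    return np.count_nonzero(ref) + diffs

# Hypothetical true parameters for K - 1 = 3 subtypes and p = 4 covariates.
# The first two subtypes are identical; the third differs on two covariates.
deltas = np.array([
    [1.0, 0.0, 2.0, 0.0],
    [1.0, 0.0, 2.0, 0.0],
    [1.0, 3.0, 0.0, 0.0],
])

costs = [complexity(deltas, r) for r in range(len(deltas))]
print(costs)  # [4, 4, 6]
```

In this toy configuration, taking the atypical third subtype as the reference requires six non-zero parameters instead of four, which is the kind of dependence on an arbitrary choice that data shared lasso is designed to avoid.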

Another nice property of data shared lasso is that it is readily implementable given any standard lasso solver. In particular, the data shared lasso criterion (5) above can be rewritten as

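$$\sum_{k\in[K-1]}L^{(cond)}_{k}(\bm{\mu}+{\bm{\gamma}}_{k})-\lambda\Big(\|\bm{\mu}\|_{1}+\sum_{k=1}^{K-1}\tau_{k}\|{\bm{\gamma}}_{k}\|_{1}\Big)=-[\log\{{\bm{1}}_{m}+\exp(\tilde{\bm{\Delta}}_{{\bm{\tau}}}{\bm{\Gamma}})\}]^{T}{\bm{1}}_{m}-\lambda\|{\bm{\Gamma}}\|_{1}$$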

with ${\bm{\tau}}=(\tau_{1},\ldots,\tau_{K-1})$ , ${\bm{\Gamma}}=(\bm{\mu}^{T},\tilde{\bm{\gamma}}_{1}^{T},\ldots,\tilde{\bm{\gamma}}^{T}_{K-1})$ and

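$$\tilde{\bm{\Delta}}_{{\bm{\tau}}}=\left(\begin{array}{cccc}{\bm{\Delta}}^{(1)}&\frac{{\bm{\Delta}}^{(1)}}{\tau_{1}}&\ldots&{\bf 0}_{m_{1},p}\\ \vdots&\vdots&\ddots&\vdots\\ {\bm{\Delta}}^{(K-1)}&{\bf 0}_{m_{K-1},p}&\ldots&\frac{{\bm{\Delta}}^{(K-1)}}{\tau_{K-1}}\end{array}\right).$$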

This criterion is exactly of the same form as (3): as a result, running cLogitLasso with the design matrix $\tilde{\bm{\Delta}}_{{\bm{\tau}}}$ returns a vector $(\widehat{\bm{\mu}}^{T},\hat{\tilde{\bm{\gamma}}}_{1}^{T},\ldots,\hat{\tilde{\bm{\gamma}}}^{T}_{K-1})$ from which data shared lasso estimates $\widehat{\bm{\delta}}_{k}=\widehat{\bm{\mu}}+\hat{\tilde{\bm{\gamma}}}_{k}/\tau_{k}$ can be derived for ${\bm{\delta}}^{*}_{k}$ , $k\in[K-1]$ .
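The rewriting as a standard lasso problem can be sketched as follows: the augmented design matrix has a first block column holding every ${\bm{\Delta}}^{(k)}$ and then one block column per stratum holding ${\bm{\Delta}}^{(k)}/\tau_{k}$ . The code assumes the rescaling $\tilde{\bm{\gamma}}_{k}=\tau_{k}{\bm{\gamma}}_{k}$ , which is inferred from the back-transformation $\widehat{\bm{\delta}}_{k}=\widehat{\bm{\mu}}+\hat{\tilde{\bm{\gamma}}}_{k}/\tau_{k}$ given in the text; the helper name `augmented_design` and the toy data are hypothetical. It only verifies the algebra (same linear predictor, same penalty) and does not call any lasso solver.

```python
import numpy as np

def augmented_design(Deltas, taus):
    """Build [Delta^(k), 0, ..., Delta^(k)/tau_k, ..., 0] for each stratum and stack."""
    n_strata = len(Deltas)
    rows = []
    for k, (D, tau) in enumerate(zip(Deltas, taus)):
        blocks = [D] + [D / tau if j == k else np.zeros_like(D) for j in range(n_strata)]
        rows.append(np.hstack(blocks))
    return np.vstack(rows)

rng = np.random.default_rng(1)
p, sizes, taus = 3, [4, 5], [1.0, 2.0]  # hypothetical dimensions and weights
Deltas = [rng.normal(size=(m, p)) for m in sizes]
mu = rng.normal(size=p)
gammas = [rng.normal(size=p) for _ in sizes]

# Stacked parameter vector with rescaled deviations tau_k * gamma_k (assumption).
Gamma = np.concatenate([mu] + [tau * g for tau, g in zip(taus, gammas)])

# Linear predictors: augmented design versus stratum-by-stratum computation.
lin_aug = augmented_design(Deltas, taus) @ Gamma
lin_direct = np.concatenate([D @ (mu + g) for D, g in zip(Deltas, gammas)])

# Penalties: plain L1 norm of Gamma versus the weighted data shared lasso penalty.
pen_aug = np.sum(np.abs(Gamma))
pen_direct = np.sum(np.abs(mu)) + sum(tau * np.sum(np.abs(g)) for tau, g in zip(taus, gammas))

print(np.allclose(lin_aug, lin_direct), np.isclose(pen_aug, pen_direct))
```

Since both the linear predictor and the penalty agree, any solver for the standard $L_{1}$ -penalized criterion applied to the augmented matrix solves the data shared lasso problem, up to the rescaling of the deviations.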

We will illustrate the performance of data shared lasso when analyzing matched case-control studies with multiple subtypes of cases through simulated examples in Section 4, as well as through the analysis of a case-control study for breast cancer nested in the EPIC cohort in Section 5.

3 Unmatched case-control studies with multiple subtypes of cases and sparse multinomial logistic models

We now turn our attention to the unmatched setting. When $K-1$ subtypes of cases are present for some given integer $K>1$ , the outcome $Y$ can be modeled as a categorical variable, taking values in $[K]$ . Hereafter, we will assume that $Y=K$ for controls, while $Y=k$ for cases of subtype $k$ , for any $k\in[K-1]$ . When no natural order exists among the categories of $Y$ , the multinomial logistic regression model is a natural extension of the standard logistic regression model. Below, we will recall some basics about the multinomial logistic regression model. In particular, we will present two formulations of this model under which $L_{1}$ -norm penalized estimation can be performed. We will then establish the relationship between these two approaches, basing our arguments on the data shared lasso ideas.

3.1 The multinomial logistic regression model

For ease of notation, we will mostly focus on models with no intercept. Then, in its symmetric formulation, the multinomial logistic regression model assumes the existence of $K$ vectors $({\bm{\beta}}^{*}_{1},\ldots,{\bm{\beta}}^{*}_{K})\in({\rm I}\kern-1.4pt{\rm R}^{p})^{K}$ such that

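$${\rm I}\kern-1.19995pt{\rm P}(Y=k\mid{\bf x}={\bf x}_{0})=\frac{\exp({\bf x}_{0}^{T}{\bm{\beta}}^{*}_{k})}{\sum_{\ell=1}^{K}\exp({\bf x}_{0}^{T}{\bm{\beta}}^{*}_{\ell})},\tag{6}$$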

for any value ${\bf x}_{0}\in{\rm I}\kern-1.4pt{\rm R}^{p}$ of the covariate vector, with $p\geq 1$ . Because $\sum_{k\in[K]}{\rm I}\kern-1.19995pt{\rm P}(Y=k|{\bf x}={\bf x}_{0})=1$ for any ${\bf x}_{0}\in{\rm I}\kern-1.4pt{\rm R}^{p}$ , this formulation is over-parametrized and vectors ${\bm{\beta}}^{*}_{1},\ldots,{\bm{\beta}}^{*}_{K}$ in Equation (6) are defined up to a constant only. More precisely, if model (6) holds with vectors ${\bm{\beta}}^{*}_{1},\ldots,{\bm{\beta}}^{*}_{K}$ , then it holds with vectors ${\bm{\beta}}^{*}_{1}+\bm{\nu},\ldots,{\bm{\beta}}^{*}_{K}+\bm{\nu}$ as well, for any $\bm{\nu}\in{\rm I}\kern-1.4pt{\rm R}^{p}$ .

To resolve this identifiability issue, a standard solution consists in selecting a reference category, say the $K$ -th one without loss of generality. This leads to the constraint ${\bm{\beta}}^{*}_{K}={\bf 0}_{p}$ in the formulation above, and the multinomial logistic regression model then reduces to assuming the existence of $({\bm{\delta}}^{*}_{1},\ldots,{\bm{\delta}}^{*}_{K-1})\in({\rm I}\kern-1.4pt{\rm R}^{p})^{K-1}$ such that

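$${\rm I}\kern-1.19995pt{\rm P}(Y=k\mid{\bf x}={\bf x}_{0})=\frac{\exp({\bf x}_{0}^{T}{\bm{\delta}}^{*}_{k})^{{\rm 1}\kern-2.79999pt{\rm I}(k\neq K)}}{1+\sum_{\ell=1}^{K-1}\exp({\bf x}_{0}^{T}{\bm{\delta}}^{*}_{\ell})}.\tag{7}$$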

Of course, the two formulations are strictly equivalent and from any “initial” vectors of parameters ${\bm{\beta}}^{*}_{1},\ldots,{\bm{\beta}}^{*}_{K}$ satisfying Equation (6), Equation (7) holds with vectors ${\bm{\delta}}^{*}_{1},\ldots,{\bm{\delta}}^{*}_{K-1}$ defined as ${\bm{\delta}}^{*}_{k}={\bm{\beta}}^{*}_{k}-{\bm{\beta}}^{*}_{K}$ , for $k\in[K-1]$ .
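Both the lack of identifiability of the symmetric formulation and its equivalence with the reference-category formulation can be verified numerically. The sketch below, using arbitrary simulated parameters, checks that the class probabilities are unchanged when a common vector $\bm{\nu}$ is added to every ${\bm{\beta}}_{k}$ , and that subtracting ${\bm{\beta}}_{K}$ from every vector gives the same probabilities with a null last vector. The function name `softmax_probs` and the dimensions are illustrative assumptions.

```python
import numpy as np

def softmax_probs(x0, B):
    """Class probabilities under the symmetric formulation; rows of B are beta_1..beta_K."""
    scores = B @ x0
    scores = scores - scores.max()  # numerical stabilization, leaves probabilities unchanged
    w = np.exp(scores)
    return w / w.sum()

rng = np.random.default_rng(2)
K, p = 4, 3  # hypothetical: 3 subtypes plus controls, 3 covariates
B = rng.normal(size=(K, p))
x0 = rng.normal(size=p)
nu = rng.normal(size=p)

probs = softmax_probs(x0, B)
probs_shifted = softmax_probs(x0, B + nu)  # beta_k + nu for every k

# Reference formulation: delta_k = beta_k - beta_K, so the last row is zero.
D = B - B[-1]
probs_ref = softmax_probs(x0, D)

print(np.allclose(probs, probs_shifted), np.allclose(probs, probs_ref))
```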

Vectors ${\bm{\delta}}^{*}_{1},\ldots,{\bm{\delta}}^{*}_{K-1}$ in Equation (7) can be estimated through likelihood maximization. Assume the data consist of $n$ independent and identically distributed replicates $({\bf x}_{i},Y_{i})_{1\leq i\leq n}$ with ${\bf x}_{i}\in{\rm I}\kern-1.4pt{\rm R}^{p}$ and $Y_{i}\in[K]$ . Then, under model (7), the log-likelihood is defined for any $({\bm{\delta}}_{1},\ldots,{\bm{\delta}}_{K-1})\in({\rm I}\kern-1.4pt{\rm R}^{p})^{K-1}$ as

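$$L({\bm{\delta}}_{1},\ldots,{\bm{\delta}}_{K-1})=\frac{1}{n}\sum_{i=1}^{n}\log\{p_{y_{i}}({\bf x}_{i};{\bm{\delta}}_{1},\ldots,{\bm{\delta}}_{K-1},{\bf 0}_{p})\},\tag{8}$$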

where, for any collection of vectors $({\bf u}_{1},\ldots,{\bf u}_{K})\in{\rm I}\kern-1.4pt{\rm R}^{p\times K}$ , we set

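$$p_{\ell}({\bf x}_{0};{\bf u}_{1},\ldots,{\bf u}_{K})=\frac{\exp({\bf x}_{0}^{T}{\bf u}_{\ell})}{\sum_{k=1}^{K}\exp({\bf x}_{0}^{T}{\bf u}_{k})}.\tag{9}$$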

3.2 Sparse estimation under the standard formulation

A first sparse approach, that will be referred to as MultinomSparseRef here, simply consists in maximizing the $L_{1}$ -norm penalized version of the log-likelihood defined in (8)

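$$\max_{{\bm{\delta}}_{1},\ldots,{\bm{\delta}}_{K-1}}\Big[L({\bm{\delta}}_{1},\ldots,{\bm{\delta}}_{K-1})-\lambda\sum_{k=1}^{K-1}\|{\bm{\delta}}_{k}\|_{1}\Big].\tag{10}$$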

Maximizers of (10) can be obtained via the algorithm described in [Krishnapuram et al., 2005]. Thanks to the well-known link between the log-likelihood $L$ and the conditional logistic log-likelihood [Hendrickx et al., 2000], they can also be obtained as solutions returned by package cLogitLasso [Avalos et al., 2015] after a simple modification of the original data.

3.3 Sparse estimation under the symmetric formulation

Package glmnet in R [Friedman et al., 2010] implements an $L_{1}$ -penalized approach based on the symmetric formulation of the model, which will be referred to as MultinomSparseSym here. Parameter vectors ${\bm{\beta}}^{*}_{1},\ldots,{\bm{\beta}}^{*}_{K}$ used under formulation (6) cannot be estimated by standard maximum likelihood estimation because of the aforementioned lack of identifiability. But because penalizing acts as constraining, estimates of ${\bm{\beta}}^{*}_{1},\ldots,{\bm{\beta}}^{*}_{K}$ can be obtained as maximizers of the $L_{1}$ -penalized version of the following log-likelihood

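$${\mathcal{L}}({\bm{\beta}}_{1},\ldots,{\bm{\beta}}_{K})=\frac{1}{n}\sum_{i=1}^{n}\log\{p_{y_{i}}({\bf x}_{i};{\bm{\beta}}_{1},\ldots,{\bm{\beta}}_{K})\}.$$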

More precisely, package glmnet maximizes the following criterion over $({\bm{\beta}}_{1},\ldots,{\bm{\beta}}_{K})\in({\rm I}\kern-1.4pt{\rm R}^{p})^{K}$ ,

$${\mathcal{L}}({\bm{\beta}}_{1},\ldots,{\bm{\beta}}_{K})-\lambda\sum_{k=1}^{K}\|{\bm{\beta}}_{k}\|_{1},$$

for some appropriate value of the regularization parameter $\lambda$ . In [Friedman et al., 2010], it is shown that maximizers $\widehat{\bm{\beta}}_{1},\ldots,\widehat{\bm{\beta}}_{K}$ of this criterion are such that

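$${\rm median}(\widehat{\beta}_{1,j},\ldots,\widehat{\beta}_{K,j})=0,\quad\text{for all }j\in[p].\tag{11}$$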

See the Appendix for an alternative proof of this result. Equation (11) establishes that penalizing by the $L_{1}$ -norm under the symmetric formulation of the model implicitly solves the lack of identifiability for each covariate by constraining the median of its parameters across the $K$ categories to be null.

We shall recall that when intercepts are considered, as is often the case in practice, they are generally not penalized. Setting ${\bm{\beta}}_{k}^{T}=(\beta_{k,0},{\bm{\beta}}_{k,\setminus 0}^{T})$ where $\beta_{k,0}$ stands for the intercept term for the $k$ -th category, the penalty term then becomes $\lambda\sum_{k=1}^{K}\|{\bm{\beta}}_{k,\setminus 0}\|_{1}$ . Then, identifiability issues are still present for the intercept terms under the symmetric formulation of the model. In the glmnet package, this is resolved by mean centering, which corresponds to imposing the constraint $\sum_{k\in[K]}\hat{\beta}_{k,0}=0$ [Friedman et al., 2010].

3.4 Relationship between MultinomSparseSym and MultinomSparseRef

Consider the standard formulation (7). When $Y=K$ stands for controls and $Y=k$ stands for cases of subtype $k$ for $k\in[K-1]$ , some level of homogeneity among vectors ${\bm{\delta}}^{*}_{1},\ldots,{\bm{\delta}}^{*}_{K-1}$ is often expected. The ideas of data shared lasso can then be applied. When combined with MultinomSparseRef, data shared lasso first consists in considering the decomposition $\delta^{*}_{k,j}=\mu^{*}_{j}+\gamma^{*}_{k,j}$ for $k\in[K-1]$ and $j\in[p]$ , and then maximizing the following criterion, over $\bm{\mu}=(\mu_{1},\ldots,\mu_{p})$ and the ${\bm{\gamma}}_{k}$ ’s, with ${\bm{\gamma}}_{k}=(\gamma_{k,1},\ldots,\gamma_{k,p})$ ,

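$$L(\bm{\mu}+{\bm{\gamma}}_{1},\ldots,\bm{\mu}+{\bm{\gamma}}_{K-1})-\lambda\Big(\|\bm{\mu}\|_{1}+\sum_{k=1}^{K-1}\|{\bm{\gamma}}_{k}\|_{1}\Big).$$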

We will refer to this approach as MultinomDataSharedRef hereafter. This criterion is of the same form as (5), for the particular choice $\tau_{k}=1$ for all $k\in[K-1]$ . Now, denote by $(\widehat{\bm{\beta}}_{1},\ldots,\widehat{\bm{\beta}}_{K})$ and $(\widehat{\bm{\mu}},\widehat{\bm{\gamma}}_{1},\ldots,\widehat{\bm{\gamma}}_{K-1})$ the solutions returned by MultinomSparseSym and MultinomDataSharedRef, respectively. In the Appendix we show that $\widehat{\bm{\mu}}=-\widehat{\bm{\beta}}_{K}$ and $\widehat{\bm{\beta}}_{k}=\widehat{\bm{\gamma}}_{k}$ for all $k\in[K-1]$ . This result formally establishes the equivalence between MultinomDataSharedRef and MultinomSparseSym: working under formulation (6) with an $L_{1}$ -norm penalty, as implemented in the glmnet package, exactly corresponds to working under formulation (7) with a data shared lasso penalty (for the particular choice $\tau_{k}=1$ for all $k\in[K-1]$ ) to encourage homogeneity among vectors $({\bm{\delta}}^{*}_{1},\ldots,{\bm{\delta}}^{*}_{K-1})$ .
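The change of variables behind this equivalence can be illustrated numerically. The sketch below assumes that formulation (6) is the softmax parameterization and that formulation (7) is the baseline-category logit with category $K$ as reference; the helper names are hypothetical. It maps symmetric coefficients to $(\bm{\mu},{\bm{\gamma}}_{1},\ldots,{\bm{\gamma}}_{K-1})$ and compares both the class probabilities and the two penalties.

```python
import numpy as np

def sym_to_data_shared(B):
    """Map symmetric coefficients B (p x K) to (mu, Gamma)."""
    mu = -B[:, -1]       # mu = -beta_K
    Gamma = B[:, :-1]    # gamma_k = beta_k for k = 1, ..., K-1
    return mu, Gamma

def probs_symmetric(X, B):
    s = X @ B
    e = np.exp(s - s.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def probs_reference(X, D):
    """Category K has linear predictor 0; D is p x (K-1)."""
    s = np.hstack([X @ D, np.zeros((X.shape[0], 1))])
    e = np.exp(s - s.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 3))
B = rng.normal(size=(3, 4))            # p = 3 covariates, K = 4 categories
mu, Gamma = sym_to_data_shared(B)
D = mu[:, None] + Gamma                # delta_k = mu + gamma_k = beta_k - beta_K

penalty_sym = np.abs(B).sum()                             # sum_k ||beta_k||_1
penalty_shared = np.abs(mu).sum() + np.abs(Gamma).sum()   # ||mu||_1 + sum_k ||gamma_k||_1
```

Both the likelihood term and the penalty term agree under this mapping, so the two criteria share the same maximizers up to the reparameterization.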

To get a better understanding of the relationship between MultinomSparseSym and MultinomSparseRef, denote by $(\tilde{\bm{\beta}}_{1},\ldots,\tilde{\bm{\beta}}_{K})$ maximizers of the criterion ${\mathcal{L}}({\bm{\beta}}_{1},\ldots,{\bm{\beta}}_{K})-\lambda(\|{\bm{\beta}}_{K}\|_{1}+\sum_{k\in[K-1]}\|{\bm{\beta}}_{k}-{\bm{\beta}}_{K}\|_{1})$ . In the Appendix, we show that $\tilde{\bm{\beta}}_{K}={\bf 0}_{p}$ and $\tilde{\bm{\beta}}_{k}=\widehat{\bm{\delta}}_{k}$ for $k\in[K-1]$ , where $\widehat{\bm{\delta}}_{k}$ are estimates returned by MultinomSparseRef, that is maximizers of $L({\bm{\delta}}_{1},\ldots,{\bm{\delta}}_{K-1})-\lambda\sum_{k=1}^{K-1}\|{\bm{\delta}}_{k}\|_{1}$ . Therefore, applying MultinomSparseRef after selecting the $K$ -th category as the reference corresponds to working under the symmetric formulation and encouraging similarities between $\widehat{\bm{\beta}}_{K}$ and the other vectors $\widehat{\bm{\beta}}_{k}$ for $k\in[K-1]$ . This strategy is expected to perform best if ${\cal C}(K)=\sum_{k\in[K]}\|{\bm{\beta}}^{*}_{k}-{\bm{\beta}}^{*}_{K}\|_{0}$ is small. In other words, while the choice of the reference category has no effect whatsoever when estimation is done by maximizing the unpenalized log-likelihood (8), this choice is critical for MultinomSparseRef, that is when the $L_{1}$ -penalized log-likelihood (10) is maximized. This is closely related to our discussion about the performance of the “Ref” strategy described in Section 2.3 under matched designs (and more generally under stratified regression models), which also critically depends on the arbitrary choice for the reference stratum. For illustration, consider the following toy example where $(i)$ ${\bm{\beta}}^{*}_{1}=\cdots={\bm{\beta}}^{*}_{K-1}$ , and $(ii)$ $\beta^{*}_{K,j}\neq\beta^{*}_{1,j}$ for all $j\in[p]$ . When $Y=K$ indicates controls while $Y=k$ for $k\in[K-1]$ indicates cases of subtype $k$ , this situation arises when all subtypes are actually identical. 
Then we have ${\cal C}(K)=(K-1)p$ while ${\cal C}(r)=p$ for any $r<K$ , with ${\cal C}(r)=\sum_{k\in[K]}\|{\bm{\beta}}^{*}_{k}-{\bm{\beta}}^{*}_{r}\|_{0}$ standing for the model complexity when setting the reference category to $r$ before applying MultinomSparseRef. In this example, category $K$ is the worst choice for the reference when using MultinomSparseRef, even if it would be regarded as the most natural choice by many practitioners.
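The toy example can be reproduced in a few lines. This is a minimal sketch with arbitrary illustrative values ($p=5$, $K=4$) and a hypothetical helper `complexity`; category indices are 0-based in the code, so the controls (category $K$) sit in the last column.

```python
import numpy as np

def complexity(B, r):
    """C(r) = sum_k ||beta_k - beta_r||_0 for B of shape (p, K); r is a 0-based column."""
    return int(np.count_nonzero(B - B[:, [r]]))

p, K = 5, 4
beta_cases = np.arange(1.0, p + 1)     # shared by the K-1 case categories
beta_controls = beta_cases + 10.0      # differs from the cases in every coordinate
B = np.column_stack([beta_cases] * (K - 1) + [beta_controls])

c_controls = complexity(B, K - 1)      # reference = category K (controls)
c_case = complexity(B, 0)              # reference = first case subtype
```

Here `c_controls` equals $(K-1)p=15$ while `c_case` equals $p=5$, matching the counts given in the text.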

4 Simulation study

4.1 The matched setting

We performed a simulation study to assess the performance of data shared lasso in the context of matched case-control studies when $(K-1)=6$ subtypes of cases are present. We compared it with two more standard strategies: Indep and Ref. For the latter, the first subtype was selected as the reference subtype. In addition, we implemented a cross-validation technique similar in spirit to the one-step lasso [Bühlmann and Meier, 2008] to select optimal regularization parameters and obtain final parameter estimates. To save computational time, data shared lasso and Ref were implemented with one particular choice for $\tau_{k}$ only, that is $\tau_{k}=1$ for all $k\in[K-1]$.

We set the number of observations to $n=1000$ and the number of covariates to $p=100$. Covariates were randomly generated under a multivariate Gaussian distribution ${\cal N}({\bf 0}_{p},{\bm{\Sigma}})$, where ${\bm{\Sigma}}_{i,j}=(0.25^{2})\times 0.3^{|i-j|}$. Pairs of observations were then created and randomly assigned to one stratum ${\cal M}_{k}$ in such a way that $m_{1}=200$, $m_{2}=100$ and $m_{k}=50$ for $k=3,\ldots,6$. Within each pair $\ell$ of each stratum ${\cal M}_{k}$, the response variable $Y^{(k)}_{\ell,1}$ was then generated according to Equation (1), while $Y^{(k)}_{\ell,2}$ was set to $1-Y^{(k)}_{\ell,1}$.

Parameters $\delta^{*}_{k,j}$ were defined as follows. One subset $J_{1}\subset[p]$ was first randomly selected, with $|J_{1}|=10$. For $j\notin J_{1}$, we set $\delta^{*}_{k,j}=0$ for all $k\in[K-1]$. For $j\in J_{1}$, four configurations were considered, allowing the level of homogeneity among $(\delta^{*}_{1,j},\ldots,\delta^{*}_{K-1,j})$ to vary. In the first configuration (full homogeneity), we set $\delta^{*}_{k,j}=\iota_{j}\delta$, for some $\delta>0$ and with $\iota_{j}=\pm 1$. In the second configuration (weak heterogeneity), for $j\in J_{1}$, we randomly select one $k_{j}\in[K-1]$, set $\delta^{*}_{k,j}=\iota_{k,j}\delta$ for $k\neq k_{j}$ and $\delta^{*}_{k_{j},j}=\iota_{k_{j},j}\delta*(1+U_{k_{j},j})$, with each $\iota_{k,j}=\pm 1$ and $U_{k_{j},j}\sim{\cal U}_{[\sqrt{K}/2,2\sqrt{K}]}$. In the third configuration (moderate heterogeneity), we randomly select three indices $(k_{j,1},k_{j,2},k_{j,3})\in[K-1]^{3}$, set $\delta^{*}_{k,j}=\iota_{j}\delta$ for $k\notin\{k_{j,1},k_{j,2},k_{j,3}\}$ and $\delta^{*}_{k,j}=\iota_{k,j}\delta*(1+U_{k,j})$ for $k\in\{k_{j,1},k_{j,2},k_{j,3}\}$, with again $\iota_{k,j}=\pm 1$ and $U_{k,j}\sim{\cal U}_{[\sqrt{K}/2,2\sqrt{K}]}$. Finally, in the fourth configuration (full heterogeneity), we set $\delta^{*}_{k,j}=\iota_{k,j}\delta*(1+U_{k,j})$ for $k\in[K-1]$ with again $\iota_{k,j}=\pm 1$ and $U_{k,j}\sim{\cal U}_{[\sqrt{K}/2,2\sqrt{K}]}$. In each configuration, parameter $\delta$ varied in $\{0.4,1.0,2.0,3.0\}$ to study the impact of signal strength on the performance of the approaches.
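Part of this design can be sketched in code. The example below is illustrative only: it covers the covariates, the stratum sizes and configurations 1 and 4, and it does not generate the responses. The function names, the seed, and the assumption that signs $\iota$ are drawn uniformly from $\{-1,+1\}$ are hypothetical choices, not details confirmed by the text.

```python
import numpy as np

def make_sigma(p, scale=0.25, rho=0.3):
    """Sigma[i, j] = scale**2 * rho**|i - j|."""
    idx = np.arange(p)
    return scale**2 * rho ** np.abs(idx[:, None] - idx[None, :])

def make_delta(p, n_subtypes, delta, config, rng, n_active=10):
    """Return the p x (K-1) matrix of delta*_{k,j} for configuration 1 or 4."""
    K = n_subtypes + 1
    D = np.zeros((p, n_subtypes))
    J1 = rng.choice(p, size=n_active, replace=False)   # active covariates
    if config == 1:    # full homogeneity: one signed value shared by all subtypes
        signs = rng.choice([-1.0, 1.0], size=n_active)  # assumed uniform signs
        D[J1, :] = (signs * delta)[:, None]
    elif config == 4:  # full heterogeneity: sign and inflation drawn per (k, j)
        signs = rng.choice([-1.0, 1.0], size=(n_active, n_subtypes))
        U = rng.uniform(np.sqrt(K) / 2, 2 * np.sqrt(K), size=(n_active, n_subtypes))
        D[J1, :] = signs * delta * (1.0 + U)
    else:
        raise ValueError("only configurations 1 and 4 are sketched here")
    return D

rng = np.random.default_rng(2019)
n, p, n_subtypes = 1000, 100, 6
Sigma = make_sigma(p)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
pair_counts = [200, 100, 50, 50, 50, 50]               # m_1, ..., m_6 (500 pairs)
stratum_of_pair = rng.permutation(np.repeat(np.arange(n_subtypes), pair_counts))
D_homog = make_delta(p, n_subtypes, delta=1.0, config=1, rng=rng)
```

Configurations 2 and 3 would follow the same pattern, with the inflation factor $(1+U_{k,j})$ applied only to one or three randomly chosen subtypes per active covariate.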

One simulation design here corresponds to one particular combination of the value for $\delta$ and the level of heterogeneity. Fifty replications of each simulation design were performed and results presented below correspond to averages of the considered criteria over these 50 replicates for each approach.

Figure 1 presents the results regarding support recovery of the parameter matrix ${\bf D}^{*}=({\bm{\delta}}^{*}_{1},\ldots,{\bm{\delta}}^{*}_{K-1})\in{\rm I}\kern-1.4pt{\rm R}^{p\times(K-1)}$ (AccS; the higher, the better), the identification of heterogeneities among vectors $(\delta^{*}_{1,j},\ldots,\delta^{*}_{K-1,j})$, for $j\in[p]$ (AccH; the higher, the better), as well as prediction error (Pred.Err; the lower, the better). Overall, the performance of Indep does not depend on the level of homogeneity, while that of DataShared and Ref typically improves with the homogeneity level. This was expected since Indep does not account for homogeneity, while DataShared and Ref do. In case of full homogeneity among vectors ${\bm{\delta}}_{1}^{*},\ldots,{\bm{\delta}}_{K-1}^{*}$ (Configuration 1), Ref and DataShared perform similarly regarding the three criteria: they perform as well as Pooled and clearly outperform Indep. The similar performance of Ref and DataShared was expected in this particular case where model complexity $C(r)$ defined in Section 2.3 does not depend on $r$. In case of full heterogeneity (Configuration 4), DataShared and Ref again perform similarly, as expected since $C(r)$ still does not depend on $r$. Of course, they do not perform better than Indep in this case, but it is noteworthy that they do not perform worse either. In configurations 2 and 3 (weak and moderate heterogeneities), data shared lasso generally leads to the best results regarding prediction error and, to a lesser extent, support recovery and identification of the heterogeneities. In particular, it outperforms Ref, which confirms that, by bypassing the arbitrary choice of the reference category, data shared lasso generally better accounts for homogeneity than Ref does when such homogeneity exists. These results are consistent with those obtained when evaluating data shared lasso under linear regression models [Ollier and Viallon, 2017] and binary graphical models [Ballout and Viallon, 2017].
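A support-recovery criterion of this kind can be sketched as follows. The exact definition of AccS is not restated in this section, so the version below is only one plausible reading, namely the share of entries of ${\bf D}^{*}$ whose zero or non-zero status is correctly recovered; the function name and the toy matrices are hypothetical.

```python
import numpy as np

def support_accuracy(D_hat, D_true, tol=0.0):
    """Share of entries whose zero / non-zero status agrees (assumed reading of AccS)."""
    est_support = np.abs(D_hat) > tol
    true_support = np.abs(D_true) > 0
    return float(np.mean(est_support == true_support))

# Toy matrices with p = 3 covariates and K - 1 = 2 subtypes.
D_true = np.array([[1.0, 1.0], [0.0, 0.0], [0.0, 2.0]])
D_hat = np.array([[0.8, 0.0], [0.0, 0.1], [0.0, 1.5]])
acc = support_accuracy(D_hat, D_true)   # 4 of the 6 entries have the right status
```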

4.2 The unmatched setting

We also performed a simulation study in the unmatched setting to illustrate the relative merits of MultinomSparseRef and MultinomSparseSym (the latter being the same as MultinomDataSharedRef) depending on the level of homogeneity among vectors ${\bm{\delta}}^{*}_{1},\ldots,{\bm{\delta}}^{*}_{K-1}$ under formulation (7) or, equivalently, among vectors ${\bm{\beta}}^{*}_{1},\ldots,{\bm{\beta}}^{*}_{K}$ under formulation (6). Again, we chose $K=7$. To save computational time, and because conclusions were consistent with those drawn in the matched case, a low-dimensional setting with $n=1000$ and $p=20$ was considered here. For data generation, we adapted the framework described in Section 4.1 to the unmatched setting using formulation (7). We used intercept terms, $(\delta_{0,1},\ldots,\delta_{0,K-1})$, chosen in such a way that ${\rm I}\kern-1.19995pt{\rm P}(Y=K)=0.5$ and ${\rm I}\kern-1.19995pt{\rm P}(Y=k)$ ranged from 0.05 to 0.2 for $k\in[K-1]$. In this low-dimensional setting, regularization parameters were selected as minimizers of the BIC after adapting the Lasso-OLS hybrid ideas to our context [Efron et al., 2004].
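Label generation under a reference-category model can be sketched as below, assuming formulation (7) is the usual baseline-category logit with category $K$ (controls) as reference. The target marginals, the helper names and the intercept rule $\delta_{0,k}=\log(\pi_{k}/\pi_{K})$ are hypothetical; that rule reproduces the targets exactly only when the covariate effects vanish, so with non-zero coefficients the intercepts would need to be calibrated.

```python
import numpy as np

def reference_probs(X, D, intercepts):
    """P(Y = k | x), k = 1..K, with category K as reference (linear predictor 0)."""
    eta = intercepts[None, :] + X @ D                    # n x (K-1)
    eta = np.hstack([eta, np.zeros((X.shape[0], 1))])
    eta -= eta.max(axis=1, keepdims=True)                # numerical stability
    expo = np.exp(eta)
    return expo / expo.sum(axis=1, keepdims=True)

def sample_labels(probs, rng):
    """Draw one label in {1, ..., K} per row by inverting the cumulative probabilities."""
    cum = probs.cumsum(axis=1)
    u = rng.uniform(size=(probs.shape[0], 1))
    return np.minimum((u > cum).sum(axis=1), probs.shape[1] - 1) + 1

rng = np.random.default_rng(7)
n, p, K = 1000, 20, 7
target = np.array([0.2, 0.1, 0.05, 0.05, 0.05, 0.05, 0.5])  # hypothetical; last = controls
intercepts = np.log(target[:-1] / target[-1])                # exact only without covariate effects
D = np.zeros((p, K - 1))                                     # placeholder coefficients
X = rng.normal(scale=0.25, size=(n, p))
probs = reference_probs(X, D, intercepts)
y = sample_labels(probs, rng)
```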

Figure 2 presents the results in this unmatched setting. They confirm that using data shared lasso (or, equivalently, the symmetric formulation) allows the homogeneity to be accounted for when present, which translates into better predictive performance, support recovery and identification of the heterogeneities. We also stress that even in the case of full heterogeneity, MultinomSparseSym performs as well as MultinomSparseRef, just as data shared lasso and Ref did in the matched setting.

We further investigated in more detail the poor performance of MultinomSparseRef. We focused on the particular case of full homogeneity among vectors ${\bm{\delta}}^{*}_{1},\ldots,{\bm{\delta}}^{*}_{K-1}$ under formulation (7). For one sample generated under configuration one (full homogeneity) with $\delta=3$ (corresponding to a large signal strength), we computed criteria AccS and AccH for the sequence of parameter vector estimates returned by MultinomSparseRef and MultinomSparseSym for varying values of the regularization parameter $\lambda$ on appropriate grids $[\lambda_{1}/1000,\lambda_{1}]$. Here $\lambda_{1}$ was set, as usual, as the minimal value for which the considered method returned a null parameter vector. MultinomSparseRef was actually run with two particular choices for the reference category. We primarily chose category $K$ as in Figure 2. We recall that this choice is quite natural when category $K$ corresponds to controls. We also recall that in this case of full homogeneity among vectors ${\bm{\delta}}^{*}_{1},\ldots,{\bm{\delta}}^{*}_{K-1}$, we have ${\cal C}(K)=10(K-1)=60$ while ${\cal C}(r)=10$ for any $r\neq K$. We then also implemented MultinomSparseRef with reference category set to $1$. Results returned by these two versions of MultinomSparseRef were compared to those returned by MultinomSparseSym (or equivalently, MultinomDataSharedRef). In each panel of Figure 3, each point represents values for AccS ($x$-axis) and AccH ($y$-axis) over the grid of regularization parameters used for the corresponding method. The choice of controls as the reference category (left panel, Ref $=K$), though standard, prevents MultinomSparseRef from visiting models with AccS greater than 0.75 whatever the value of the regularization parameter. On the other hand, choosing any subtype of cases as the reference (center panel, Ref $=1$ here) allows MultinomSparseRef to visit models with higher values for both AccS and AccH. 
Models visited by MultinomSparseSym are very similar to those visited by MultinomSparseRef with the optimal reference category. These results confirm that $(i)$ the performance of MultinomSparseRef critically depends on the arbitrary choice of the reference category when homogeneity is high, and $(ii)$ MultinomSparseSym (resp., equivalently, MultinomSparseRef with a data shared lasso penalty) by-passes (resp. corrects) the arbitrary choice of the reference category, and allows the visit of nearly the same models as those visited when applying MultinomSparseRef with the optimal choice for the reference category.
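A regularization grid on $[\lambda_{1}/1000,\lambda_{1}]$ can be built as in the sketch below. The log spacing and the number of grid points are assumptions, since the text only specifies the endpoints; the function name and the value of `lambda_max` are hypothetical.

```python
import numpy as np

def lambda_grid(lambda_max, n_values=100, ratio=1e-3):
    """Decreasing, log-spaced grid from lambda_max down to ratio * lambda_max."""
    return np.geomspace(lambda_max, ratio * lambda_max, num=n_values)

# lambda_max plays the role of lambda_1, the smallest value giving a null estimate.
grid = lambda_grid(lambda_max=2.0, n_values=50)
```

Each method would then be fitted once per grid value, with AccS and AccH computed for every resulting estimate.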

5 Application

5.1 The data

The European Prospective Investigation into Cancer and Nutrition (EPIC) study is an ongoing multicenter prospective study investigating the etiology of cancer in relation to diet, lifestyle and environmental factors; its design has been previously described in detail [Riboli et al., 2002]. From 1992 to 2000, a total of 521,324 participants were recruited across 10 European countries, mostly from the general population, of which 70% are women, aged from 35 to 70 years. Among these participants, 246,000 women provided a blood sample at inclusion. Here, we present preliminary results from the analysis of a case-control study nested in EPIC, whose main objective was to assess the association between metabolites and the risk of subtypes of breast cancer. 1635 cases of breast cancer were included, along with 1635 matched controls (using incidence density sampling). For all these individuals, plasma samples collected at inclusion in the study were analyzed by mass spectrometry (AbsoluteIDQ p180 Kit), allowing the measurement of the levels of 127 metabolites. Those metabolites have been anonymized here since biological interpretation is out of the scope of this preliminary analysis. We considered six subtypes for cases, based on the presence/absence of hormone receptors: HER2-enriched, triple negative, Luminal A PR+, Luminal A PR-, Luminal B PR+ and Luminal B PR-.

5.2 Results

We estimated sparse conditional logistic regression models based on the Indep, Pooled, Ref and data shared lasso strategies described in Section 2. For the Ref strategy, Luminal A PR+ was chosen as the reference subtype, which we believe would be considered a natural choice by most practitioners because it is the most common subtype. Results are presented in Figure 4, where only metabolites identified as potential predictors of at least one breast cancer subtype by at least one approach have been retained. As expected, using either Ref or data shared lasso leads to much more interpretable results than the Indep and Pooled strategies when the objective is to identify potential heterogeneities across subtypes. Data shared lasso allows the identification of a few heterogeneities, in particular for the most common subtype, Luminal A PR+. Interestingly, Ref was not able to identify any heterogeneities for this subtype: this is because it was used as the reference subtype. We note, however, that no notable difference was observed in terms of prediction errors when comparing the models returned by Pooled, Ref and data shared lasso (Indep was slightly worse than its competitors). This can be explained by the fact that the association between the metabolites and subtypes of breast cancer is rather limited. We still believe that this application nicely illustrates the potential benefit of the data shared lasso strategy, which may help rank the most probable heterogeneities between subtypes: in the present example, M96 might be of particular interest for Luminal A PR-, while M18, M27, M42, M43, M63 and M111 might be specific to Luminal A PR+.

6 Discussion

We considered the analysis of case-control studies when several subtypes of cases exist, which is increasingly common in cancer epidemiology. Considering both matched and unmatched settings, we showed that data shared lasso is a simple approach that accounts for commonalities among the subtypes, when present, and improves estimation efficiency. In the unmatched setting, our observations provide practical guidance on how to choose between the two formulations of sparse multinomial logistic regression models, MultinomSparseSym and MultinomSparseRef. If a high level of homogeneity exists among vectors ${\bm{\beta}}^{*}_{1},\ldots,{\bm{\beta}}^{*}_{K}$ (or ${\bm{\delta}}^{*}_{1},\ldots,{\bm{\delta}}^{*}_{K-1}$), then estimation efficiency is expected to be much higher when working with MultinomSparseSym (or equivalently MultinomDataSharedRef).

The estimation of several parameter vectors considered here is closely related to multi-task learning [Evgeniou and Pontil, 2004], for which a number of other structured sparsity-inducing norms have been proposed in the literature, including the group lasso and generalized fused lasso [Lounici et al., 2011, Viallon et al., 2016]. However, the group lasso is not well suited when the identification of heterogeneities is of primary interest. On the other hand, the generalized fused lasso has shown good properties in the context of stratified regression models, under generalized linear models [Viallon et al., 2016], survival models [Sennhenn-Reulen and Kneib, 2016] and binary graphical models [Ballout and Viallon, 2017]. Its extension to conditional logistic regression models or multinomial logistic models will be the focus of future work.

Acknowledgements

This work was partially supported by the French National Cancer Institute (L’Institut National du Cancer; INCA) (grant number 2015-166; PI: S. Rinaldi). The authors are grateful to the Principal Investigators of each of the EPIC centres for sharing the data we used in our illustrative example.

Appendix A Additional technical details

A.1 Proof of (11)

For any $\bm{\nu}\in{\rm I}\kern-1.4pt{\rm R}^{p}$, maximizers $\widehat{\bm{\beta}}_{1},\ldots,\widehat{\bm{\beta}}_{K}$ of the penalized criterion maximized in the glmnet package are such that:

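$${\mathcal{L}}(\widehat{\bm{\beta}}_{1},\ldots,\widehat{\bm{\beta}}_{K})-\lambda\sum_{k=1}^{K}\|\widehat{\bm{\beta}}_{k}\|_{1}\geq{\mathcal{L}}(\widehat{\bm{\beta}}_{1}-\bm{\nu},\ldots,\widehat{\bm{\beta}}_{K}-\bm{\nu})-\lambda\sum_{k=1}^{K}\|\widehat{\bm{\beta}}_{k}-\bm{\nu}\|_{1}={\mathcal{L}}(\widehat{\bm{\beta}}_{1},\ldots,\widehat{\bm{\beta}}_{K})-\lambda\sum_{k=1}^{K}\|\widehat{\bm{\beta}}_{k}-\bm{\nu}\|_{1}.$$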

Therefore, $\sum_{k=1}^{K}\|\widehat{\bm{\beta}}_{k}\|_{1}\leq\sum_{k=1}^{K}\|\widehat{\bm{\beta}}_{k}-\bm{\nu}\|_{1}$ for all $\bm{\nu}\in{\rm I}\kern-1.4pt{\rm R}^{p}$ which establishes (11).

A.2 Equivalence between MultinomDataSharedRef and MultinomSparseSym

With the particular choice $\tau_{k}=1$ for all $k\in[K-1]$ , MultinomSparseSym consists in maximizing the criterion ${\mathcal{L}}({\bm{\beta}}_{1},\ldots,{\bm{\beta}}_{K})-\lambda\sum_{k=1}^{K}\|{\bm{\beta}}_{k}\|_{1}$ . For any given $({\bm{\beta}}_{1},\ldots,{\bm{\beta}}_{K})$ , set $\bm{\mu}=-{\bm{\beta}}_{K}$ and ${\bm{\gamma}}_{k}={\bm{\beta}}_{k}$ for all $k\in[K-1]$ . Then we have

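$${\mathcal{L}}({\bm{\beta}}_{1},\ldots,{\bm{\beta}}_{K})-\lambda\sum_{k=1}^{K}\|{\bm{\beta}}_{k}\|_{1}=L(\bm{\mu}+{\bm{\gamma}}_{1},\ldots,\bm{\mu}+{\bm{\gamma}}_{K-1})-\lambda\Big(\|\bm{\mu}\|_{1}+\sum_{k=1}^{K-1}\|{\bm{\gamma}}_{k}\|_{1}\Big),$$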

which is exactly the criterion maximized by MultinomDataSharedRef for the particular choice $\tau_{k}=1$ for all $k\in[K-1]$.

A.3 Matrix formulation of the log-likelihood (8)

Denote the indicator function by ${\rm 1}\kern-2.79999pt{\rm I}(\cdot)$ . For $k\in[K-1]$ , introduce ${\cal Y}^{(k)}\in{\rm I}\kern-1.4pt{\rm R}^{n}$ with ${\cal Y}^{(k)}_{i}={\rm 1}\kern-2.79999pt{\rm I}(Y_{i}=k)$ , for all $i\in[n]$ . Further introduce the vector of binary variables ${\cal Y}\in{\rm I}\kern-1.4pt{\rm R}^{n(K-1)}$ and the matrix ${\bm{\mathcal{X}}}\in{\rm I}\kern-1.4pt{\rm R}^{n(K-1)\times(K-1)p}$ defined as

[TABLE]

where ${\bf X}$ is the $n\times p$ matrix containing the $n$ observations $({\bf x}_{i})_{1\leq i\leq n}$ of the $p$ predictors. Finally, let ${\bm{1}}_{n}$ denote the vector of length $n$ whose components are all equal to 1, and ${\bf J}=({\bf I}_{n},\ldots,{\bf I}_{n})$ the $n\times n(K-1)$ matrix made of $(K-1)$ blocks, each equal to the identity matrix of order $n$, ${\bf I}_{n}$. Then, setting ${\bm{\delta}}=({\bm{\delta}}_{1}^{T},\ldots,{\bm{\delta}}_{K-1}^{T})^{T}$, the log-likelihood (8) can be rewritten more compactly as

[TABLE]
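A compact form of this kind can be checked numerically. The sketch below rests on assumptions that are consistent with the stated dimensions but are not confirmed here: ${\cal Y}$ stacks the indicator vectors ${\cal Y}^{(1)},\ldots,{\cal Y}^{(K-1)}$, ${\bm{\mathcal{X}}}$ is block diagonal with $(K-1)$ copies of ${\bf X}$, and the log-likelihood is unnormalized and written as ${\cal Y}^{T}{\bm{\mathcal{X}}}{\bm{\delta}}-{\bm{1}}_{n}^{T}\log({\bm{1}}_{n}+{\bf J}\exp({\bm{\mathcal{X}}}{\bm{\delta}}))$. Function names are hypothetical.

```python
import numpy as np

def loglik_direct(X, y, D):
    """Sum of log P(Y_i = y_i | x_i); y in {1, ..., K}, category K is the reference."""
    n = X.shape[0]
    eta = np.hstack([X @ D, np.zeros((n, 1))])     # reference predictor is 0
    log_norm = np.log(np.exp(eta).sum(axis=1))
    return float((eta[np.arange(n), y - 1] - log_norm).sum())

def loglik_matrix(X, y, D):
    """Same quantity via stacked indicators and a block-diagonal design (assumed form)."""
    n, K1 = X.shape[0], D.shape[1]
    Y_stack = np.concatenate([(y == k).astype(float) for k in range(1, K1 + 1)])
    X_block = np.kron(np.eye(K1), X)               # n(K-1) x (K-1)p, block diagonal
    J = np.tile(np.eye(n), (1, K1))                # n x n(K-1)
    delta = D.T.reshape(-1)                        # (delta_1^T, ..., delta_{K-1}^T)^T
    lin = X_block @ delta
    return float(Y_stack @ lin - np.ones(n) @ np.log(1.0 + J @ np.exp(lin)))

rng = np.random.default_rng(3)
n, p, K = 8, 3, 4
X = rng.normal(size=(n, p))
D = rng.normal(size=(p, K - 1))
y = rng.integers(1, K + 1, size=n)                 # labels in {1, ..., K}
ll_direct = loglik_direct(X, y, D)
ll_matrix = loglik_matrix(X, y, D)
```

Under these assumptions the two values coincide, which makes the stacked form convenient for reusing binary logistic regression solvers with an added normalization term.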

Bibliography


[Avalos et al., 2015] Avalos, M., Pouyes, H., Grandvalet, Y., Orriols, L., and Lagarde, E. (2015). Sparse conditional logistic regression for analyzing large-scale matched data from epidemiological studies: a simple algorithm. BMC Bioinformatics, 16(6):S1.
[Bach et al., 2010] Bach, F. et al. (2010). Self-concordant analysis for logistic regression. Electronic Journal of Statistics, 4:384–414.
[Ballout and Viallon, 2017] Ballout, N. and Viallon, V. (2017). Structure estimation of binary graphical models on stratified data: application to the description of injury tables for victims of road accidents. arXiv preprint arXiv:1709.10298.
[Begg and Gray, 1984] Begg, C. B. and Gray, R. (1984). Calculation of polychotomous logistic regression parameters using individualized regressions. Biometrika, 71(1):11–18.
[Bickel et al., 2009] Bickel, P. J., Ritov, Y., and Tsybakov, A. B. (2009). Simultaneous analysis of lasso and Dantzig selector. The Annals of Statistics, pages 1705–1732.
[Bühlmann and Meier, 2008] Bühlmann, P. and Meier, L. (2008). Discussion of “One-step sparse estimates in nonconcave penalized likelihood models” by H. Zou and R. Li. Ann. Statist., 36:1534–1541.
[Efron et al., 2004] Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least angle regression (with discussion). The Annals of Statistics, 32:407–499.
[Evgeniou and Pontil, 2004] Evgeniou, T. and Pontil, M. (2004). Regularized multi-task learning. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 109–117. ACM.
