Sparse estimation for case-control studies with multiple subtypes of cases
Nadim Ballout, Cedric Garcia, Vivian Viallon

TL;DR
This paper compares methods for analyzing case-control studies with multiple subtypes, proposing adaptations of stratified logistic regression and data shared lasso, and demonstrating their effectiveness through simulations and real data.
Contribution
It introduces a novel approach combining stratified conditional logistic regression with data shared lasso for subtype analysis in case-control studies.
Findings
Data shared lasso improves estimation when subtypes are homogeneous.
Symmetric multinomial logistic regression reduces to data shared lasso under certain conditions.
Proper modeling of subtype homogeneity enhances analysis accuracy.
Corresponding author: Vivian Viallon
Abstract
The analysis of case-control studies with several subtypes of cases is increasingly common, e.g. in cancer epidemiology. For matched designs, we show that a natural strategy is based on a stratified conditional logistic regression model. Then, to account for the potential homogeneity among the subtypes of cases, we adapt the ideas of data shared lasso, which has been recently proposed for the estimation of regression models in a stratified setting. For unmatched designs, we compare two standard methods based on L1-norm penalized multinomial logistic regression. We describe formal connections between these two approaches, from which practical guidance can be derived. We show that one of these approaches, which is based on a symmetric formulation of the multinomial logistic regression model, actually reduces to a data shared lasso version of the other. Consequently, the relative performance of the two approaches critically depends on the level of homogeneity that exists among the subtypes of cases: more precisely, when homogeneity is moderate to high, the non-symmetric formulation with controls as the reference is not recommended. Empirical results obtained from synthetic data are presented, which confirm the benefit of properly accounting for potential homogeneity under both matched and unmatched designs. We also present preliminary results from the analysis of a case-control study nested within the EPIC cohort, where the objective is to identify metabolites associated with the occurrence of subtypes of breast cancer.
1 Introduction
The rise of -omics and other high-dimensional data (images, reimbursement claims, etc.) in medical science gives researchers access to numerous features that may predict outcomes of interest, such as cancer development. However, this relatively cheap source of information comes at a price: the curse of dimensionality makes multivariate modeling of such data impossible without further assumptions. In other words, some prior information has to be properly accounted for to reduce dimensionality and accurately estimate high-dimensional multivariate models. Under parametric regression models, one common piece of prior information, or assumption, is sparsity of the parameter vector. The use of $\ell_1$-norm regularized approaches has been shown to yield optimal estimates when the true parameter vector is sparse, under technical assumptions on the design matrix [Wainwright, 2009, Bach et al., 2010, Bickel et al., 2009]. As a result, penalized logistic models [McCullagh and Nelder, 1989, Park and Hastie, 2007, Wu et al., 2009] are now standard tools when studying risk factors of a disease in a high-dimensional setting.
However, for many diseases that were primarily considered as one single disease (breast cancer, colorectal cancer), several subtypes have now been recognized. They can either be histological, as for breast cancer, or anatomical, as for colorectal cancer. Even if commonalities may exist among these subtypes, they have their own specificities regarding both prognosis and etiology. For example, the cancer epidemiology community is now increasingly concerned with the identification of subtype specific risk factors for various cancer sites. This is the case in our motivating example presented in Section 5, which deals with the identification of metabolites associated with breast cancer subtypes, based on a matched case-control study nested in the EPIC (European Prospective Investigation into Cancer and nutrition) cohort study.
For unmatched case-control studies with multiple subtypes of cases, a natural extension of the binary logistic regression model is the multinomial logistic regression model [McCullagh and Nelder, 1989, Begg and Gray, 1984]. If $K$ denotes the number of subtypes for a given integer $K \geq 2$, inference under this model consists in estimating $K$ parameter vectors of size $p$, where $p$ denotes the number of covariates (which may include interactions as well as an intercept term). On the other hand, for matched case-control studies with $K$ subtypes, the total sample can be decomposed into $K$ sub-samples, one for each subtype. Assuming for simplicity a 1:1 matching design, each sub-sample is made of pairs composed of one case of one particular subtype and one matched control. Then, each sub-sample can be analyzed separately, e.g. by applying a sparse conditional logistic regression model [Avalos et al., 2015]. Again, the overall analysis boils down to the estimation of $K$ parameter vectors of size $p$.
Because commonalities exist between the subtypes of cases, some level of homogeneity is expected among those parameter vectors, in both the matched and unmatched settings. Properly accounting for this homogeneity is key to reduce the dimensionality and improve estimation efficiency. In the matched setting, sparse conditional logistic regression models have to be estimated on sub-groups, where these sub-groups are defined according to the subtype of the case of each pair (see Section 2 for more details). Then, inference falls into the framework of stratified regression modeling for which data shared lasso has recently been developed as a way to account for the expected homogeneity among the parameter vectors to be estimated [Gross and Tibshirani, 2016, Ollier and Viallon, 2017]. Under linear models, data shared lasso has been shown to enjoy good theoretical and empirical properties [Ollier and Viallon, 2017]. In addition, data shared lasso is easy to implement because it can be rewritten as a standard lasso after a simple transformation of the original data.
In this article, we show how the ideas of data shared lasso can be applied to analyze case-control studies when multiple subtypes of cases are present. In Section 2, we start with the matched design and show how data shared lasso can be used to estimate stratified sparse conditional logistic models. Section 3 is devoted to the unmatched setting, under which sparse multinomial logistic regression models are natural, as mentioned above. Actually, two formulations of sparse multinomial logistic regression models have been proposed in the literature. The first one, which we will refer to as the standard one, relies on the selection of a reference category and the estimation of $K$ parameter vectors, with $K$ the number of subtypes [Begg and Gray, 1984]. Alternatively, a more symmetric formulation of the model can be adopted, where no reference category has to be selected and $K+1$ parameter vectors are to be estimated [Friedman et al., 2010]. Unpenalized estimation is impossible under this over-parametrized model due to a clear lack of identifiability. However, $\ell_1$-penalized estimation can be performed, as implemented in the popular glmnet R package. To the best of our knowledge, no clear guidance exists in the literature on how to choose between the two formulations of sparse multinomial logistic regression models. We will formally establish that the $\ell_1$-penalized strategies associated with the two formulations differ in the way they account for potential homogeneity among the parameter vectors to be estimated. More precisely, we show in Section 3 that $\ell_1$-penalized estimates derived under the symmetric formulation coincide with the estimates derived under the standard formulation when using a data shared lasso penalty. In Section 4, we present results from a simulation study, which illustrate the interest of data shared lasso estimates when homogeneity exists among the parameter vectors to be estimated, under both the matched and unmatched settings. Section 5 is devoted to our illustrative example.
Finally, concluding remarks are provided in Section 6.
2 Matched case-control studies with multiple subtypes of cases and stratified conditional logistic models
Conditional logistic regression is a standard tool for the analysis of matched case-control studies when a single type of cases is considered [Pearce, 2016, Rothman et al., 2008]. Here, we show how the ideas of data shared lasso can be applied to handle the situation where $K$ subtypes of cases are present, for some given integer $K \geq 2$.
2.1 Setting
Consider a matched case-control study where information about subtype is available for each case. We denote the number of subtypes by $K$, for some given integer $K \geq 2$, and we will use the notation $[m] = \{1, \ldots, m\}$ for any integer $m \geq 1$. For simplicity, we further assume a 1:1 matched case-control design where matching is based on some variables. Denoting by $n$ the total sample size, the sample then consists of $n/2$ pairs of individuals. In this matched setting with $K$ subtypes of cases, the total sample can be divided into $K$ sub-samples. For any $k \in [K]$, the $k$-th sub-sample $S_k$ is made of the $n_k$ pairs composed of each case of subtype $k$ and the matched control. These sub-samples naturally define $K$ sub-groups, or strata, in the total sample. We should however stress that these strata differ from the “usual” strata defined in the context of conditional logistic regression models, the latter corresponding to case-control pairs. In addition, for future use, we introduce the categorical variable $Z$ which takes values in $[K]$ and indicates the sub-sample to which an observation belongs. In other words, $Z = k$ for all observations in $S_k$.
Let us first focus on the $k$-th stratum $S_k$, which is made of cases of subtype $k$ and their matched controls. For any matched pair $i \in [n_k]$ of observations belonging to $S_k$, denote by $x^{(k)}_{i1}, x^{(k)}_{i2} \in \mathbb{R}^p$, for some $p \geq 1$, and $y^{(k)}_{i1}, y^{(k)}_{i2} \in \{0, 1\}$, the two vectors of covariates and the two disease status indicators for the two observations of the pair. Then, the association between risk factors and subtype $k$ of the disease can be studied by applying a conditional logistic regression model, restricted to observations in stratum $S_k$. Assume without loss of generality that data are arranged in such a way that the observation indexed $1$ is the case in each pair $i$, that is $y^{(k)}_{i1} = 1$ and $y^{(k)}_{i2} = 0$ for all pairs $i \in [n_k]$. Then, as usual under the conditional logistic regression model, we assume the existence of a vector $\beta^*_k \in \mathbb{R}^p$ such that the probability that the case is the one observed in pair $i$, given that a case is observed in pair $i$, writes [Greenland, 2000]
\[ \mathbb{P}\big(y^{(k)}_{i1} = 1 \mid y^{(k)}_{i1} + y^{(k)}_{i2} = 1, x^{(k)}_{i1}, x^{(k)}_{i2}\big) = \frac{\exp(x^{(k)\top}_{i1}\beta^*_k)}{\exp(x^{(k)\top}_{i1}\beta^*_k) + \exp(x^{(k)\top}_{i2}\beta^*_k)} = \frac{1}{1 + \exp\{-(x^{(k)}_{i1} - x^{(k)}_{i2})^\top \beta^*_k\}}. \tag{1} \]
Vector $\beta^*_k$ can then be estimated by maximizing the log conditional likelihood restricted to pairs in $S_k$, which is defined as
\[ L_k(\beta_k) = -\sum_{i=1}^{n_k} \log\big\{1 + \exp(-X^{(k)}_i \beta_k)\big\}, \]
where $X^{(k)}_i$ denotes the $i$-th row of $X^{(k)}$, and $X^{(k)}$ is the $n_k \times p$ matrix whose $i$-th row corresponds to $(x^{(k)}_{i1} - x^{(k)}_{i2})^\top$, for $i \in [n_k]$.
Equivalently, estimation of each $\beta^*_k$ can be performed simultaneously, though still independently, by maximizing the criterion $L(\beta_1, \ldots, \beta_K)$ over $(\beta_1, \ldots, \beta_K) \in (\mathbb{R}^p)^K$, with
\[ L(\beta_1, \ldots, \beta_K) = \sum_{k=1}^{K} L_k(\beta_k). \tag{2} \]
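As a minimal numerical sketch of the log conditional likelihood for 1:1 matched pairs, the function below assumes that each pair has already been reduced to the covariate difference "case minus matched control". The names `log_cond_lik` and `pair_diffs` are illustrative choices, not taken from the paper or from any package.

```python
import numpy as np

def log_cond_lik(beta, pair_diffs):
    """Log conditional likelihood for 1:1 matched pairs in one stratum.

    pair_diffs: (n_pairs, p) array, one row per pair, equal to the
    covariates of the case minus those of its matched control
    (hypothetical layout chosen for this sketch).
    """
    scores = pair_diffs @ beta
    # log(1 / (1 + exp(-s))) = -log(1 + exp(-s)), evaluated stably
    return -np.sum(np.logaddexp(0.0, -scores))

rng = np.random.default_rng(0)
diffs = rng.normal(size=(5, 3))          # 5 pairs, 3 covariates
print(log_cond_lik(np.zeros(3), diffs))  # with beta = 0 every pair contributes log(1/2)
```

Summing this quantity over strata, with one parameter vector per stratum, would give the overall criterion; maximizing it separately or jointly yields the same unpenalized estimates.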
2.2 Standard $\ell_1$-norm penalized estimation
Several packages have been developed to maximize a penalized version of criterion (2): for instance, cLogitLasso is available within the R software [Avalos et al., 2015]; the cLogitL1 package [Reid and Tibshirani, 2014] can also be used, although it is no longer maintained on CRAN. For appropriate values of the regularization parameter $\lambda_k \geq 0$, they can be used to maximize the following criterion over $\beta_k \in \mathbb{R}^p$,
\[ L_k(\beta_k) - \lambda_k \|\beta_k\|_1, \]
where $L_k$ denotes the log conditional likelihood restricted to stratum $k$, to get a sparse estimate of $\beta^*_k$. They can also be used to maximize the “overall” criterion
\[ L(\beta_1, \ldots, \beta_K) - \sum_{k=1}^{K} \lambda_k \|\beta_k\|_1, \qquad L = \sum_{k=1}^{K} L_k, \tag{3} \]
over $(\beta_1, \ldots, \beta_K) \in (\mathbb{R}^p)^K$ to get a sparse estimate of $(\beta^*_1, \ldots, \beta^*_K)$. These two strategies are strictly identical and would return identical estimates. In particular, under both strategies, the estimation is performed independently on each stratum, that is independently for each subtype. This is likely sub-optimal when subtypes have commonalities. Indeed, these commonalities are expected to translate into some homogeneity among vectors $\beta^*_1, \ldots, \beta^*_K$, which may lead to improved estimation efficiency if properly accounted for.
2.3 Data shared lasso
Data shared lasso was independently proposed by Gross and Tibshirani (2016) [Gross and Tibshirani, 2016] and Ollier and Viallon (2017) [Ollier and Viallon, 2017] in the context of stratified regression models to account for the expected homogeneities among the parameter vectors to be estimated. The approach relies on the following over-parametrized decomposition for each parameter $\beta^*_{k,j}$, for $k \in [K]$ and $j \in [p]$,
\[ \beta^*_{k,j} = \mu^*_j + \gamma^*_{k,j}. \tag{4} \]
Here $\mu^*_j$ can be seen as the “global” parameter for covariate $j$ and is common to all subtypes, while $\gamma^*_{k,j}$ captures the variation of the parameter for subtype $k$ around this global parameter. Even if decomposition (4) is over-parametrized, estimates of $\mu^*_j$ and $\gamma^*_{k,j}$ for $k \in [K]$ and $j \in [p]$ can be derived by maximizing the following criterion over $\mu \in \mathbb{R}^p$ and the $\gamma_k$’s, with $\gamma_k \in \mathbb{R}^p$,
\[ L(\mu + \gamma_1, \ldots, \mu + \gamma_K) - \lambda_1 \|\mu\|_1 - \sum_{k=1}^{K} \lambda_{2,k} \|\gamma_k\|_1, \tag{5} \]
where $L$ is the criterion defined in (2). The $\ell_1$-norm penalty $\lambda_1\|\mu\|_1$ encourages sparsity of the vector of global parameters, while $\sum_{k} \lambda_{2,k}\|\gamma_k\|_1$ encourages homogeneity among vectors $\beta_k$ defined as $\beta_k = \mu + \gamma_k$, for $k \in [K]$. For appropriate values of the regularization parameters $\lambda_1$ and $\lambda_{2,k}$, data shared lasso allows the estimation of parameters under one of the infinitely many decompositions of the form (4). Any particular choice for the ratios $\lambda_{2,k}/\lambda_1$ leads to a particular “definition” of the estimated global parameter $\hat\mu_j$ for covariate $j$. Given this particular definition, data shared lasso returns estimates $\hat\beta_{k,j}$ that are typically close to $\hat\mu_j$ in the $\ell_1$-norm sense. For instance, if $\lambda_{2,k} = \lambda_1$ for all $k \in [K]$, it was shown that [Gross and Tibshirani, 2016, Ollier and Viallon, 2017]
\[ \hat\mu_j \in \arg\min_{m \in \mathbb{R}} \Big\{ |m| + \sum_{k=1}^{K} |\hat\beta_{k,j} - m| \Big\}, \quad \text{that is,} \quad \hat\mu_j = \operatorname{median}(0, \hat\beta_{1,j}, \ldots, \hat\beta_{K,j}). \]
In addition, several more standard approaches turn out to be special cases of data shared lasso. If $\lambda_{2,k} = \infty$ for all $k \in [K]$, then $\hat\gamma_k = 0_p$ for all $k$ and data shared lasso reduces to the approach that consists in pooling all strata together; we will refer to this strategy as “Pooled”. “Pooled” overlooks the subtype specificities and generally leads to biased estimates of vectors $\beta^*_k$. On the other hand, for large enough values of $\lambda_1$, we have $\hat\mu = 0_p$ and for appropriate values of parameters $\lambda_{2,k}$, data shared lasso reduces to estimating each vector $\beta^*_k$ independently just as in (3) above; we will refer to this strategy as “Indep”. “Indep” overlooks the commonalities among the subtypes, hence typically leads to estimates with unnecessarily high variance. Finally, setting $\lambda_{2,r} = \infty$ for one particular $r \in [K]$ corresponds to working under the constraint $\gamma_r = 0_p$. In this case, data shared lasso reduces to another standard approach which consists in first selecting subtype $r$ as a reference, and then including interaction terms between each covariate and the indicator variables of the strata $k \neq r$; we will refer to this strategy as “Ref”. Note that, for any particular choice $r$, the model complexity is naturally defined as the number of non-zero parameters to be estimated, that is $C_r = \|\beta^*_r\|_0 + \sum_{k \neq r} \|\beta^*_k - \beta^*_r\|_0$, with $\|\cdot\|_0$ standing for the $\ell_0$ pseudo-norm. Consequently, the model complexity and estimation efficiency of “Ref” critically depend on the arbitrary choice of the reference stratum, that is the reference subtype in our case; see [Ollier and Viallon, 2017] for more details. Data shared lasso by-passes this arbitrary choice and, under stratified linear regression models, was shown to perform nearly as well as the oracular (and inapplicable) version of “Ref” based on an optimal and covariate-specific choice for the reference stratum.
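To make the dependence of “Ref” on the reference stratum concrete, here is a small sketch that counts the non-zero parameters implied by each possible reference. The function name `ref_complexity` and the toy parameter values are assumptions made for illustration only.

```python
import numpy as np

def ref_complexity(betas, r):
    """Number of non-zero parameters when stratum r is the reference.

    betas: (K, p) array with one row of true parameters per stratum
    (illustrative layout). The count is the number of non-zero entries
    of the reference vector plus the number of entries in which every
    other stratum differs from the reference.
    """
    ref = betas[r]
    others = np.delete(betas, r, axis=0)
    return int(np.count_nonzero(ref) + np.count_nonzero(others - ref))

# Three strata, three covariates: strata 0 and 1 are identical,
# stratum 2 differs on the first covariate only.
betas = np.array([
    [1.0, 0.0, 2.0],
    [1.0, 0.0, 2.0],
    [3.0, 0.0, 2.0],
])
print([ref_complexity(betas, r) for r in range(3)])  # [3, 3, 4]
```

In this toy case, picking the atypical stratum as the reference costs one extra parameter, which is the kind of arbitrary penalty that data shared lasso is designed to avoid.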
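Another nice property of data shared lasso is that it is readily implementable given any standard lasso solver. In particular, the data shared lasso criterion (5) above can be rewritten as
\[ -\sum_{i} \log\big\{1 + \exp(-\mathcal{X}_i \theta)\big\} - \lambda_1 \|\theta\|_1, \]
with $\theta = (\mu^\top, \tau_1\gamma_1^\top, \ldots, \tau_K\gamma_K^\top)^\top \in \mathbb{R}^{(K+1)p}$, $\tau_k = \lambda_{2,k}/\lambda_1$, the sum running over the rows $\mathcal{X}_i$ of $\mathcal{X}$, and
\[ \mathcal{X} = \begin{pmatrix} X^{(1)} & X^{(1)}/\tau_1 & 0 & \cdots & 0 \\ X^{(2)} & 0 & X^{(2)}/\tau_2 & \cdots & 0 \\ \vdots & \vdots & & \ddots & \vdots \\ X^{(K)} & 0 & 0 & \cdots & X^{(K)}/\tau_K \end{pmatrix}, \]
where $X^{(k)}$ is the matrix of within-pair covariate differences (case minus control) for stratum $k$.
This criterion is exactly of the same form as (3): as a result, running cLogitLasso with the design matrix $\mathcal{X}$ returns a vector $\hat\theta$ from which data shared lasso estimates $\hat\beta_{k,j}$ can be derived for $k \in [K]$, $j \in [p]$.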
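The data transformation can be sketched as follows. This is only an illustration of the block structure, assuming finite positive ratios between the stratum-specific and global penalty levels; the helper names `data_shared_design` and `recover_betas` are hypothetical, and the resulting matrix would still have to be passed to a lasso solver of the reader's choice.

```python
import numpy as np

def data_shared_design(X_list, ratios):
    """Stack per-stratum matrices into the augmented design.

    X_list[k]: (n_k, p) matrix for stratum k.
    ratios[k]: stratum-specific penalty divided by the global penalty
               (assumed finite and > 0 in this sketch).
    Row block k is [X_k, 0, ..., X_k / ratios[k], ..., 0].
    """
    K = len(X_list)
    p = X_list[0].shape[1]
    blocks = []
    for k, X in enumerate(X_list):
        row = np.zeros((X.shape[0], (K + 1) * p))
        row[:, :p] = X                                    # global part
        row[:, (k + 1) * p:(k + 2) * p] = X / ratios[k]   # stratum-specific part
        blocks.append(row)
    return np.vstack(blocks)

def recover_betas(theta, K, p, ratios):
    """Map a solution theta of the plain lasso back to one vector per stratum."""
    mu = theta[:p]
    gammas = theta[p:].reshape(K, p) / np.asarray(ratios)[:, None]
    return mu + gammas  # row k is the global vector plus the deviation of stratum k

rng = np.random.default_rng(0)
X_list = [rng.normal(size=(n_k, 2)) for n_k in (4, 5, 6)]
design = data_shared_design(X_list, ratios=[1.0, 2.0, 0.5])
print(design.shape)  # (15, 8)
```

Rescaling the stratum-specific blocks is what lets a single penalty level on the stacked vector reproduce different penalty levels on the global and stratum-specific parts.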
We will illustrate the performance of data shared lasso when analyzing matched case-control studies with multiple subtypes of cases through simulated examples in Section 4, as well as through the analysis of a case-control study for breast cancer nested in the EPIC cohort in Section 5.
3 Unmatched case-control studies with multiple subtypes of cases and sparse multinomial logistic models
We now turn our attention to the unmatched setting. When $K$ subtypes of cases are present for some given integer $K \geq 2$, the outcome $Y$ can be modeled as a categorical variable, taking values in $\{0, 1, \ldots, K\}$. Hereafter, we will assume that $Y = 0$ for controls, while $Y = k$ for cases of subtype $k$, for any $k \in [K]$. When no natural order exists among the categories of $Y$, the multinomial logistic regression model is a natural extension of the standard logistic regression model. Below, we will recall some basics about the multinomial logistic regression model. In particular, we will present two formulations of this model under which $\ell_1$-norm penalized estimation can be performed. We will then establish the relationship between these two approaches, basing our arguments on the data shared lasso ideas.
3.1 The multinomial logistic regression model
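For ease of notation, we will mostly focus on models with no intercept. Then, in its symmetric formulation, the multinomial logistic regression model assumes the existence of $K+1$ vectors $\delta^*_0, \ldots, \delta^*_K \in \mathbb{R}^p$ such that
\[ \mathbb{P}(Y = k \mid x) = \frac{\exp(x^\top \delta^*_k)}{\sum_{l=0}^{K} \exp(x^\top \delta^*_l)} \tag{6} \]
for any value $x \in \mathbb{R}^p$ of the covariate vector, with $k \in \{0, \ldots, K\}$. Because $\sum_{k=0}^{K} \mathbb{P}(Y = k \mid x) = 1$ for any $x$, this formulation is over-parametrized and vectors $\delta^*_k$ in Equation (6) are defined up to a constant only. More precisely, if model (6) holds with vectors $(\delta^*_0, \ldots, \delta^*_K)$, then it holds with vectors $(\delta^*_0 + c, \ldots, \delta^*_K + c)$ as well, for any $c \in \mathbb{R}^p$.
To resolve this identifiability issue, a standard solution consists in selecting a reference category, say the $0$-th one without loss of generality. This leads to the constraint $\delta^*_0 = 0_p$ in the formulation above, and the multinomial logistic regression model then reduces to assuming the existence of $\beta^*_1, \ldots, \beta^*_K \in \mathbb{R}^p$ such that
\[ \mathbb{P}(Y = k \mid x) = \frac{\exp(x^\top \beta^*_k)}{1 + \sum_{l=1}^{K} \exp(x^\top \beta^*_l)} \ \text{ for } k \in [K], \qquad \mathbb{P}(Y = 0 \mid x) = \frac{1}{1 + \sum_{l=1}^{K} \exp(x^\top \beta^*_l)}. \tag{7} \]
Of course, the two formulations are strictly equivalent and from any “initial” vectors of parameters $(\delta^*_0, \ldots, \delta^*_K)$ satisfying Equation (6), Equation (7) holds with vectors $\beta^*_k$ defined as $\beta^*_k = \delta^*_k - \delta^*_0$, for $k \in [K]$.
Vectors $\beta^*_k$ in Equation (7) can be estimated through likelihood maximization. Assume the data consists of $n$ independent and identically distributed replica $(x_i, y_i)_{i \in [n]}$ with $x_i \in \mathbb{R}^p$ and $y_i \in \{0, \ldots, K\}$. Then, under model (7), the log-likelihood is defined for any $(\beta_1, \ldots, \beta_K) \in (\mathbb{R}^p)^K$ as
\[ \ell_{\mathrm{ref}}(\beta_1, \ldots, \beta_K) = \ell(0_p, \beta_1, \ldots, \beta_K), \tag{8} \]
where, for any collection of vectors $(b_0, \ldots, b_K) \in (\mathbb{R}^p)^{K+1}$, we set
\[ \ell(b_0, \ldots, b_K) = \sum_{i=1}^{n} \Big\{ \sum_{k=0}^{K} \mathbb{1}\{y_i = k\}\, x_i^\top b_k - \log \sum_{k=0}^{K} \exp(x_i^\top b_k) \Big\}. \tag{9} \]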
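The equivalence of the two formulations rests on the fact that adding the same vector to every category-specific parameter vector leaves all class probabilities unchanged. The short check below illustrates this numerically; `multinomial_probs` and the random inputs are assumptions made for the sketch.

```python
import numpy as np

def multinomial_probs(x, deltas):
    """Class probabilities under the symmetric formulation.

    deltas: (K + 1, p) array, row 0 for controls and rows 1..K for
    the subtypes (illustrative layout).
    """
    scores = deltas @ x
    scores = scores - scores.max()  # for numerical stability only
    weights = np.exp(scores)
    return weights / weights.sum()

rng = np.random.default_rng(1)
K, p = 3, 4
deltas = rng.normal(size=(K + 1, p))
x = rng.normal(size=p)
c = rng.normal(size=p)

shifted = deltas + c            # same shift applied to every vector
betas_ref = deltas - deltas[0]  # reference formulation: row 0 becomes zero

print(np.allclose(multinomial_probs(x, deltas), multinomial_probs(x, shifted)))
print(np.allclose(multinomial_probs(x, deltas), multinomial_probs(x, betas_ref)))
```

Subtracting the control vector from every row is one particular shift, which is why the reference formulation describes the same model with one fewer free vector.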
3.2 Sparse estimation under the standard formulation
A first sparse approach, that will be referred to as MultinomSparseRef here, simply consists in maximizing the $\ell_1$-norm penalized version of the log-likelihood defined in (8),
\[ \ell_{\mathrm{ref}}(\beta_1, \ldots, \beta_K) - \lambda \sum_{k=1}^{K} \|\beta_k\|_1, \tag{10} \]
where $\ell_{\mathrm{ref}}$ denotes the log-likelihood (8) and $\lambda \geq 0$ is a regularization parameter. Maximizers of (10) can be obtained via the algorithm described in [Krishnapuram et al., 2005]. Thanks to the well-known link between the multinomial log-likelihood and the conditional logistic log-likelihood [Hendrickx et al., 2000], they can also be obtained as solutions returned by package cLogitLasso [Avalos et al., 2015] after a simple modification of the original data.
3.3 Sparse estimation under the symmetric formulation
Package glmnet in R [Friedman et al., 2010] implements an $\ell_1$-penalized approach based on the symmetric formulation of the model, which will be referred to as MultinomSparseSym here. Parameter vectors $\delta^*_0, \ldots, \delta^*_K$ used under formulation (6) cannot be estimated by standard maximum likelihood estimation because of the aforementioned lack of identifiability. But because penalizing acts as constraining, estimates of $(\delta^*_0, \ldots, \delta^*_K)$ can be obtained as maximizers of the $\ell_1$-penalized version of the following log-likelihood
\[ \ell(\delta_0, \ldots, \delta_K) = \sum_{i=1}^{n} \Big\{ x_i^\top \delta_{y_i} - \log \sum_{k=0}^{K} \exp(x_i^\top \delta_k) \Big\}. \]
More precisely, package glmnet maximizes the following criterion over $(\delta_0, \ldots, \delta_K) \in (\mathbb{R}^p)^{K+1}$,
\[ \ell(\delta_0, \ldots, \delta_K) - \lambda \sum_{k=0}^{K} \|\delta_k\|_1 \]
for some appropriate value of the regularization parameter $\lambda \geq 0$. In [Friedman et al., 2010], it is shown that maximizers $(\hat\delta_0, \ldots, \hat\delta_K)$ of this criterion are such that
\[ \operatorname{median}(\hat\delta_{0,j}, \ldots, \hat\delta_{K,j}) = 0 \quad \text{for all } j \in [p]. \tag{11} \]
See the Appendix for an alternative proof of this result. Equation (11) establishes that penalizing by the $\ell_1$-norm under the symmetric formulation of the model implicitly solves the lack of identifiability for each covariate by constraining the median of its parameters across the $K+1$ categories to be null.
We shall recall that when intercepts are considered, as is often the case in practice, they are generally not penalized. Setting $\delta_k = (\alpha_k, \tilde\delta_k^\top)^\top$ where $\alpha_k$ stands for the intercept term for the $k$-th category, the penalty term then becomes $\lambda \sum_{k=0}^{K} \|\tilde\delta_k\|_1$. Then, identifiability issues are still present for the intercept terms under the symmetric formulation of the model. In the glmnet package, this is resolved by mean centering, which corresponds to imposing the constraint $\sum_{k=0}^{K} \alpha_k = 0$ [Friedman et al., 2010].
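The median constraint has a simple explanation: among all parameter sets that describe the same model (those obtained by adding a common vector to every category), the one with the smallest $\ell_1$ penalty is centered at the per-covariate median. The sketch below checks this on random numbers; it does not call glmnet, and `median_center` and `l1_penalty` are illustrative names.

```python
import numpy as np

def l1_penalty(deltas):
    """Sum of absolute values over all categories and covariates."""
    return np.abs(deltas).sum()

def median_center(deltas):
    """Subtract, for each covariate, its median across categories."""
    return deltas - np.median(deltas, axis=0)

rng = np.random.default_rng(2)
deltas = rng.normal(size=(5, 3))  # 5 categories, 3 covariates
centered = median_center(deltas)

# Any other common shift gives a penalty at least as large.
worst_gap = min(
    l1_penalty(deltas + rng.normal(size=3)) - l1_penalty(centered)
    for _ in range(1000)
)
print(worst_gap >= 0.0)
```

Because the likelihood is unchanged by a common shift while the penalty is not, a penalized maximizer has to sit at a shift that minimizes the penalty, which is the median-centered one.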
3.4 Relationship between MultinomSparseSym and MultinomSparseRef
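Consider the standard formulation (7). When $Y = 0$ stands for controls and $Y = k$ stands for cases of subtype $k$ for $k \in [K]$, some level of homogeneity among vectors $\beta^*_k$ is often expected. The ideas of data shared lasso can then be applied. When combined with MultinomSparseRef, data shared lasso first consists in considering the decomposition $\beta_{k,j} = \mu_j + \gamma_{k,j}$ for $k \in [K]$ and $j \in [p]$, and then maximizing the following criterion, over $\mu \in \mathbb{R}^p$ and the $\gamma_k$’s, with $\gamma_k \in \mathbb{R}^p$,
\[ \ell_{\mathrm{ref}}(\mu + \gamma_1, \ldots, \mu + \gamma_K) - \lambda \Big( \|\mu\|_1 + \sum_{k=1}^{K} \|\gamma_k\|_1 \Big), \]
where $\ell_{\mathrm{ref}}$ denotes the log-likelihood (8).
We will refer to this approach as MultinomDataSharedRef hereafter. This criterion is of the same form as (5), for the particular choice $\lambda_1 = \lambda_{2,k} = \lambda$ for all $k \in [K]$. Now, denote by $(\hat\delta_0, \ldots, \hat\delta_K)$ and $(\hat\mu, \hat\gamma_1, \ldots, \hat\gamma_K)$ the solutions returned by MultinomSparseSym and MultinomDataSharedRef, respectively. In the Appendix we show that $\hat\mu = -\hat\delta_0$ and $\hat\gamma_k = \hat\delta_k$ for all $k \in [K]$. This result formally establishes the equivalence between MultinomDataSharedRef and MultinomSparseSym: working under formulation (6) with an $\ell_1$-norm penalty, as implemented in the glmnet package, exactly corresponds to working under formulation (7) with a data shared lasso penalty (for the particular choice $\lambda_1 = \lambda_{2,k}$ for all $k$) to encourage homogeneity among vectors $\beta^*_k$.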
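The reason behind this equivalence can be sketched in a few lines; this is an informal summary and not the proof given in the paper's Appendix. Write $\delta_0, \ldots, \delta_K$ for the vectors of the symmetric formulation, $\beta_1, \ldots, \beta_K$ for those of the reference formulation with controls as reference, and $\mu$, $\gamma_k$ for the data shared decomposition of $\beta_k$. The change of variables in the first line maps one criterion onto the other, since it preserves both the fitted probabilities and the penalty.

```latex
\begin{align*}
\mu &= -\delta_0, \qquad \gamma_k = \delta_k \quad (k \in [K]),\\
\beta_k &= \mu + \gamma_k = \delta_k - \delta_0,\\
\frac{\exp(x^\top \delta_k)}{\sum_{l=0}^{K} \exp(x^\top \delta_l)}
  &= \frac{\exp\{x^\top (\delta_k - \delta_0)\}}{1 + \sum_{l=1}^{K} \exp\{x^\top (\delta_l - \delta_0)\}}
   = \frac{\exp(x^\top \beta_k)}{1 + \sum_{l=1}^{K} \exp(x^\top \beta_l)},\\
\lambda \Big( \|\mu\|_1 + \sum_{k=1}^{K} \|\gamma_k\|_1 \Big)
  &= \lambda \sum_{k=0}^{K} \|\delta_k\|_1 .
\end{align*}
```

The third line shows that the likelihood terms agree, and the fourth that the penalties agree, so the two criteria take the same value at corresponding points.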
To get a better understanding of the relationship between MultinomSparseSym and MultinomSparseRef, denote by $(\tilde\delta_0, \ldots, \tilde\delta_K)$ maximizers of the criterion $\ell(\delta_0, \ldots, \delta_K) - \lambda \sum_{k=1}^{K} \|\delta_k - \delta_0\|_1$, where $\ell$ is the log-likelihood under the symmetric formulation (6). In the Appendix, we show that $\tilde\delta_k - \tilde\delta_0 = \hat\beta_k$ for $k \in [K]$, where $(\hat\beta_1, \ldots, \hat\beta_K)$ are estimates returned by MultinomSparseRef, that is maximizers of (10). Therefore, applying MultinomSparseRef after selecting the $0$-th category as the reference corresponds to working under the symmetric formulation and encouraging similarities between $\delta_0$ and the other vectors $\delta_k$ for $k \in [K]$. This strategy is expected to perform best if $\sum_{k=1}^{K} \|\delta^*_k - \delta^*_0\|_0$ is small. In other words, while the choice of the reference category has no effect whatsoever when estimation is done by maximizing the unpenalized log-likelihood (8), this choice is critical for MultinomSparseRef, that is when the $\ell_1$-penalized log-likelihood (10) is maximized. This is closely related to our discussion about the performance of the “Ref” strategy described in Section 2.3 under matched designs (and more generally under stratified regression models), which also critically depends on the arbitrary choice for the reference stratum. For illustration, consider the following toy example where $\beta^*_k = \beta^*$ for all $k \in [K]$, for some $\beta^* \neq 0_p$. When $Y = 0$ indicates controls while $Y = k$ for $k \in [K]$ indicates cases of subtype $k$, this situation arises when all subtypes are actually identical. Then we have $C_0 = K \|\beta^*\|_0$ while $C_r = \|\beta^*\|_0$ for any $r \in [K]$, with $C_r$ standing for the model complexity when setting the reference category to $r$ before applying MultinomSparseRef. In this example, category $0$ is the worst choice for the reference when using MultinomSparseRef, even if it would be regarded as the most natural choice by many practitioners.
4 Simulation study
4.1 The matched setting
We performed a simulation study to assess the performance of data shared lasso in the context of matched case-control studies when subtypes of cases are present. We compared it with two more standard strategies: Indep and Ref. For the latter, the first subtype was selected as the subtype of reference. In addition, we implemented a cross-validation technique similar in spirit to the one-step lasso [Bühlmann and Meier, 2008] to select optimal regularization parameters and obtain final parameter estimates. To save computational time, data shared lasso and Ref were implemented with one particular choice for only, that is for all .
We set the number of observations to and the number of covariates was set to . Covariates were randomly generated under a multivariate Gaussian distribution , where . Pairs of observations were then created and randomly assigned to one stratum in such a way that , and for . Within each pair of each stratum , the response variable was then generated according to Equation (1), while was set to . As for parameters , they were defined as follows. One subset was first randomly selected, with . For , we set for all . For , four configurations were considered, allowing the level of homogeneity among to vary. In the first configuration (full homogeneity), we set , for some and with . In the second configuration (weak heterogeneity), for , we randomly select one , set for and , with each and . In the third configuration (moderate heterogeneity), we randomly select three indices , set and for , with again and . Finally, in the fourth configuration (full heterogeneity), we set for with again and . In each configuration, parameter varied in to study the impact of signal strength on the performance of the approaches.
One simulation design here corresponds to one particular combination of the value of the signal strength parameter and the level of heterogeneity. Fifty replications of each simulation design were performed, and the results presented below are averages of the considered criteria over these 50 replicates for each approach.
Figure 1 presents the results regarding support recovery of the parameter matrix $(\beta^*_1, \ldots, \beta^*_K)$ (AccS; the higher, the better), the identification of heterogeneities among vectors $\beta^*_k$, for $k \in [K]$ (AccH; the higher, the better), as well as prediction error (Pred.Err; the lower, the better). Overall, the performance of Indep does not depend on the level of homogeneity, while that of DataShared and Ref typically improves with the homogeneity level. This was expected since Indep does not account for homogeneity, while DataShared and Ref do. In case of full homogeneity among vectors $\beta^*_k$ (Configuration 1), Ref and DataShared perform similarly regarding the three criteria: they perform as well as Pooled, and clearly outperform Indep. The similar performance of Ref and DataShared was expected in this particular case where the model complexity defined in Section 2.3 does not depend on the choice of the reference stratum. In case of full heterogeneity (Configuration 4), DataShared and Ref again perform similarly, as expected since the model complexity still does not depend on the choice of the reference stratum. Of course, they do not perform better than Indep in this case, but it is noteworthy that they do not perform worse either. In Configurations 2 and 3 (weak and moderate heterogeneities), data shared lasso generally leads to the best results regarding prediction error and, to a lesser extent, support recovery and identification of the heterogeneities. In particular, it outperforms Ref, which confirms that by by-passing the arbitrary choice of the reference category, data shared lasso generally better accounts for homogeneity than Ref does when such homogeneity exists. These results are consistent with those obtained when evaluating data shared lasso under linear regression models [Ollier and Viallon, 2017] and binary graphical models [Ballout and Viallon, 2017].
4.2 The unmatched setting
We also performed a simulation study in the unmatched setting to illustrate the relative interests of MultinomSparseRef and MultinomSparseSym (the latter being the same as MultinomDataSharedRef) depending on the level of homogeneity among vectors under formulation (7) or, equivalently, among vectors under formulation (6). Again, we chose . To save computational time, and because conclusions were consistent with those drawn in the matched case, a low-dimensional setting with and was considered here. For data generation, we adapted the framework described in Section 4.1 to the unmatched setting using formulation (7). We used intercept terms, , chosen in such a way that and ranged from 0.05 to 0.2 for . In this low-dimensional setting, regularization parameters were selected as minimizers of the BIC after adapting the Lasso-OLS hybrid ideas to our context [Efron et al., 2004].
Figure 2 presents the results in this unmatched setting. They confirm that using data shared lasso (or, equivalently, the symmetric formulation) allows the homogeneity to be accounted for when present, which translates into better predictive performance, support recovery and identification of the heterogeneities. We shall also stress that even in the case of full heterogeneity, MultinomSparseSym performs as well as MultinomSparseRef, just as data shared lasso and Ref did in the matched setting.
We further investigated in more detail the poor performance of MultinomSparseRef. We focused on the particular case of full homogeneity among vectors under formulation (7). For one sample generated under Configuration 1 (full homogeneity) with (corresponding to a large signal strength), we computed criteria AccS and AccH for the sequence of parameter vector estimates returned by MultinomSparseRef and MultinomSparseSym for varying values of the regularization parameter on appropriate grids . Here was set, as usual, as the minimal value for which the considered method returned a null parameter vector. MultinomSparseRef was actually run with two particular choices for the reference category. We primarily chose category as in Figure 2. We recall that this choice is quite natural when category corresponds to controls. We also recall that in this case of full homogeneity among vectors , we have while for any . We then also implemented MultinomSparseRef with reference category set to . Results returned by these two versions of MultinomSparseRef were compared to those returned by MultinomSparseSym (or equivalently, MultinomDataSharedRef). In each panel of Figure 3, each point represents values for AccS (-axis) and AccH (-axis) over the grid of regularization parameters used for the corresponding method. The choice of controls as the reference category (left panel, Ref), though standard, prevents MultinomSparseRef from visiting models with AccS greater than 0.75 whatever the value of the regularization parameter. On the other hand, choosing any subtype of cases as the reference (center panel, Ref here) allows MultinomSparseRef to visit models with higher values for both AccS and AccH. Models visited by MultinomSparseSym are very similar to those visited by MultinomSparseRef with the optimal reference category.
These results confirm that the performance of MultinomSparseRef critically depends on the arbitrary choice of the reference category when homogeneity is high, and MultinomSparseSym (resp., equivalently, MultinomSparseRef with a data shared lasso penalty) by-passes (resp. corrects) the arbitrary choice of the reference category, and allows the visit of nearly the same models as those visited when applying MultinomSparseRef with the optimal choice for the reference category.
5 Application
5.1 The data
The European Prospective Investigation into Cancer and Nutrition (EPIC) study is an ongoing multicenter prospective study aiming to investigate the etiology of cancer in relation to diet, lifestyle and environmental factors, and for which the study design has been previously described in detail [Riboli et al., 2002]. From 1992 to 2000, a total of 521,324 participants were recruited across 10 European countries, mostly from the general population; 70% of them are women, and participants were aged from 35 to 70 years. Among these participants, 246,000 women provided a blood sample at inclusion. Here, we present preliminary results from the analysis of a case-control study nested in EPIC, whose main objective was to assess the association between metabolites and the risk of subtypes of breast cancer. A total of 1635 cases of breast cancer were included, along with 1635 matched controls (using incidence density sampling). For all these individuals, plasma samples collected at inclusion in the study were analyzed by mass spectrometry (AbsoluteIDQ p180 Kit), allowing the measurement of the levels of 127 metabolites. Those metabolites have been anonymized here since biological interpretation is out of the scope of this preliminary analysis. We considered six subtypes for cases, based on the presence/absence of hormone receptors: HER2-enriched, triple negative, Luminal A PR+, Luminal A PR-, Luminal B PR+ and Luminal B PR-.
5.2 Results
We estimated sparse conditional logistic regression models based on the Indep, Pooled, Ref and data shared lasso strategies described in Section 2. For the Ref strategy, Luminal A PR+ was chosen as the reference subtype, which we believe would be considered as a natural choice by most practitioners because it is the most common subtype. Results are presented in Figure 4, where only metabolites identified as potential predictors of at least one breast cancer subtype by at least one approach have been retained. As expected, using either Ref or data shared lasso leads to much more interpretable results than the Indep and Pooled strategies when the objective is to identify potential heterogeneities across subtypes. Data shared lasso allows the identification of a few heterogeneities, in particular for the most common subtype, Luminal A PR+. Interestingly, Ref was not able to identify any heterogeneities for this subtype: this is because it was used as the reference subtype. We shall however mention that no notable difference was observed in terms of prediction errors when comparing the models returned by Pooled, Ref and data shared lasso (Indep was slightly worse than its competitors). This can be explained by the fact that the association between the metabolites and subtypes of breast cancer is rather limited. We still believe that this application nicely illustrates the potential benefit of the data shared lasso strategy, which may help hierarchize the most probable heterogeneities between subtypes: in the present example, M96 might be of particular interest for Luminal A PR-, while M18, M27, M42, M43, M63 and M111 might be specific to Luminal A PR+.
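For readers who want a concrete picture of the data shared lasso parametrization in a matched design, the sketch below shows one possible way to encode it. It is not the code used for this analysis: it assumes 1:1 matching, a decomposition of each subtype-specific vector into a shared part `mu` plus a subtype-specific part `gamma_k`, and unit penalty weights; the names `augmented_row` and `pair_loglik` and all numerical values are hypothetical.

```python
import math

def augmented_row(x_case, x_control, subtype, n_subtypes):
    """Within-pair difference, copied into a shared block and into the block of its subtype."""
    d = [a - b for a, b in zip(x_case, x_control)]
    row = list(d)                          # shared block, multiplies mu
    for k in range(1, n_subtypes + 1):     # one block per subtype, multiplies gamma_k
        row += d if k == subtype else [0.0] * len(d)
    return row

def pair_loglik(row, theta):
    """Conditional log-likelihood contribution of one 1:1 matched pair."""
    eta = sum(r * t for r, t in zip(row, theta))
    return -math.log1p(math.exp(-eta))

# Hypothetical parameters: a shared effect and two subtype-specific deviations.
mu = [0.4, 0.0]
gammas = {1: [0.0, 0.0], 2: [0.0, -0.3]}
theta = mu + gammas[1] + gammas[2]         # stacked vector (mu, gamma_1, gamma_2)

# One matched pair whose case belongs to subtype 2.
row = augmented_row([1.0, 2.0], [0.5, 1.0], subtype=2, n_subtypes=2)
eta = sum(r * t for r, t in zip(row, theta))

# Same linear predictor computed directly from mu + gamma_2.
d = [0.5, 1.0]
eta_direct = sum(di * (m + g) for di, m, g in zip(d, mu, gammas[2]))
print(eta, eta_direct, pair_loglik(row, theta))
```

Under these assumptions, an L1 penalty on the stacked vector `theta` penalizes the shared and the subtype-specific parts at once, so a standard sparse conditional logistic solver applied to the augmented rows would return estimates of both components.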
6 Discussion
We considered the analysis of case-control studies when several subtypes of cases exist, which is increasingly common in cancer epidemiology. Considering both matched and unmatched settings, we showed that data shared lasso is a simple approach that accounts for commonalities among the subtypes, when present, and improves estimation efficiency. In the unmatched setting, our observations provide practical guidance on how to choose between the two formulations of sparse multinomial logistic regression models, MultinomSparseSym and MultinomSparseRef. If a high level of homogeneity exists among vectors (or , then estimation efficiency is expected to be much higher when working with MultinomSparseSym (or equivalently MultinomDataSharedRef).
The estimation of several parameter vectors considered here is closely related to multi-task learning [Evgeniou and Pontil, 2004], for which a number of other structured sparsity inducing norms have been proposed in the literature, including the group lasso and generalized fused lasso [Lounici et al., 2011, Viallon et al., 2016]. However, we shall mention that the group lasso is not well suited when the identification of heterogeneities is of primary interest. On the other hand, the generalized fused lasso has shown good properties in the context of stratified regression models, under generalized linear models [Viallon et al., 2016], survival models [Sennhenn-Reulen and Kneib, 2016] and binary graphical models [Ballout and Viallon, 2017]. Its extension to conditional logistic regression models or multinomial logistic models will be the focus of future work.
Acknowledgements
This work was partially supported by the French National Cancer Institute (L’Institut National du Cancer; INCA) (grant number 2015-166; PI: S. Rinaldi). The authors are grateful to the Principal Investigators of each of the EPIC centres for sharing the data we used in our illustrative example.
Appendix A Additional technical details
A.1 Proof of (11)
For any , maximizers of the penalized criterion considered in the glmnet package are such that:
[TABLE]
Therefore, for all which establishes (11).
A.2 Equivalence between MultinomDataSharedRef and MultinomSparseSym
With the particular choice for all , MultinomSparseSym consists in maximizing the criterion . For any given , set and for all . Then we have
[TABLE]
which is exactly the criterion maximized by MultinomDataSharedRef for the particular choice for all .
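The change of variables behind this equivalence can be checked numerically on a toy example. The sketch below is an illustration under assumptions rather than a transcription of the derivation: it takes the symmetric formulation to have one vector per category (category 0 included), sets the shared part to `mu = -beta_0` and the specific parts to `gamma_k = beta_k`, and uses unit weights in both penalties; the function names and numerical values are hypothetical.

```python
import math

def softmax_probs(x, vectors):
    """P(Y = k | x) proportional to exp(x^T vectors[k])."""
    scores = [sum(a * b for a, b in zip(x, v)) for v in vectors]
    top = max(scores)                      # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def l1(v):
    """L1 norm of a vector."""
    return sum(abs(a) for a in v)

# Symmetric parametrization: one vector per category, category 0 included.
betas = [[0.5, -1.0], [1.0, 0.0], [0.5, 2.0]]

# Assumed change of variables: shared part and category-specific parts.
mu = [-b for b in betas[0]]
gammas = betas[1:]

# Reference-based vectors delta_k = mu + gamma_k, with a null vector for category 0.
deltas = [[0.0, 0.0]] + [[m + g for m, g in zip(mu, gam)] for gam in gammas]

x = [0.3, -0.7]
p_sym = softmax_probs(x, betas)            # probabilities, symmetric formulation
p_ref = softmax_probs(x, deltas)           # probabilities, reference formulation

pen_sym = sum(l1(b) for b in betas)                  # lasso penalty, symmetric formulation
pen_dsl = l1(mu) + sum(l1(g) for g in gammas)        # data shared lasso penalty
print(p_sym, p_ref, pen_sym, pen_dsl)
```

Both the fitted probabilities and the penalty terms coincide in this example, which is the mechanism that makes the two penalized criteria agree under the stated assumptions.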
A.3 Matrix formulation of the log-likelihood (8)
Denote the indicator function by . For , introduce with , for all . Further introduce the vector of binary variables and the matrix defined as
[TABLE]
where is the matrix containing the observations of the predictors. Finally set the vector of length whose components are all equal to 1, and the matrix each of whose blocks is the identity matrix of order , . Then, setting , the log-likelihood (8) can be rewritten more compactly as
[TABLE]
References
- [Avalos et al., 2015] Avalos, M., Pouyes, H., Grandvalet, Y., Orriols, L., and Lagarde, E. (2015). Sparse conditional logistic regression for analyzing large-scale matched data from epidemiological studies: a simple algorithm. BMC Bioinformatics, 16(6):S1.
- [Bach et al., 2010] Bach, F. et al. (2010). Self-concordant analysis for logistic regression. Electronic Journal of Statistics, 4:384–414.
- [Ballout and Viallon, 2017] Ballout, N. and Viallon, V. (2017). Structure estimation of binary graphical models on stratified data: application to the description of injury tables for victims of road accidents. arXiv preprint arXiv:1709.10298.
- [Begg and Gray, 1984] Begg, C. B. and Gray, R. (1984). Calculation of polychotomous logistic regression parameters using individualized regressions. Biometrika, 71(1):11–18.
- [Bickel et al., 2009] Bickel, P. J., Ritov, Y., and Tsybakov, A. B. (2009). Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, pages 1705–1732.
- [Bühlmann and Meier, 2008] Bühlmann, P. and Meier, L. (2008). Discussion of "One-step sparse estimates in nonconcave penalized likelihood models" by H. Zou and R. Li. Ann. Statist., 36:1534–1541.
- [Efron et al., 2004] Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least angle regression (with discussion). The Annals of Statistics, 32:407–499.
- [Evgeniou and Pontil, 2004] Evgeniou, T. and Pontil, M. (2004). Regularized multi-task learning. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 109–117. ACM.
