Scalable Holistic Linear Regression
Dimitris Bertsimas, Michael Lingzhi Li

TL;DR
This paper introduces a scalable holistic linear regression algorithm that models significance and multicollinearity as lazy constraints, significantly improving scalability and accuracy over previous methods.
Contribution
The paper presents a novel theory and algorithm that enhance scalability and performance of holistic linear regression by modeling key conditions as lazy constraints.
Findings
Scales with thousands of samples, outperforming previous methods.
Improves accuracy and reduces false detection rate.
Reduces computational time significantly.
Abstract
We propose a new scalable algorithm for holistic linear regression building on Bertsimas & King (2016). Specifically, we develop new theory to model significance and multicollinearity as lazy constraints rather than checking the conditions iteratively. The resulting algorithm scales with the number of samples in the 10,000s, compared to the low 100s in the previous framework. Computational results on real and synthetic datasets show it greatly improves from previous algorithms in accuracy, false detection rate, computational time and scalability.
| Noise | ACC | FPR | Time | |||||
| 1000 | 100 | 3 | 1 | 1 | 0.27s | |||
| 1000 | 500 | 3 | 1 | 1 | 2.37s | |||
| 1000 | 1000 | 3 | 1 | 1 | 20.23s | |||
| 1000 | 500 | 5 | 3 | 2 | 33.40s | |||
| 1000 | 1000 | 5 | 3 | 2 | 5940.56s | |||
| 1000 | 500 | 3 | 1 | 1 | 2.29s | |||
| 1000 | 1000 | 3 | 1 | 1 | 32.17s |
| Dataset | Method | Loss | Sign. | MA | ||||
|---|---|---|---|---|---|---|---|---|
| Airfoil | 1502 | 5 | Holistic | 3 | 558 | 50s | - | |
| Tamura | 4 | 562 | 19s | - | ||||
| Chung | 3 | 558 | 107s | - | ||||
| Lasso | 4 | 570 | 7s | - | ||||
| Bootstrap | 4 | 564 | 39870s | - | ||||
| Cancer | 568 | 29 | Holistic | 7 | 1.71 | 54s | - | |
| Tamura | 11 | 1.90 | 410s | - | ||||
| Chung | 7 | 1.71 | 310s | - | ||||
| Lasso | 23 | 0.72 | 10s | - | ||||
| Bootstrap | 12 | 2.22 | 60000s | - | ||||
| Parkinsons | 5875 | 16 | Holistic | 1 | 533 | 403s (15.2s) | ||
| Tamura | 1 | 533 | ||||||
| Chung | 3 | 549 | 876s | |||||
| Lasso | 3 | 522 | 14s | |||||
| Bootstrap | 3 | 571 | 60000s | |||||
| Air Quality | 9358 | 12 | Holistic | 4 | 89.2 | 380s | - | |
| Tamura | N/A | N/A | N/A | N/A | - | |||
| Chung | 6 | 96.1 | 770s | - | ||||
| Lasso | 9 | 83.7 | 11s | - | ||||
| Bootstrap | 5 | 89.6 | 58146s | - | ||||
| Crime | 2215 | 125 | Holistic | 9 | 180 | 725s (120.3s) | ||
| Tamura | N/A | N/A | N/A | N/A | N/A | |||
| Chung | 11 | 207 | 1506s | |||||
| Lasso | 19 | 172 | 21s | |||||
| Bootstrap | 12 | 195 | 60000s |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
MethodsLinear Regression
Scalable Holistic Linear Regression
Dimitris Bertsimas
Sloan School of Management and Operations Research Center, Massachusetts Institute of Technology, Cambridge, MA 02139
Michael Lingzhi Li
Operations Research Center, Massachusetts Institute of Technology, Cambridge, MA 02139
Abstract
We propose a new scalable algorithm for holistic linear regression building on Bertsimas & King (2016). Specifically, we develop new theory to model significance and multicollinearity as lazy constraints rather than checking the conditions iteratively. The resulting algorithm scales with the number of samples in the 10,000s, compared to the low 100s in the previous framework. Computational results on real and synthetic datasets show it greatly improves from previous algorithms in accuracy, false detection rate, computational time and scalability.
keywords:
Holistic Linear Regression , Multicollinearity and Significance in Linear Regression , Mixed-Integer Optimization
††journal: Operation Research Letters
1 Introduction
In this paper, we continue the research program initiated in [1] to develop an algorithmic approach for holistic linear regression in which we impose desirable properties simultaneously and a priori. Using mixed integer optimization (MIO) the earlier proposal modeled sparsity, pairwise collinearity and group sparsity using explicit constraints but accounted for significance and multicollinearity through a cutting plane and bootstrap approach. The difficulty of using the cutting plane method is that it often requires a large number of iterations to ensure that the model has appropriate significance and does not exhibit multicollinearity. This results in an algorithm that does not scale beyond , the number of samples, in the low 100s when accounting for significance and multicollinearity.
Our goal in this paper is to propose a new scalable algorithm for holistic linear regression. We propose a new way to impose significance and multicollinearity constraints explicitly that scales with in the 10,000s. This allows us to build linear regression models much more effectively and accurately than in earlier works.
In our view, scalable holistic regression is important at it allows linear regression models to have interpretability, robustness, significance and accuracy in a systematic way. In contrast, today the practice of regression is more of an art than science. Continuing on the vision in [1], the paper aspires to scale holistic regression further and make these methods easier to use in much larger problems.
The standard methodology for imposing significance in linear regression is to use the Student -statistic. However, the test is carried out after the linear regression model has been calculated, and does not optimally select a subset of covariates that are significant a priori. In the [1] framework, summarized in Section 2, significance is imposed iteratively leading to a cutting plane algorithm. [2] explored significance of coefficients by adding heuristic constraints to set lower bounds on the coefficients. [3] used lazy constraints to ensure exact significance tests while deriving theoretical bounds for minimum power. In contrast to [2] and similar to [3], we use lazy constraints to ensure minimum power. However, instead of using the -statistic, we appeal to the asymptotic normality results instead.
For multicollinearity, in a landmark paper [4] comprehensively reviewed the problem and concluded that there is no accepted way of dealing with this problem, citing “there is a lack of attention for this problem in the statistics community.” Various methods employed include principal component analysis to select the top variables to avoid multicollinear combinations, and variance inflation factors [5] that provide a numerical quantity to determine how much the variance of a coefficient has been increased due to correlation with other variables. [6, 7] explored incorporating multicollinearity constraints using variance inflation factors (VIFs) and condition numbers (CNs) respectively. However, both of these concepts are only approximations of true multicollinear relationships. It is true that multicollinear relationships are sufficient for high VIF and CN, but they are not necessary, as shown in [8] and [9], respectively. That means constraining on VIF or CN would potentially produce extra constraints that are not needed for solving multicollinearity. In this paper, we introduce new theory that provides both necessary and sufficient guarantees in relation to detecting multicollinearity.
Specifically, our contributions in this paper are as follows:
We continue the program in [1] and extend the formulation with significance constraints a priori. 2. 2.
We develop a new theory of detecting multicollinearity by connecting multicollinearity to the eigenvectors of the design matrix , where is the matrix of the given data and use it to impose multicollinearity constraints within an MIO framework 3. 3.
We present computational results on real and synthetic datasets that suggest the overall algorithm for holistic regression scales with in the 10,000s, while the method in [1] scales with in the low 100s when accounting for significance and multicollinearity.
The structure of the paper is as follows. In Section 2, we review the work in [1] on constructing a holistic framework for linear regression. In Section 3, we introduce the -statistic formulation to model significance. In Section 4, we introduce a new formulation to model multicollinearity and present computational results with synthetic and real-world data that show the effectiveness of the method. In Section 5, we combine our proposals with the framework introduced in [1], and compare its performance with recent models in the literature using real-world data.
2 The Framework of [1]
Given data , , , , [1] propose the following MIO:
[TABLE]
The term in the objective function (1) models robustness as seen in [10], who established the equivalence of the penalization and robustness.
Constraints (2) and (3) model sparsity using the Big- framework that at most out of the variables are selected in the linear regression model. In this paper, the specification of follows from that in [1]. Constraint (4) models group sparsity, i.e., variables in the set are either all selected or none is selected. Finally, pairwise collinearity is modeled in Constraint (2) where HC is the set
[TABLE]
for some predefined correlation cutoff.
[1] apply the following iterative process to include constraints for significance and multicollinearity:
Solve the MIO (1)-(2) to obtain a subset of the coefficients . 2. 2.
For the set the algorithm computes the significance levels for each of the variables via bootstrap methods, and calculates the condition number of the model. If a set produces undesirable results – a condition number higher than desired, or a model with insignificant variables – the algorithm generates the constraint
[TABLE]
to exclude the set from consideration. The algorithm adds the constraint to Problem (1) and repeats the process until no such set is found.
[1] report computational results that demonstrate that Model (1) is effective to solve problems up to in the 1000s. However, when we include significance and multicollinearity constraints in a cutting plane methodology, the method scales up to in the low 100s and some times no solution is found after considerable computation time.
3 Imposing Significance Constraints
Variable significance has long been one of the most important elements in linear regression, and has served as a proxy for variable selection and causality studies.
We first restate a standard result about the asymptotic guarantee of the normality of the least squares estimate of to serve as the basis of our approach. For a linear regression problem:
[TABLE]
We have the following theorem, as proven in [11]:
Theorem 1**.**
If is iid with and for all , and is invertible, then we have:
[TABLE]
Where is the least squares estimate of with
[TABLE]
Note that normality is not part of the assumption here - in contrast to the -test statistics used in [3]. Therefore, using such asymptotic results, we would assume when large enough, we have that:
[TABLE]
Where and
[TABLE]
is the least squares estimate of standard deviation .
3.1 Constructing Significance Constraints
For a test of size , we first define the quantity , the inverse cdf of the distribution at point . Then, we can impose the normality test by requiring:
[TABLE]
Where is the model matrix constrained to the columns where . This is equivalent to the big -constraints:
[TABLE]
where is a large constant. These two constraints are used to model significance of level without the need of the bootstrap. As the model matrix changes with the selection of , in implementation these constraints are implemented as lazy constraints to only be enforced when a feasible integer solution is reached, in a similar fashion to [3].
In interest of brevity, we defer computational experiments and present results when combined with the multicollinearity detection as illustrated below.
4 Multicollinearity Detection
Given data , we would like the design matrix to be free of multicollinear relationships so that is not very close to 0. We denote the columns of as , .
We introduce the vector into the design matrix as a new column (the intercept) and we can define the multicollinear relationship as:
Definition 1**.**
A set of variables has an -multicollinear relationship if for some , , we have that:
[TABLE]
The structure of this section is as follows:
We first establish the key result that connects the existence of an -multicollinear relationship (9) to the existence of an eigenvector for the matrix that has a small () eigenvalue. 2. 2.
Using the previous key result, we find multicollinear relations using information from the small eigenvalues of the matrix . We introduce the idea of a minimum-support multicollinear relationship. 3. 3.
We propose an algorithm that uses the theory from the previous steps to identify all the multicollinear relationships.
4.1 Key Result
In this section, we establish a connection between the existence of an -multicollinear relationship and the existence of a eigenvector for the matrix with a small () eigenvalue:
Theorem 2**.**
Let be the set of orthonormal eigenvectors of such that the eigenvalues associated with are less than . Then for , :
- (a)
If , then there exists a vector , such that . 2. (b)
If there exists a vector such that , then we have:
[TABLE]
where are the eigenvalues associated with the set of orthonormal eigenvectors of that have value greater or equal to .
Theorem 2 represents a weak equivalence between a small multicollinear relationship and the existence of a vector that is close to , in the sense that there exists a small vector with such that . The proof is as follows:
Proof.
- (a)
If , then every . Thus, we assume and prove part (a) by contradiction. We assume there exists no with such that . Let be the corresponding eigenvalues to eigenvectors . Note that we have , and . We write as:
[TABLE]
Letting , we have that by construction, which implies that:
[TABLE]
This implies that there exists a , such that . Now,
[TABLE]
We have
[TABLE]
Since , and , we have that , a contradiction.
- (b)
If , then . Note , since are orthonormal. Hence, for
[TABLE]
leading to We assume . We write as:
[TABLE]
and observe that:
[TABLE]
Since by assumption there exists a with and , the vector is a feasible solution to problem (10), and thus taking we have:
[TABLE]
leading to:
[TABLE]
Since
[TABLE]
Therefore, we have for all . Thus, we have:
[TABLE]
leading to as required.
∎
Theorem 2 implies that if we are able to describe , then we would be able to identify multicollinear relationships that exist in the design matrix , as Theorem 2(b) implies that every vector within distance away from represents a multicollinear relationship.
4.2 Identifying Multicollinear Relations
For , we have linearly independent multicollinear relationships. There are infinite number of ways the basis of the multicollinear relationships could be constructed, and different ways of constructing such bases lead to different constraints.
For example, assume that we have six variables , and we know that and . Letting and , we have Using Theorem 2 and ignoring as , we can identify the two multicollinear relationships as and . Then, we add the constraints
[TABLE]
to Model (1). However, there are alternative ways to characterize in terms of two linearly independent vectors. Letting and , then is also Given this representation of we would impose the constraints
[TABLE]
to Model (1). Note that the two sets of constraints are not equivalent.
It is therefore important to identify the characterization of that leads to the most stringent constraints to prevent multicollinearity. Towards this objective and ignoring the vector in Theorem 2 , we introduce the idea of identifying a vector that has minimum support. We first compute the set of orthonormal eigenvectors with corresponding eigenvalues less than . According to Theorem 2, all multicollinear relationships (up to a perturbation of ) are included in this space. Now, we want to find a vector of minimum support. This is computed as follows:
[TABLE]
Note that (4.2) can be modeled as Special Ordered Sets (SOS) of type 1, which does not need an explicit value of . In the experiments, however, we utilize the big formulation. We provide the following procedure for determining . We reformulate (4.2) to read as:
[TABLE]
Using the fact that ’s are orthonormal, we have:
[TABLE]
Taking , we can select . We note this is the tightest possible with equality at and .
Here is a positive constant that ensures that Once the vector has been identified, we add the constraint
[TABLE]
to Problem (1). To continue the process of identifying new linearly independent multicollinear relationships, we add Eq. (15) to Problem (11), resolve the problem to identify a new multicollinear relationship, add the corresponding constraint (15) to (1). We continue solving Problem (11) until the problem becomes infeasible, which means that we identified all linearly independent multicollinear relationships. Algorithm 1 determines all multicollinear relationships.
4.3 Computational Results
In this section, we use synthetic data to evaluate the performance of Algorithm 1.
We model the design matrix such that independently for each , . Then we randomly select certain number of columns to be replaced by linear combinations of other columns . The parameters are selected randomly from the uniform distribution , and we control as follows:
We first determine the number of variables we want to involve in this multicollinear relationship. 2. 2.
We randomly select numbers from without replacement, and denote that set .
We add noise according to the distribution indicated in Table 1, and evaluate the performance of Algorithm 1 on .
Algorithm 1 performance is evaluated on the accuracy and the false positive rate of the multicollinear relationships found, along with the time taken for the algorithm to converge.
In Table 1, indicates the number of multicollinear relationships involving variables that have been introduced into the data. For , we randomly selected a number within to be the number of variables involved in the multicollinear relationship. We created 10 random instances and report the average statistics across those 10 instances.
Table 1 shows that Algorithm 1 scales up to in thousands and could detect multicollinearity with high accuracy and low false positive rates.
5 Holistic Linear Regression Framework Evaluation
In this section, we combine the results of the previous two sections with the framework introduced in [1] on five different datasets randomly selected from the UCI Machine Learning Repository ([12]). We refer to our framework below as Holistic. The whole framework is:
[TABLE]
We select for the multicollinearity detection. We compare our formulation with the MISDONE formulation () as denoted in [7] and the full MIQO formulation (ignoring the alternative solution procedure) by [3]. We note here that although all of the algorithms implement subset selection and their objectives are similar (with the exception of Holistic regression having a regularization term), the algorithms differ in their formulation of the constraints. A brief table comparing the relevant constraints is presented below:
We see that compared to the Holistic framework, Tamura does not have an explicit significance constraint (though the condition number constraint to some extent helps select significant variables) and Chung’s framework does not have an explicit multicollinearity constraint, replacing it with a residual constraint.
We also used Lasso ([13]), and the framework in [1] (which we denote Bootstrap) as baselines. Samples were randomized and we utilized a for training, validation, and testing, where the validation set was used for tuning of hyperparameters. This includes the sparsity parameter in the Holistic, Bootstrap, and Chung’s framework (there is no explicit sparsity parameter in Tamura and Lasso) and in Lasso and the Holistic framework. We utilized a computer with a i7-5820k 6-core CPU and 16GB of DRAM for all our experiments. Julia 1.0 along with Gurobi 8.0 was used for the Holistic, Chung, Lasso, and the Bootstrap frameworks. Tamura’s framework is implemented using SCIP 6.0.0 along with SCIP-SDP 3.1.1 in accordance with the original paper and as Gurobi is unable to handle the semidefinite constraints in the formulation. The results are then compared across the following dimensions:
Sparsity () - Number of non-zero variables in the final selected model.
- 2.
Regression Loss (Loss) - Mean squared error on the test set.
- 3.
Significance - Percentage of non-zero coefficients in the model that are significant on the level using bootstrap to evaluate.
- 4.
Time () - Total time used by the model. Time spent detecting multicollinearity for the Holistic model is shown in brackets.
- 5.
Multicollinearity Accuracy (MA) - Let be the set as defined in Theorem 2 and be the corresponding set in the final model using . Then we calculate , the percentage of multicollinearity relations ”avoided” in the final model.
The results in Table 3 show that in real data situations, the entire framework could reasonably scale up to in and at least in , while both the MISDO and the MIQO framework scaled slower. In particular, the MISDO formulation failed to return a feasible solution for the largest datasets, and encountered numerical issues in the process (which was also identified in the original paper). The MISDO formulation also does not consider significance constraints, which was reflected in that some of the variables it selected were insignificant. The MIQO formulation scaled better compared to the MISDO formulation, but was still over x slower than holistic regression in all datasets. Compared with the holistic formulation which only has one set of lazy constraints based on significance, Chung’s MIQO formulation has two (significance and residual plots), and we conjecture such additional lazy constriants make it easier for the incumbent solution to be rejected and causes more lazy constraints to be enforced, slowing the runtime. Furthermore, the MIQO formulation does not explicitly model multicollinearity, and this meant the final model in the Parkinsons and the Crime dataset did not avoid more than of multicollinear relationships. In comparison, the holistic regression successfully detected all multicollinear relationships within the data and avoided choosing all variables within that relationship in the final result.We further conjecture that such explicit modeling of significance and multicollinearity constraints is also why the holistic framework usually selects the smallest number of variables, as it needs to satisfy more constraints on subset selection .
Compared to the Lasso baseline, the holistic framework achieved comparable loss with Lasso among most tasks, while using many fewer variables to do so (usually less than half), and all of the selected variables from the framework are significant at the level. The original bootstrap method proposed in [1] quickly timed out as increased, resulting in suboptimal performance.
The computational results suggest that the proposed holistic linear regression algorithm greatly increases its scalability in detecting significance and avoiding multicollinearity. Using both real and synthetic data, we have demonstrated that the approach produces high quality linear regression models in realistic timelines.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] D. Bertsimas and A. King, “An algorithmic approach to linear regression,” Operations Research , vol. 64, no. 1, pp. 2–16, 2016.
- 2[2] E. Carrizosaa, A. V. Olivares-Nadal, and P. Ramırez-Cobob, “Enhancing interpretability by tightening linear regression models,” tech. rep., Technical report, 2017.
- 3[3] S. Chung, Y. W. Park, and T. Cheong, “A mathematical programming approach for integrated multiple linear regression subset selection and validation,” ar Xiv preprint ar Xiv:1712.04543 , 2017.
- 4[4] R. R. Hocking, “A biometrics invited paper. the analysis and selection of variables in linear regression,” Biometrics , vol. 32, no. 1, pp. 1–49, 1976.
- 5[5] E. R. Mansfield and B. P. Helms, “Detecting multicollinearity,” The American Statistician , vol. 36, no. 3a, pp. 158–160, 1982.
- 6[6] R. Tamura, K. Kobayashi, Y. Takano, R. Miyashiro, K. Nakata, and T. Matsui, “Mixed integer quadratic optimization formulations for eliminating multicollinearity based on variance inflation factor. optimization online,” Optimization Online , 2016.
- 7[7] R. Tamura, K. Kobayashi, Y. Takano, R. Miyashiro, K. Nakata, and T. Matsui, “Best subset selection for eliminating multicollinearity,” Journal of the Operations Research Society of Japan , vol. 60, no. 3, pp. 321–336, 2017.
- 8[8] R. M. O’brien, “A caution regarding rules of thumb for variance inflation factors,” Quality & quantity , vol. 41, no. 5, pp. 673–690, 2007.
