Scalable Holistic Linear Regression

Dimitris Bertsimas; Michael Lingzhi Li

arXiv:1902.03272·stat.ML·March 5, 2020


TL;DR

This paper introduces a scalable holistic linear regression algorithm that models significance and multicollinearity as lazy constraints, significantly improving scalability and accuracy over previous methods.

Contribution

The paper presents a novel theory and algorithm that enhance scalability and performance of holistic linear regression by modeling key conditions as lazy constraints.

Findings

1. Scales to tens of thousands of samples, compared with the low hundreds for the previous framework.

2. Improves accuracy and reduces false detection rate.

3. Reduces computational time significantly.

Abstract

We propose a new scalable algorithm for holistic linear regression building on Bertsimas & King (2016). Specifically, we develop new theory to model significance and multicollinearity as lazy constraints rather than checking the conditions iteratively. The resulting algorithm scales with the number of samples $n$ in the 10,000s, compared to the low 100s in the previous framework. Computational results on real and synthetic datasets show that it greatly improves on previous algorithms in accuracy, false detection rate, computational time, and scalability.

Tables

Table 1: Performance of Algorithm 1 for multicollinearity detection.

$n$	$p$	$MR(3)$	$MR(4)$	$MR(4+)$	Noise	ACC	FPR	Time
1000	100	3	1	1	$N(0, 0.01)$	100%	0%	0.27s
1000	500	3	1	1	$N(0, 0.01)$	100%	0%	2.37s
1000	1000	3	1	1	$N(0, 0.01)$	100%	5%	20.23s
1000	500	5	3	2	$N(0, 0.01)$	100%	0%	33.40s
1000	1000	5	3	2	$N(0, 0.01)$	100%	24%	5940.56s
1000	500	3	1	1	$N(0, 0.03)$	100%	0%	2.29s
1000	1000	3	1	1	$N(0, 0.03)$	100%	11%	32.17s

Table 2: Comparison of constraints within the Holistic, [7], and [3] frameworks.

Constraint Type	Holistic	[7]	[3]
Subset Selection	Big-$M$	Condition Number-based	Big-$M$
Significance	Normality-based	None	$t$-test based
Multicollinearity	Explicit	Condition Number-based	None
Residuals	None	None	Absolute & Breusch-Pagan test

Table 3: Comparison of the holistic framework with Lasso for five real world data sets. A "-" means that there are no true multicollinear relationships in the data. N/A means the algorithm did not return a feasible solution within 60000s.

Dataset	$n$	$p$	Method	$k$	Loss	Sign.	$T$	MA
Airfoil	1502	5	Holistic	3	558	100%	50s	-
			Tamura	4	562	75%	19s	-
			Chung	3	558	100%	107s	-
			Lasso	4	570	75%	7s	-
			Bootstrap	4	564	75%	39870s	-
Cancer	568	29	Holistic	7	1.71	100%	54s	-
			Tamura	11	1.90	63%	410s	-
			Chung	7	1.71	100%	310s	-
			Lasso	23	0.72	31%	10s	-
			Bootstrap	12	2.22	60%	60000s	-
Parkinsons	5875	16	Holistic	1	533	100%	403s (15.2s)	100%
			Tamura	1	533	100%	60000s	100%
			Chung	3	549	100%	876s	50%
			Lasso	3	522	33%	14s	50%
			Bootstrap	3	571	33%	60000s	50%
Air Quality	9358	12	Holistic	4	89.2	100%	380s	-
			Tamura	N/A	N/A	N/A	N/A	-
			Chung	6	96.1	100%	770s	-
			Lasso	9	83.7	33%	11s	-
			Bootstrap	5	89.6	80%	58146s	-
Crime	2215	125	Holistic	9	180	100%	725s (120.3s)	100%
			Tamura	N/A	N/A	N/A	N/A	N/A
			Chung	11	207	100%	1506s	40%
			Lasso	19	172	47%	21s	20%
			Bootstrap	12	195	74%	60000s	20%

Equations

$$\min_{\bm{\beta},\,\bm{z}}$$

subject to:

$$\sum_{i=1}^{p} z_{i} \leq k$$

$$z_{i} = z_{j}, \quad \forall i, j \in GS_{m},\ \forall m$$

$$z_{i} + z_{j} \leq 1, \quad \forall i, j \in HC$$

$$HC = \{(i, j) : |\operatorname{Corr}(x_{i}, x_{j})| \geq \rho\}$$

$$\sum_{i \in S} z_{i} \leq |S| - 1$$

$$\bm{Y} = \bm{X}\bm{\beta} + \bm{\epsilon}.$$

$$\sqrt{n}\,(\hat{\bm{\beta}} - \bm{\beta}) \xrightarrow{d} N(\bm{0}, \sigma^{2}\bm{Q}^{-1})$$

$$\hat{\bm{\beta}} = (\bm{X}^{T}\bm{X})^{-1}\bm{X}^{T}\bm{Y}.$$

$$\frac{\hat{\beta}_{j} - \beta_{j}}{\tilde{\sigma}\sqrt{K_{jj}}} \sim N(0, 1).$$

$$\tilde{\sigma} = \sqrt{\frac{\bm{Y}^{T}(\bm{I}_{n} - \bm{X}(\bm{X}^{T}\bm{X})^{-1}\bm{X}^{T})\bm{Y}}{n - p}}$$

$$\frac{|\beta_{j}|}{\tilde{\sigma}\sqrt{(\bm{X}_{\bm{z}}^{T}\bm{X}_{\bm{z}})^{-1}_{jj}}} \geq N_{sign} z_{j},$$

$$\frac{\beta_{j}}{\tilde{\sigma}\sqrt{(\bm{X}_{\bm{z}}^{T}\bm{X}_{\bm{z}})^{-1}_{jj}}} + M b_{j} \geq N_{sign} z_{j},$$

$$-\frac{\beta_{j}}{\tilde{\sigma}\sqrt{(\bm{X}_{\bm{z}}^{T}\bm{X}_{\bm{z}})^{-1}_{jj}}} + M (1 - b_{j}) \geq N_{sign} z_{j},$$

$$b_{j} \in \{0, 1\},$$

$$\left\|\sum_{j=1}^{p}a_{j}\mathbf{X}_{j}\right\|<\epsilon.$$

$$\left\|\sum_{j=1}^{p}a_{j}\mathbf{X}_{j}\right\| < \sqrt{(1 + \lambda_{m+1} + \dots + \lambda_{p})\epsilon},$$

$$\mathbf{a} = \alpha_{1}\mathbf{v}_{1} + \dots + \alpha_{p}\mathbf{v}_{p}.$$

$$\|\mathbf{b}\| = \|\alpha_{m+1}\mathbf{v}_{m+1} + \dots + \alpha_{p}\mathbf{v}_{p}\| \geq (p - m)\sqrt{\epsilon}.$$

$$\left\|\sum_{j=1}^{p}a_{j}\mathbf{X}_{j}\right\| = \|\bm{X}\mathbf{a}\| = \|\alpha_{1}\bm{X}\mathbf{v}_{1} + \dots + \alpha_{p}\bm{X}\mathbf{v}_{p}\| < \epsilon.$$

$$\begin{array}{rcl}\epsilon^{2}&>&\|\bm{X}\mathbf{a}\|^{2}\\ &=&(\alpha_{1}\bm{X}\mathbf{v}_{1} + \dots + \alpha_{p}\bm{X}\mathbf{v}_{p})^{T}(\alpha_{1}\bm{X}\mathbf{v}_{1} + \dots + \alpha_{p}\bm{X}\mathbf{v}_{p})\\ &=&\alpha_{1}^{2}\lambda_{1} + \dots + \alpha_{p}^{2}\lambda_{p}\\ &\geq&\alpha_{j_{0}}^{2}\lambda_{j_{0}}.\end{array}$$

$$\begin{array}{rcl}\|\bm{X}\mathbf{a}\|^{2}&=&\mathbf{a}^{T}\bm{X}^{T}\bm{X}\mathbf{a}\\ &=&\displaystyle\left(\sum_{j=1}^{p}a_{j}\mathbf{v}_{j}\right)^{T}\bm{X}^{T}\bm{X}\left(\sum_{j=1}^{p}a_{j}\mathbf{v}_{j}\right)\\ &=&\displaystyle\left(\sum_{j=1}^{p}a_{j}\mathbf{v}_{j}\right)^{T}\left(\sum_{j=1}^{p}a_{j}\lambda_{j}\mathbf{v}_{j}\right)\\ &=&\displaystyle\sum_{j=1}^{p}\lambda_{j}a_{j}^{2}\\ &<&\epsilon\|\mathbf{a}\|^{2}=\epsilon.\end{array}$$

$$\min_{\mathbf{u} \in \operatorname{Span}(V)} \|\mathbf{a} - \mathbf{u}\| = \|\alpha_{m+1}\mathbf{v}_{m+1} + \dots + \alpha_{p}\mathbf{v}_{p}\|.$$

$$\|\mathbf{a} - (\mathbf{a} - \mathbf{b})\| = \|\mathbf{b}\| \geq \|\alpha_{m+1}\mathbf{v}_{m+1} + \dots + \alpha_{p}\mathbf{v}_{p}\|,$$

$$\|\alpha_{m+1}\mathbf{v}_{m+1} + \dots + \alpha_{p}\mathbf{v}_{p}\| < \sqrt{\epsilon}.$$

$$\|\alpha_{m+1}\mathbf{v}_{m+1} + \dots + \alpha_{p}\mathbf{v}_{p}\|^{2} = \sum_{j=m+1}^{p}\alpha_{j}^{2} < \epsilon,$$

$$\mathbf{a}^{T}\bm{X}^{T}\bm{X}\mathbf{a}$$

$$z_{1} + z_{2} + z_{3} \leq 2, \quad z_{4} + z_{5} + z_{6} \leq 2$$

$$z_{1} + z_{2} + z_{3} + z_{4} + z_{5} + z_{6} \leq 4$$


Full text

Scalable Holistic Linear Regression

Dimitris Bertsimas

Sloan School of Management and Operations Research Center, Massachusetts Institute of Technology, Cambridge, MA 02139

Michael Lingzhi Li

Operations Research Center, Massachusetts Institute of Technology, Cambridge, MA 02139

Abstract

We propose a new scalable algorithm for holistic linear regression building on Bertsimas & King (2016). Specifically, we develop new theory to model significance and multicollinearity as lazy constraints rather than checking the conditions iteratively. The resulting algorithm scales with the number of samples $n$ in the 10,000s, compared to the low 100s in the previous framework. Computational results on real and synthetic datasets show that it greatly improves on previous algorithms in accuracy, false detection rate, computational time, and scalability.

keywords:

Holistic Linear Regression, Multicollinearity and Significance in Linear Regression, Mixed-Integer Optimization

Journal: Operations Research Letters

1 Introduction

In this paper, we continue the research program initiated in [1] to develop an algorithmic approach for holistic linear regression in which we impose desirable properties simultaneously and a priori. Using mixed integer optimization (MIO) the earlier proposal modeled sparsity, pairwise collinearity and group sparsity using explicit constraints but accounted for significance and multicollinearity through a cutting plane and bootstrap approach. The difficulty of using the cutting plane method is that it often requires a large number of iterations to ensure that the model has appropriate significance and does not exhibit multicollinearity. This results in an algorithm that does not scale beyond $n$ , the number of samples, in the low 100s when accounting for significance and multicollinearity.

Our goal in this paper is to propose a new scalable algorithm for holistic linear regression. We propose a new way to impose significance and multicollinearity constraints explicitly that scales with $n$ in the 10,000s. This allows us to build linear regression models much more effectively and accurately than in earlier works.

In our view, scalable holistic regression is important as it allows linear regression models to have interpretability, robustness, significance and accuracy in a systematic way. In contrast, today the practice of regression is more of an art than a science. Continuing on the vision in [1], the paper aspires to scale holistic regression further and make these methods easier to use in much larger problems.

The standard methodology for imposing significance in linear regression is to use the Student $t$-statistic. However, the test is carried out after the linear regression model has been calculated, and does not optimally select a subset of covariates that are significant a priori. In the [1] framework, summarized in Section 2, significance is imposed iteratively, leading to a cutting plane algorithm. [2] explored significance of coefficients by adding heuristic constraints to set lower bounds on the coefficients. [3] used lazy constraints to ensure exact significance tests while deriving theoretical bounds for minimum power. In contrast to [2] and similar to [3], we use lazy constraints to ensure minimum power. However, rather than the $t$-statistic, we appeal to asymptotic normality results.

For multicollinearity, in a landmark paper [4] comprehensively reviewed the problem and concluded that there is no accepted way of dealing with this problem, citing “there is a lack of attention for this problem in the statistics community.” Various methods employed include principal component analysis to select the top $k$ variables to avoid multicollinear combinations, and variance inflation factors [5] that provide a numerical quantity to determine how much the variance of a coefficient has been increased due to correlation with other variables. [6, 7] explored incorporating multicollinearity constraints using variance inflation factors (VIFs) and condition numbers (CNs) respectively. However, both of these concepts are only approximations of true multicollinear relationships. It is true that multicollinear relationships are sufficient for high VIF and CN, but they are not necessary, as shown in [8] and [9], respectively. That means constraining on VIF or CN would potentially produce extra constraints that are not needed for solving multicollinearity. In this paper, we introduce new theory that provides both necessary and sufficient guarantees in relation to detecting multicollinearity.

Specifically, our contributions in this paper are as follows:

1. We continue the program in [1] and extend the formulation with significance constraints a priori.

2. We develop a new theory of detecting multicollinearity by connecting multicollinearity to the eigenvectors of the matrix $\bm{X}^{T}\bm{X}$, where $\bm{X}$ is the $n\times p$ matrix of the given data, and use it to impose multicollinearity constraints within an MIO framework.

3. We present computational results on real and synthetic datasets that suggest the overall algorithm for holistic regression scales with $n$ in the 10,000s, while the method in [1] scales with $n$ in the low 100s when accounting for significance and multicollinearity.

The structure of the paper is as follows. In Section 2, we review the work in [1] on constructing a holistic framework for linear regression. In Section 3, we introduce a normality-based formulation to model significance. In Section 4, we introduce a new formulation to model multicollinearity and present computational results with synthetic data that show the effectiveness of the method. In Section 5, we combine our proposals with the framework introduced in [1], and compare its performance with recent models in the literature using real-world data.

2 The Framework of [1]

Given data $(x_{i},y_{i})$ , $i=1,\cdots,n$ , $x_{i}\in\mathbb{R}^{p}$ , $y_{i}\in\mathbb{R}$ , [1] propose the following MIO:

[TABLE]

The term $\Gamma\|{\bm{\beta}}\|_{1}$ in the objective function (1) models robustness as seen in [10], who established the equivalence of the $\ell_{1}$ penalization and robustness.

Constraints (2) and (3) model sparsity using the Big-$M$ framework, so that at most $k$ out of the $p$ variables are selected in the linear regression model. In this paper, the specification of $M$ follows that in [1]. Constraint (4) models group sparsity, i.e., variables in the set $GS_{m}$ are either all selected or none is selected. Finally, pairwise collinearity is modeled in Constraint (5), where $HC$ is the set

$$HC=\{(i,j):|\operatorname{Corr}(x_{i},x_{j})|\geq\rho\}$$

for some predefined correlation $\rho$ cutoff.

[1] apply the following iterative process to include constraints for significance and multicollinearity:

1. Solve the MIO (1)-(2) to obtain a subset $S$ of the coefficients $\{\beta_{1},\cdots,\beta_{p}\}$.

2. For the set $S$, the algorithm computes the significance levels for each of the variables via bootstrap methods, and calculates the condition number of the model. If a set $S$ produces undesirable results (a condition number higher than desired, or a model with insignificant variables), the algorithm generates the constraint

$$\sum_{i\in S}z_{i}\leq|S|-1$$

to exclude the set $S$ from consideration. The algorithm adds the constraint to Problem (1) and repeats the process until no such set $S$ is found.
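As a small illustration of what this exclusion constraint does, the following sketch enumerates all binary selection vectors for a hypothetical problem with $p=4$ and a rejected set $S=\{0,2\}$ (zero-based indices, chosen only for this example). It is not code from [1]; the helper name `violates_cut` is an assumption, and the sketch only checks which selections the cut removes.

```python
from itertools import product


def violates_cut(z, S):
    """True if the selection z breaks the cut sum_{i in S} z_i <= |S| - 1."""
    return sum(z[i] for i in S) > len(S) - 1


S = (0, 2)  # hypothetical rejected subset, for illustration only
excluded = [z for z in product((0, 1), repeat=4) if violates_cut(z, S)]
```

The cut removes $S$ itself and every superset of $S$, and leaves every other selection feasible, which is why many iterations may be needed before an acceptable subset is found.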

[1] report computational results that demonstrate that Model (1) is effective for solving problems with $n,p$ in the 1000s. However, when we include significance and multicollinearity constraints in a cutting plane methodology, the method scales only up to $n,p$ in the low 100s, and sometimes no solution is found after considerable computation time.

3 Imposing Significance Constraints

Variable significance has long been one of the most important elements in linear regression, and has served as a proxy for variable selection and causality studies.

We first restate a standard result about the asymptotic guarantee of the normality of the least squares estimate of ${\bm{\beta}}$ to serve as the basis of our approach. For a linear regression problem:

$$\bm{Y}=\bm{X}\bm{\beta}+\bm{\epsilon}.$$

We have the following theorem, as proven in [11]:

Theorem 1.

If ${\bm{\epsilon}}$ is iid with $\mathbb{E}[\epsilon_{i}]=0$ and $\mathbb{E}[\epsilon_{i}^{2}]=\sigma^{2}$ for all $i$ , and $\lim_{n\to\infty}\frac{\mathbf{X}^{T}\mathbf{X}}{n}=\mathbf{Q}$ is invertible, then we have:

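$$\sqrt{n}\,(\hat{\bm{\beta}}-\bm{\beta})\xrightarrow{d}N(\bm{0},\sigma^{2}\bm{Q}^{-1}),$$

where $\hat{{\bm{\beta}}}$ is the least squares estimate of ${\bm{\beta}}$ with

$$\hat{\bm{\beta}}=(\bm{X}^{T}\bm{X})^{-1}\bm{X}^{T}\bm{Y}.$$

Note that normality of the errors is not part of the assumption here, in contrast to the $t$-test statistics used in [3]. Therefore, using such asymptotic results, we assume that, when $n$ is large enough,

$$\frac{\hat{\beta}_{j}-\beta_{j}}{\tilde{\sigma}\sqrt{K_{jj}}}\sim N(0,1),$$

where $\bm{K}=(\bm{X}^{T}\bm{X})^{-1}$ and

$$\tilde{\sigma}=\sqrt{\frac{\bm{Y}^{T}(\bm{I}_{n}-\bm{X}(\bm{X}^{T}\bm{X})^{-1}\bm{X}^{T})\bm{Y}}{n-p}}$$

is the least squares estimate of the standard deviation $\sigma$.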

3.1 Constructing Significance Constraints

For a test of size $\alpha$ , we first define the quantity $N_{sign}=\Phi^{-1}(1-\frac{\alpha}{2})$ , the inverse cdf of the $N(0,1)$ distribution at point $1-\frac{\alpha}{2}$ . Then, we can impose the normality test by requiring:

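$$\frac{|\beta_{j}|}{\tilde{\sigma}\sqrt{(\bm{X}_{\bm{z}}^{T}\bm{X}_{\bm{z}})^{-1}_{jj}}}\geq N_{sign}z_{j},$$

where $\bm{X}_{\bm{z}}$ is the model matrix restricted to the columns where $z_{i}=1$. This is equivalent to the big-$M$ constraints:

$$\begin{array}{rcl}\displaystyle\frac{\beta_{j}}{\tilde{\sigma}\sqrt{(\bm{X}_{\bm{z}}^{T}\bm{X}_{\bm{z}})^{-1}_{jj}}}+Mb_{j}&\geq&N_{sign}z_{j},\\ \displaystyle-\frac{\beta_{j}}{\tilde{\sigma}\sqrt{(\bm{X}_{\bm{z}}^{T}\bm{X}_{\bm{z}})^{-1}_{jj}}}+M(1-b_{j})&\geq&N_{sign}z_{j},\\ b_{j}&\in&\{0,1\},\end{array}$$

where $M$ is a large constant and the binary variable $b_{j}$ determines which of the two inequalities is binding. These two constraints are used to model significance at level $\alpha$ without the need for the bootstrap. As the model matrix $\bm{X}_{\bm{z}}$ changes with the selection of $\bm{z}$, in practice these constraints are implemented as lazy constraints that are enforced only when a feasible integer solution is reached, in a similar fashion to [3].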
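The check that such a lazy constraint performs on a candidate subset can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' Julia/Gurobi implementation: the function name `significance_violations` is hypothetical, the coefficients are refitted by ordinary least squares on the selected columns, and $n-p$ in the variance estimate is taken to be $n$ minus the number of selected columns.

```python
import numpy as np
from statistics import NormalDist


def significance_violations(X, y, support, alpha=0.05):
    """Return the indices in `support` whose normal statistic is below N_sign."""
    support = list(support)
    Xz = X[:, support]                          # columns with z_i = 1
    n, k = Xz.shape
    K = np.linalg.inv(Xz.T @ Xz)                # (X_z^T X_z)^{-1}
    beta = K @ Xz.T @ y                         # least squares fit on the subset
    resid = y - Xz @ beta
    sigma = np.sqrt(resid @ resid / (n - k))    # estimate of the noise std
    stat = np.abs(beta) / (sigma * np.sqrt(np.diag(K)))
    n_sign = NormalDist().inv_cdf(1 - alpha / 2)  # Phi^{-1}(1 - alpha/2)
    return [j for j, s in zip(support, stat) if s < n_sign]


# Synthetic check: only columns 0 and 1 carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=2000)
violations = significance_violations(X, y, [0, 1, 2, 3])
```

In a solver callback, a non-empty list of violations for an incumbent $\bm{z}$ would be the trigger for adding the corresponding constraints; how that callback is wired up depends on the solver and is not shown here.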

In the interest of brevity, we defer computational experiments and present results when combined with the multicollinearity detection, as described below.

4 Multicollinearity Detection

Given data $X$ , we would like the design matrix to be free of multicollinear relationships so that $\det(\bm{X}^{T}\bm{X})$ is not very close to 0. We denote the columns of $\bm{X}$ as $\mathbf{X}_{j}$ , $j=1,\cdots,p$ .

We introduce the vector $(1,\cdots,1)^{T}$ into the design matrix as a new column (the intercept) and we can define the multicollinear relationship as:

Definition 1.

A set of variables $\mathbf{X}_{1},\cdots,\mathbf{X}_{p}$ has an $\epsilon$-multicollinear relationship if for some $\mathbf{a}\in\mathbb{R}^{p}$, $\|\mathbf{a}\|=1$, we have that:

$$\left\|\sum_{j=1}^{p}a_{j}\mathbf{X}_{j}\right\|<\epsilon.\tag{9}$$

The structure of this section is as follows:

1. We first establish the key result that connects the existence of an $\epsilon$-multicollinear relationship (9) to the existence of an eigenvector $\mathbf{v}$ for the matrix $\bm{X}^{T}\bm{X}$ that has a small ($O(\sqrt{\epsilon})$) eigenvalue.

2. Using the previous key result, we find multicollinear relations $(\mathbf{a}=(a_{1},\cdots,a_{p}))$ using information from the small eigenvalues of the matrix $\bm{X}^{T}\bm{X}$. We introduce the idea of a minimum-support multicollinear relationship.

3. We propose an algorithm that uses the theory from the previous steps to identify all the multicollinear relationships.

4.1 Key Result

In this section, we establish a connection between the existence of an $\epsilon$-multicollinear relationship and the existence of an eigenvector $\mathbf{v}$ for the matrix $\bm{X}^{T}\bm{X}$ with a small ($O(\sqrt{\epsilon})$) eigenvalue:

Theorem 2.

Let $V=\{\mathbf{v}_{1},\ldots,\mathbf{v}_{m}\}$ be the set of orthonormal eigenvectors of $\bm{X}^{T}\bm{X}\in\mathbb{R}^{p\times p}$ such that the eigenvalues associated with $V$ are less than $\epsilon$ . Then for $\mathbf{a}\in\mathbb{R}^{p}$ , $\|\mathbf{a}\|=1$ :

(a) If $\left\|\sum_{j=1}^{p}a_{j}\mathbf{X}_{j}\right\|<\epsilon$, then there exists a vector $\mathbf{b}\in\mathbb{R}^{p}$, $\|\mathbf{b}\|<(p-m)\sqrt{\epsilon}$, such that $\mathbf{a}-\mathbf{b}\in\operatorname{Span}(V)$.

(b) If there exists a vector $\mathbf{b}\in\mathbb{R}^{p}$, $\|\mathbf{b}\|<\sqrt{\epsilon}$, such that $\mathbf{a}-\mathbf{b}\in\operatorname{Span}(V)$, then we have:

$$\left\|\sum_{j=1}^{p}a_{j}\mathbf{X}_{j}\right\|<\sqrt{(1+\lambda_{m+1}+\dots+\lambda_{p})\epsilon},$$

where $\lambda_{m+1},\ldots,\lambda_{p}$ are the eigenvalues of $\bm{X}^{T}\bm{X}$, associated with its remaining orthonormal eigenvectors, that are greater than or equal to $\epsilon$.

Theorem 2 represents a weak equivalence between a small multicollinear relationship and the existence of a vector $\mathbf{a}$ that is close to $\operatorname{Span}(V)$ , in the sense that there exists a small vector $\mathbf{b}$ with $\|\mathbf{b}\|<O(\sqrt{\epsilon})$ such that $\mathbf{a}-\mathbf{b}\in\operatorname{Span}(V)$ . The proof is as follows:

Proof.

(a)

If $m=p$ , then every $\mathbf{a}\in\operatorname{Span}(V)$ . Thus, we assume $m<p$ and prove part (a) by contradiction. We assume there exists no $\mathbf{b}\in\mathbb{R}^{p}$ with $\|\mathbf{b}\|<(p-m)\sqrt{\epsilon}$ such that $\mathbf{a}-\mathbf{b}\in\operatorname{Span}(V)$ . Let $\lambda_{1},\ldots,\lambda_{p}$ be the corresponding eigenvalues to eigenvectors $\mathbf{v}_{1},\ldots,\mathbf{v}_{p}$ . Note that we have $0\leq\lambda_{1},\ldots,\lambda_{m}<\epsilon$ , and $\epsilon\leq\lambda_{m+1},\ldots,\lambda_{p}$ . We write $\mathbf{a}$ as:

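$$\mathbf{a}=\alpha_{1}\mathbf{v}_{1}+\dots+\alpha_{p}\mathbf{v}_{p}.$$

Letting $\mathbf{b}=\alpha_{m+1}\mathbf{v}_{m+1}+\ldots+\alpha_{p}\mathbf{v}_{p}$, we have that $\mathbf{a}-\mathbf{b}\in\operatorname{Span}(V)$ by construction, which implies that:

$$\|\mathbf{b}\|=\|\alpha_{m+1}\mathbf{v}_{m+1}+\dots+\alpha_{p}\mathbf{v}_{p}\|\geq(p-m)\sqrt{\epsilon}.$$

Since $\|\mathbf{b}\|\leq\sum_{j=m+1}^{p}|\alpha_{j}|$, this implies that there exists an $\alpha_{j_{0}}$, $j_{0}\in\{m+1,\cdots,p\}$, such that $|\alpha_{j_{0}}|\geq\sqrt{\epsilon}$. Now,

$$\left\|\sum_{j=1}^{p}a_{j}\mathbf{X}_{j}\right\|=\|\bm{X}\mathbf{a}\|=\|\alpha_{1}\bm{X}\mathbf{v}_{1}+\dots+\alpha_{p}\bm{X}\mathbf{v}_{p}\|<\epsilon.$$

Since $\bm{X}^{T}\bm{X}\mathbf{v}_{j}=\lambda_{j}\mathbf{v}_{j}$ and the $\mathbf{v}_{j}$ are orthonormal, we have

$$\begin{array}{rcl}\epsilon^{2}&>&\|\bm{X}\mathbf{a}\|^{2}\\ &=&(\alpha_{1}\bm{X}\mathbf{v}_{1}+\dots+\alpha_{p}\bm{X}\mathbf{v}_{p})^{T}(\alpha_{1}\bm{X}\mathbf{v}_{1}+\dots+\alpha_{p}\bm{X}\mathbf{v}_{p})\\ &=&\alpha_{1}^{2}\lambda_{1}+\dots+\alpha_{p}^{2}\lambda_{p}\\ &\geq&\alpha_{j_{0}}^{2}\lambda_{j_{0}}.\end{array}$$

Since $|\alpha_{j_{0}}|\geq\sqrt{\epsilon}$, and $\lambda_{j_{0}}\geq\epsilon$, we have that $\epsilon^{2}>\alpha_{j_{0}}^{2}\lambda_{j_{0}}\geq\epsilon^{2}$, a contradiction.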

(b)

If $m=p$ , then $\mathbf{a}=\sum_{j=1}^{p}a_{j}\mathbf{v}_{j}$ . Note $\|\mathbf{a}\|^{2}=\sum_{j=1}^{p}a_{j}^{2}$ , since $\mathbf{v}_{j}$ are orthonormal. Hence, for $\|\mathbf{a}\|=1$

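$$\begin{array}{rcl}\|\bm{X}\mathbf{a}\|^{2}&=&\mathbf{a}^{T}\bm{X}^{T}\bm{X}\mathbf{a}\\ &=&\displaystyle\left(\sum_{j=1}^{p}a_{j}\mathbf{v}_{j}\right)^{T}\bm{X}^{T}\bm{X}\left(\sum_{j=1}^{p}a_{j}\mathbf{v}_{j}\right)\\ &=&\displaystyle\left(\sum_{j=1}^{p}a_{j}\mathbf{v}_{j}\right)^{T}\left(\sum_{j=1}^{p}a_{j}\lambda_{j}\mathbf{v}_{j}\right)\\ &=&\displaystyle\sum_{j=1}^{p}\lambda_{j}a_{j}^{2}\\ &<&\epsilon\|\mathbf{a}\|^{2}=\epsilon,\end{array}$$

leading to $\|\bm{X}\mathbf{a}\|<\sqrt{\epsilon}.$ Now assume $m<p$. We write $\mathbf{a}$ as:

$$\mathbf{a}=\alpha_{1}\mathbf{v}_{1}+\dots+\alpha_{p}\mathbf{v}_{p}$$

and observe that:

$$\min_{\mathbf{u}\in\operatorname{Span}(V)}\|\mathbf{a}-\mathbf{u}\|=\|\alpha_{m+1}\mathbf{v}_{m+1}+\dots+\alpha_{p}\mathbf{v}_{p}\|.\tag{10}$$

Since by assumption there exists a $\mathbf{b}$ with $\|\mathbf{b}\|<\sqrt{\epsilon}$ and $\mathbf{a}-\mathbf{b}\in\operatorname{Span}(V)$, the vector $\mathbf{a}-\mathbf{b}$ is a feasible solution to problem (10), and thus taking $\mathbf{u}=\mathbf{a}-\mathbf{b}$ we have:

$$\|\mathbf{a}-(\mathbf{a}-\mathbf{b})\|=\|\mathbf{b}\|\geq\|\alpha_{m+1}\mathbf{v}_{m+1}+\dots+\alpha_{p}\mathbf{v}_{p}\|,$$

leading to:

$$\|\alpha_{m+1}\mathbf{v}_{m+1}+\dots+\alpha_{p}\mathbf{v}_{p}\|<\sqrt{\epsilon}.$$

Since

$$\|\alpha_{m+1}\mathbf{v}_{m+1}+\dots+\alpha_{p}\mathbf{v}_{p}\|^{2}=\sum_{j=m+1}^{p}\alpha_{j}^{2}<\epsilon,$$

we have $|\alpha_{j}|<\sqrt{\epsilon}$ for all $j=m+1,\ldots,p$. Thus, using $\lambda_{j}<\epsilon$ for $j\leq m$ and $\sum_{j=1}^{m}\alpha_{j}^{2}\leq\|\mathbf{a}\|^{2}=1$, we have:

$$\begin{array}{rcl}\mathbf{a}^{T}\bm{X}^{T}\bm{X}\mathbf{a}&=&\displaystyle\sum_{j=1}^{m}\alpha_{j}^{2}\lambda_{j}+\sum_{j=m+1}^{p}\alpha_{j}^{2}\lambda_{j}\\ &<&\displaystyle\epsilon\sum_{j=1}^{m}\alpha_{j}^{2}+\epsilon\sum_{j=m+1}^{p}\lambda_{j}\\ &\leq&(1+\lambda_{m+1}+\dots+\lambda_{p})\epsilon,\end{array}$$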

leading to $\|\bm{X}\mathbf{a}\|\leq\sqrt{(1+\lambda_{m+1}+\cdots+\lambda_{p})\epsilon}$ as required.

∎

Theorem 2 implies that if we are able to describe $\operatorname{Span}(V)$ , then we would be able to identify multicollinear relationships $\mathbf{a}$ that exist in the design matrix $\bm{X}$ , as Theorem 2(b) implies that every vector within $\sqrt{\epsilon}$ distance away from $\operatorname{Span}(V)$ represents a $O(\sqrt{\epsilon})$ multicollinear relationship.
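A minimal numerical sketch of this connection is given below, assuming NumPy and a small synthetic design in which the fifth column is, up to tiny noise, the sum of the first two. The function name `small_eigvecs`, the threshold, and the data are assumptions made for illustration rather than part of the paper's experiments.

```python
import numpy as np


def small_eigvecs(X, eps):
    """Orthonormal eigenvectors of X^T X with eigenvalue below eps (as columns)."""
    eigvals, eigvecs = np.linalg.eigh(X.T @ X)   # eigenvalues in ascending order
    return eigvecs[:, eigvals < eps], eigvals


rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
X[:, 4] = X[:, 0] + X[:, 1] + 1e-4 * rng.normal(size=500)  # planted relation

V, eigvals = small_eigvecs(X, eps=1e-2)
a = V[:, 0]                                   # unit vector spanning Span(V)
residual = np.linalg.norm(X @ a)              # norm of sum_j a_j X_j
support = np.flatnonzero(np.abs(a) > 0.1)     # variables in the relationship
```

Here `Span(V)` is one-dimensional, the eigenvector `a` has sizeable entries only on the three columns in the planted relation, and `residual` is far below $\sqrt{\epsilon}$, which is the behaviour Theorem 2 describes.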

4.2 Identifying Multicollinear Relations

For $\dim(V)=r$, we have $r-1$ linearly independent multicollinear relationships. There are infinitely many ways in which a basis of the $r-1$ multicollinear relationships could be constructed, and different ways of constructing such bases lead to different constraints.

For example, assume that we have six variables $x_{1},x_{2},x_{3},x_{4},x_{5},x_{6}$, and we know that $x_{1}+x_{2}=x_{3}$ and $x_{4}+x_{5}=x_{6}$. Letting $\mathbf{a}_{1}=(1,1,-1,0,0,0)^{T}$ and $\mathbf{a}_{2}=(0,0,0,1,1,-1)^{T}$, we have $V=\operatorname{Span}(\mathbf{a}_{1},\mathbf{a}_{2}).$ Using Theorem 2 and ignoring $\mathbf{b}$ as $\|\mathbf{b}\|=O(\sqrt{\epsilon})$, we can identify the two multicollinear relationships as $\mathbf{a}_{1}$ and $\mathbf{a}_{2}$. Then, we add the constraints

$$z_{1}+z_{2}+z_{3}\leq 2,\quad z_{4}+z_{5}+z_{6}\leq 2$$

to Model (1). However, there are alternative ways to characterize $V$ in terms of two linearly independent vectors. Letting $\overline{\mathbf{a}}_{1}=(1,1,-1,1,1,-1)^{T}$ and $\overline{\mathbf{a}}_{2}=(1,1,-1,-1,-1,1)^{T}$, then $V$ is also $V=\operatorname{Span}(\overline{\mathbf{a}}_{1},\overline{\mathbf{a}}_{2}).$ Given this representation of $V$ we would impose the constraints

$$z_{1}+z_{2}+z_{3}+z_{4}+z_{5}+z_{6}\leq 4$$

to Model (1). Note that the two sets of constraints are not equivalent.
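The difference between the two sets of constraints in this six-variable example can be checked by brute force. The sketch below (plain Python with NumPy; the function names are chosen only for this illustration) confirms that both pairs of vectors span the same space and lists the selections that the single aggregate constraint allows but the two separate constraints forbid.

```python
from itertools import product

import numpy as np

a1 = np.array([1, 1, -1, 0, 0, 0])
a2 = np.array([0, 0, 0, 1, 1, -1])
b1, b2 = a1 + a2, a1 - a2          # the alternative basis from the example

# Same span: stacking all four vectors does not raise the rank above 2.
same_span = np.linalg.matrix_rank(np.vstack([a1, a2, b1, b2])) == 2


def feasible_first(z):
    """z1 + z2 + z3 <= 2 and z4 + z5 + z6 <= 2."""
    return sum(z[:3]) <= 2 and sum(z[3:]) <= 2


def feasible_second(z):
    """z1 + ... + z6 <= 4."""
    return sum(z) <= 4


all_z = list(product((0, 1), repeat=6))
only_second = [z for z in all_z if feasible_second(z) and not feasible_first(z)]
```

Every selection allowed by the first pair of constraints is also allowed by the aggregate one, but not conversely: for instance, selecting $x_{1},x_{2},x_{3}$ together passes the aggregate constraint even though it contains a full multicollinear relationship. The minimum-support representation therefore gives the tighter model.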

It is therefore important to identify the characterization of $V$ that leads to the most stringent constraints to prevent multicollinearity. Towards this objective, and ignoring the vector $\mathbf{b}$ in Theorem 2, we introduce the idea of identifying a vector $\mathbf{a}\in\operatorname{Span}(V)$ that has minimum support. We first compute the set $V=\{\mathbf{v}_{1},\ldots,\mathbf{v}_{m}\}$ of orthonormal eigenvectors with corresponding eigenvalues less than $\epsilon$. According to Theorem 2, all multicollinear relationships (up to a perturbation of $\epsilon$) are included in this space. Now, we want to find a vector $\mathbf{a}\in\operatorname{Span}(V)$ of minimum support. This is computed as follows:

[TABLE]

Note that Problem (11) can be modeled using Special Ordered Sets (SOS) of type 1, which do not need an explicit value of $M$. In the experiments, however, we utilize the big-$M$ formulation. We provide the following procedure for determining $M$. We reformulate Problem (11) to read as:

[TABLE]

Using the fact that $\bm{v}_{i}$ ’s are orthonormal, we have:

[TABLE]

Taking $\|\bm{\theta}\|_{2}=1$ , we can select $\displaystyle M=\frac{\sqrt{m}}{m}=\frac{1}{\sqrt{m}}$ . We note this is the tightest possible $M$ with equality at $\bm{v}_{i}=\bm{e}_{i}$ and $\bm{\theta}=(\frac{1}{\sqrt{m}},\cdots,\frac{1}{\sqrt{m}})$ .

Here $\delta$ is a positive constant that ensures that $\mathbf{a}\neq 0.$ Once the vector $\mathbf{a}$ has been identified, we add the constraint

[TABLE]

to Problem (1). To continue the process of identifying new linearly independent multicollinear relationships, we add Eq. (15) to Problem (11), re-solve the problem to identify a new multicollinear relationship, and add the corresponding constraint (15) to Problem (1). We continue solving Problem (11) until it becomes infeasible, which means that we have identified all linearly independent multicollinear relationships. Algorithm 1 determines all multicollinear relationships.
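To make the minimum-support idea concrete, the sketch below finds a smallest-support vector in $\operatorname{Span}(V)$ by brute-force enumeration of supports. This is deliberately not the MIO of Problem (11): it swaps in exhaustive search, which is only sensible for very small $p$, and the function name `min_support_relation` and the tolerance are assumptions for this illustration.

```python
from itertools import combinations

import numpy as np


def min_support_relation(V, tol=1e-8):
    """Smallest index set S such that a nonzero vector in the column span of V
    vanishes outside S, together with one such vector. Brute force only."""
    p, m = V.shape
    if m == 0:
        return None
    for size in range(1, p + 1):
        for S in combinations(range(p), size):
            outside = [i for i in range(p) if i not in S]
            if not outside:
                return S, V[:, 0]
            # a = V @ theta is supported on S iff V[outside] @ theta = 0.
            _, sing, Vt = np.linalg.svd(V[outside, :])
            if len(sing) < m or sing[-1] < tol:
                theta = Vt[-1]           # direction in the (near) null space
                return S, V @ theta
    return None


# Six-variable example: Span(V) is spanned by a1 and a2, given in a rotated basis.
a1 = np.array([1, 1, -1, 0, 0, 0], dtype=float)
a2 = np.array([0, 0, 0, 1, 1, -1], dtype=float)
V = np.column_stack([(a1 + a2) / np.sqrt(6), (a1 - a2) / np.sqrt(6)])
S, a = min_support_relation(V)
```

Even though neither basis vector in `V` is sparse, the search recovers the support of $x_{1}+x_{2}=x_{3}$ (indices 0, 1, 2), which corresponds to the constraint $z_{1}+z_{2}+z_{3}\leq 2$. Repeating the search while excluding supports already found would mimic the iterative process described above.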

4.3 Computational Results

In this section, we use synthetic data to evaluate the performance of Algorithm 1.

We model the design matrix $\bm{X}\in\mathbb{R}^{n\times p}$ such that $X_{ij}\sim N(0,1)$ independently for each $i\in\{1,\cdots,n\}$, $j\in\{1,\cdots,p\}$. Then we randomly select a certain number of columns to be replaced by linear combinations of other columns, $\sum_{j\in S}\gamma_{j}\bm{X}_{j}$. The parameters $\gamma_{j}$ are selected randomly from the uniform distribution on $[-10,10]$, and we control $S$ as follows:

1. We first determine the number $q$ of variables we want to involve in this multicollinear relationship.

2. We randomly select $q$ numbers from $\{1,\ldots,p\}$ without replacement, and denote that set $S$.

We add noise $\tilde{\bm{X}}$ according to the distribution indicated in Table 1, and evaluate the performance of Algorithm 1 on $\bm{X}+\tilde{\bm{X}}$.
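The data-generating procedure above can be sketched as follows. Several details are assumptions made for this illustration and are labeled in the code: one column of $S$ is replaced by a combination of the remaining $q-1$ columns of $S$, only one relationship is planted, and the noise is passed as a standard deviation (Table 1 writes $N(0,0.01)$ without saying whether the second argument is a variance or a standard deviation).

```python
import numpy as np


def make_multicollinear(n, p, q, noise_std, rng):
    """Gaussian design with one planted relationship among q columns."""
    X = rng.normal(size=(n, p))                    # X_ij ~ N(0, 1)
    S = rng.choice(p, size=q, replace=False)       # variables in the relationship
    target, sources = S[0], S[1:]                  # assumption: S[0] is replaced
    gamma = rng.uniform(-10, 10, size=q - 1)       # weights from U[-10, 10]
    X[:, target] = X[:, sources] @ gamma           # exact linear relationship
    X_noisy = X + rng.normal(scale=noise_std, size=(n, p))  # add noise matrix
    return X_noisy, np.sort(S)


X_demo, S_demo = make_multicollinear(1000, 20, 3, 0.1, np.random.default_rng(0))
smallest_eig = np.linalg.eigvalsh(X_demo.T @ X_demo)[0]
```

With noise of this size the planted relationship shows up as one eigenvalue of $\bm{X}^{T}\bm{X}$ that is far smaller than it would be for an independent Gaussian design of the same shape.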

Algorithm 1 performance is evaluated on the accuracy and the false positive rate of the multicollinear relationships found, along with the time taken for the algorithm to converge.

In Table 1, $MR(q)$ indicates the number of multicollinear relationships involving $q$ variables that have been introduced into the data. For $MR(4+)$, we randomly selected a number within $\{5,6,7,8,9,10\}$ to be the number of variables involved in the multicollinear relationship. We created 10 random instances and report the average statistics across those 10 instances.

Table 1 shows that Algorithm 1 scales up to $n,p$ in thousands and could detect multicollinearity with high accuracy and low false positive rates.

5 Holistic Linear Regression Framework Evaluation

In this section, we combine the results of the previous two sections with the framework introduced in [1] on five different datasets randomly selected from the UCI Machine Learning Repository ([12]). We refer to our framework below as Holistic. The whole framework is:

[TABLE]

We select $\delta=10^{-6}$ for the multicollinearity detection. We compare our formulation with the MISDONE formulation ($\kappa=100$) as denoted in [7] and the full MIQO formulation (ignoring the alternative solution procedure) by [3]. We note here that although all of the algorithms implement subset selection and their objectives are similar (with the exception of Holistic regression having an $\ell_{1}$ regularization term), the algorithms differ in their formulation of the constraints. Table 2 compares the relevant constraints.

We see that compared to the Holistic framework, Tamura does not have an explicit significance constraint (though the condition number constraint to some extent helps select significant variables) and Chung’s framework does not have an explicit multicollinearity constraint, replacing it with a residual constraint.

We also used Lasso ([13]), and the framework in [1] (which we denote Bootstrap) as baselines. Samples were randomized and we utilized a $60\%/20\%/20\%$ split for training, validation, and testing, where the validation set was used for tuning of hyperparameters. This includes the sparsity parameter $k$ in the Holistic, Bootstrap, and Chung’s framework (there is no explicit sparsity parameter in Tamura and Lasso) and $\Gamma$ in Lasso and the Holistic framework. We utilized a computer with an i7-5820K 6-core CPU and 16GB of DRAM for all our experiments. Julia 1.0 along with Gurobi 8.0 was used for the Holistic, Chung, Lasso, and Bootstrap frameworks. Tamura’s framework is implemented using SCIP 6.0.0 along with SCIP-SDP 3.1.1 in accordance with the original paper, as Gurobi is unable to handle the semidefinite constraints in the formulation. The results are then compared across the following dimensions:

1. Sparsity ($k$) - Number of non-zero variables in the final selected model.

2. Regression Loss (Loss) - Mean squared error on the test set.

3. Significance (Sign.) - Percentage of non-zero coefficients in the model that are significant at the $5\%$ level, evaluated using the bootstrap.

4. Time ($T$) - Total time used by the model. Time spent detecting multicollinearity for the Holistic model is shown in brackets.

5. Multicollinearity Accuracy (MA) - Let $V$ be the set as defined in Theorem 2 and $V_{\bm{z}}$ be the corresponding set in the final model using $\epsilon=10^{-2}$. Then we calculate $100\%\times\left(1-\frac{\dim(V_{\bm{z}})}{\dim(V)}\right)$, the percentage of multicollinearity relations "avoided" in the final model.
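The MA metric can be sketched in a few lines of NumPy. The reading used here is an assumption: $\dim(V)$ is taken to be the number of eigenvalues of $\bm{X}^{T}\bm{X}$ below $\epsilon$, and $\dim(V_{\bm{z}})$ the same count for the matrix restricted to the selected columns. The function names are hypothetical.

```python
import numpy as np


def near_null_dim(X, eps=1e-2):
    """Number of eigenvalues of X^T X below eps, read here as dim(V)."""
    return int(np.sum(np.linalg.eigvalsh(X.T @ X) < eps))


def multicollinearity_accuracy(X, selected, eps=1e-2):
    """MA in percent, or None when the full data has no relationship ('-')."""
    dim_full = near_null_dim(X, eps)
    if dim_full == 0:
        return None
    dim_model = near_null_dim(X[:, list(selected)], eps)
    return 100.0 * (1 - dim_model / dim_full)


# Synthetic data with two exact relationships: x3 = x1 + x2 and x6 = x4 + x5.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
X[:, 2] = X[:, 0] + X[:, 1]
X[:, 5] = X[:, 3] + X[:, 4]

ma_clean = multicollinearity_accuracy(X, [0, 1, 3, 4])    # avoids both
ma_half = multicollinearity_accuracy(X, [0, 1, 2, 3])     # keeps one
ma_none = multicollinearity_accuracy(X, [0, 1, 2, 3, 4, 5])
```

A model that drops at least one variable from every relationship scores $100\%$, while a model that retains a complete relationship loses the corresponding share.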

The results in Table 3 show that on real data, the entire framework scales to approximately $10{,}000$ in $n$ and at least $100$ in $p$, while both the MISDO and the MIQO frameworks scale less well. In particular, the MISDO formulation failed to return a feasible solution for the largest datasets, and encountered numerical issues in the process (which were also identified in the original paper). The MISDO formulation also does not consider significance constraints, which is reflected in the fact that some of the variables it selected were insignificant. The MIQO formulation scaled better than the MISDO formulation, but was still over $2\times$ slower than holistic regression on all datasets. Whereas the holistic formulation has only one set of lazy constraints, based on significance, Chung’s MIQO formulation has two (significance and residual plots), and we conjecture that the additional lazy constraints make it easier for the incumbent solution to be rejected and cause more lazy constraints to be enforced, slowing the runtime. Furthermore, the MIQO formulation does not explicitly model multicollinearity, and as a result its final models for the Parkinsons and Crime datasets avoided no more than $50\%$ of the multicollinear relationships. In comparison, holistic regression detected $100\%$ of the multicollinear relationships within the data and avoided selecting all of the variables within any such relationship in the final result.

We further conjecture that such explicit modeling of significance and multicollinearity constraints is also why the holistic framework usually selects the smallest number of variables, as it needs to satisfy more constraints on the subset selection $\bm{z}$.

Compared to the Lasso baseline, the holistic framework achieved comparable loss with Lasso among most tasks, while using many fewer variables to do so (usually less than half), and all of the selected variables from the framework are significant at the $5\%$ level. The original bootstrap method proposed in [1] quickly timed out as $p$ increased, resulting in suboptimal performance.

The computational results suggest that the proposed holistic linear regression algorithm greatly increases its scalability in detecting significance and avoiding multicollinearity. Using both real and synthetic data, we have demonstrated that the approach produces high quality linear regression models in realistic timelines.

Bibliography

[1] D. Bertsimas and A. King, “An algorithmic approach to linear regression,” Operations Research, vol. 64, no. 1, pp. 2–16, 2016.
[2] E. Carrizosa, A. V. Olivares-Nadal, and P. Ramírez-Cobo, “Enhancing interpretability by tightening linear regression models,” Technical report, 2017.
[3] S. Chung, Y. W. Park, and T. Cheong, “A mathematical programming approach for integrated multiple linear regression subset selection and validation,” arXiv preprint arXiv:1712.04543, 2017.
[4] R. R. Hocking, “A Biometrics invited paper. The analysis and selection of variables in linear regression,” Biometrics, vol. 32, no. 1, pp. 1–49, 1976.
[5] E. R. Mansfield and B. P. Helms, “Detecting multicollinearity,” The American Statistician, vol. 36, no. 3a, pp. 158–160, 1982.
[6] R. Tamura, K. Kobayashi, Y. Takano, R. Miyashiro, K. Nakata, and T. Matsui, “Mixed integer quadratic optimization formulations for eliminating multicollinearity based on variance inflation factor,” Optimization Online, 2016.
[7] R. Tamura, K. Kobayashi, Y. Takano, R. Miyashiro, K. Nakata, and T. Matsui, “Best subset selection for eliminating multicollinearity,” Journal of the Operations Research Society of Japan, vol. 60, no. 3, pp. 321–336, 2017.
[8] R. M. O’Brien, “A caution regarding rules of thumb for variance inflation factors,” Quality & Quantity, vol. 41, no. 5, pp. 673–690, 2007.
