On the Limit Imbalanced Logistic Regression by Binary Predictors

Vincent Runge

arXiv:1703.08995·stat.ME·April 19, 2018

TL;DR

This paper proposes a rescaled likelihood approach for imbalanced logistic regression with binary predictors, facilitating regularization and interpretation, especially useful in pharmacovigilance data analysis.

Contribution

It introduces a novel rescaled likelihood that simplifies regularization and interpretation in imbalanced logistic regression with binary predictors.

Findings

01. Convergence of maximum likelihood estimates under class imbalance with strong overlap conditions.

02. Analytic solutions for lasso regularization paths in binary predictor models.

03. An efficient approximate path algorithm based on matrix inversions.

Abstract

In this work, we introduce a modified (rescaled) likelihood for imbalanced logistic regression. This new approach facilitates the use of exponential priors and the computation of the lasso regularization path. More precisely, we study a limiting behavior in which class imbalance is artificially increased by replicating the majority class observations. If some strong overlap conditions are satisfied, the maximum likelihood estimate converges towards a finite value close to the initial one (intercept excluded), as shown by simulations with binary predictors. This solution corresponds to the extremum of a concave function that we refer to as the "rescaled" likelihood. In this context, the use of exponential priors has a clear interpretation as a shift on the predictor means for the minority class. Thanks to the simple binary structure, some random designs give analytic path estimators for the lasso…

Tables

Table 1: Variance analysis in the square case with intercept for coefficient $\beta_{4}=0.25$. We used the following quantities: $(sd.)^{2}=10^{-4}\sum_{i=1}^{10^{4}}(0.25-(\hat{\beta}_{4})_{i})^{2}$, $(F.sd.)^{2}=10^{-4}\sum_{i=1}^{10^{4}}V(\hat{\beta}_{4})_{i}$ and $bias=10^{-4}\sum_{i=1}^{10^{4}}(0.25-(\hat{\beta}_{4})_{i})$.

$|n|$	$\beta_{0}$	-7	-6	-5	-4	-3	-2	-1	0
	$|n^{0}|/|n^{1}|$	1052	385	142	52	19	7.1	2.7	1.0
$10^{3}$	sd.	.	.	.	.	0.768	0.467	0.333	0.286
$10^{3}$	F.sd.	.	.	.	.	0.762	0.464	0.334	0.293
$10^{3}$	bias	.	.	.	.	0.019	0.0037	0.0065	-0.0021
$10^{4}$	sd.	.	.	0.633	0.369	0.224	0.145	0.105	0.0931
$10^{4}$	F.sd.	.	.	0.618	0.361	0.221	0.145	0.104	0.0918
$10^{4}$	bias	.	.	0.019	9.4e-3	-2.4e-4	1.e-3	-3.0e-4	-2.9e-4
$10^{5}$	sd.	0.528	0.303	0.187	0.110	0.0690	0.0450	0.0328	0.0300
$10^{5}$	F.sd.	0.512	0.301	0.183	0.111	0.0685	0.0451	0.0327	0.0291
$10^{5}$	bias	0.020	5.4e-4	3.4e-3	-3.2e-5	-5.6e-5	4.5e-4	-2.8e-5	1.9e-4

Table 2: Variance and bias analysis in standard and imbalanced situations for coefficient $\beta_{4}=0.25$. The quantity $l^{1}$ is given by the formula $l^{1}=10^{-4}\sum_{i=1}^{10^{4}}|(\hat{\beta}_{4}^{imb})_{i}-(\hat{\beta}_{4})_{i}|$.

$q_{0}$	$\beta_{0}$	-5	-4	-3	-2	-1	0
	$|n^{0}|/|n^{1}|$	141	51	19	7.0	2.6	1.0
$10$	sd.	0.3614	0.2154	0.1331	0.08765	0.06384	0.05730
$10$	sd. imb.	0.3614	0.2154	0.1331	0.08766	0.06402	0.05757
$10$	bias	2.533e-3	7.572e-4	-9.553e-4	1.407e-3	2.797e-4	4.954e-4
$10$	bias imb.	2.532e-3	7.527e-4	-9.433e-4	1.439e-3	3.261e-4	5.073e-4
$10$	$l^{1}$	4.702e-4	6.849e-4	1.089e-3	1.762e-3	2.869e-3	4.806e-3
$21$	sd.	0.2643	0.1583	0.09879	0.06468	0.04839	0.04382
$21$	sd. imb.	0.2643	0.1583	0.09879	0.06474	0.04860	0.04451
$21$	bias	2.148e-3	1.956e-3	-1.583e-5	7.601e-4	2.461e-4	-6.280e-4
$21$	bias imb.	2.130e-3	1.967e-3	-8.450e-6	7.752e-4	3.000e-4	-5.337e-4
$21$	$l^{1}$	5.950e-4	8.864e-4	1.399e-3	2.281e-3	3.698e-3	5.952e-3
$32$	sd.	0.2438	0.1470	0.09346	0.06112	0.04585	0.04090
$32$	sd. imb.	0.2438	0.1471	0.09349	0.06117	0.04628	0.04171
$32$	bias	4.385e-3	1.034e-3	8.025e-4	3.584e-4	1.259e-3	-1.905e-4
$32$	bias imb.	4.383e-3	1.048e-3	7.894e-4	3.537e-4	1.313e-3	-4.297e-5
$32$	$l^{1}$	6.220e-4	9.239e-4	1.472e-3	2.341e-3	3.920e-3	6.306e-3

Table 3: Variance and bias analysis with prior distributions for coefficient $\beta_{4}=0.25$.

$\beta_{0}$	-5	-4	-3	-2	-1	0
$|n^{0}|/|n^{1}|$	143	52	19	7.1	2.7	1.0
sd.	0.3720	0.2173	0.1328	0.08835	0.06388	0.05645
im. sd.	0.3672	0.2165	0.1326	0.08837	0.06406	0.05677
J. sd.	0.3575	0.2141	0.1322	0.08814	0.06382	0.05641
bias	6.367e-3	4.247e-3	-8.032e-4	1.281e-3	-2.822e-4	3.048e-4
im. bias	3.284e-3	3.025e-3	-1.176e-3	1.159e-3	-3.229e-4	2.987e-4
J. bias	-1.631e-3	1.419e-3	-1.858e-3	8.354e-4	-5.028e-4	1.345e-4

Table 4: Path model sequence analysis. The simple algorithm (a) is more robust to the presence of correlation than the analytic path obtained under an assumption of independence (i).

nb/r	0	0.1	0.2	0.3	0.4	0.5	0.6	0.7	0.8	0.9
3 (i)	0.977	0.909	0.862	0.809	0.740	0.706	0.678	0.642	0.620	0.564
3 (a)	0.777	0.710	0.736	0.730	0.735	0.720	0.727	0.726	0.725	0.755
5 (i)	0.970	0.859	0.771	0.671	0.613	0.552	0.513	0.488	0.410	0.361
5 (a)	0.785	0.752	0.725	0.715	0.700	0.760	0.708	0.693	0.699	0.705
8 (i)	0.969	0.868	0.750	0.609	0.550	0.486	0.433	0.363	0.341	0.309
8 (a)	0.773	0.742	0.717	0.756	0.722	0.698	0.714	0.675	0.684	0.622

Equations

$$L(\beta\mid I_{0},I_{1},n^{0},n^{1})=\prod_{i=1}^{q_{1}}\left(\frac{e^{(I_{1}\beta)_{i}}}{1+e^{(I_{1}\beta)_{i}}}\right)^{n_{i}^{1}}\prod_{i=1}^{q_{0}}\left(1+e^{(I_{0}\beta)_{i}}\right)^{-n_{i}^{0}},$$

$$l(\beta)=-\log(L(\beta))=|n|\log 2+\sum_{i=1}^{q}\left(-\Delta n_{i}\left(\frac{1}{2}(\mathcal{I}\beta)_{i}\right)+n_{i}\log\cosh\left(\frac{1}{2}(\mathcal{I}\beta)_{i}\right)\right),$$

$$0=\frac{\partial l(\beta)}{\partial\beta_{j}}=\sum_{i=1}^{q}\left(-\Delta n_{i}\left(\frac{1}{2}\mathcal{I}_{ij}\right)+\frac{1}{2}n_{i}\mathcal{I}_{ij}\tanh\left(\frac{1}{2}(\mathcal{I}\beta)_{i}\right)\right),\quad j\in\{0,...,p\},$$

$$\mathcal{I}^{T}\Delta n=\mathcal{I}^{T}\left(n\tanh\left(\frac{1}{2}\mathcal{I}\beta\right)\right).$$

$$I_{1}^{T}n^{1}-I_{0}^{T}n^{0}=I_{1}^{T}\left(n^{1}\tanh\left(\frac{1}{2}I_{1}\beta\right)\right)+I_{0}^{T}\left(n^{0}\tanh\left(\frac{1}{2}I_{0}\beta\right)\right).$$

$$C_{1}=\{I_{1}^{T}u_{1}\mid u_{1}\in(\mathbb{R}_{+}^{*})^{q_{1}}\}\quad\text{and}\quad C_{0}=\{I_{0}^{T}u_{0}\mid u_{0}\in(\mathbb{R}_{+}^{*})^{q_{0}}\}.$$

$$\hat{\beta}=\mathrm{I}^{-1}\log\left(\frac{n^{1}}{n^{0}}\right).$$

$$\mathrm{I}^{-1}\begin{pmatrix}1\\ \vdots\\ 1\end{pmatrix}=\begin{pmatrix}1\\ 0\\ \vdots\\ 0\end{pmatrix},$$

$$\hat{\beta}_{i}=\log\left(\prod_{j=0}^{p}\left(\frac{n_{j}^{1}}{n_{j}^{0}}\right)^{a_{ij}}\right),\quad i\in\{0,...,p\},\quad a_{ij}=(\mathrm{I}^{-1})_{ij}\quad\text{and}\quad\sum_{j=0}^{p}a_{ij}=\delta_{0i}.$$

$$V(\hat{\beta}_{i})\approx\sum_{j=0}^{p}a_{ij}^{2}\left(\frac{1}{n_{j}^{1}}+\frac{1}{n_{j}^{0}}\right),\quad i\in\{0,...,p\},\quad a_{ij}=(\mathrm{I}^{-1})_{ij}.$$

$$\frac{N_{1}^{0}}{n_{1}^{0}}=\overline{N}^{1}+\frac{1}{s}\left(\frac{n_{2}^{0}}{(n_{1}^{0})^{2}}\left(\frac{N_{2}^{0}}{n_{2}^{0}}-\overline{N}^{1}\right)-\frac{n_{1}^{1}}{n_{1}^{0}}\left(\frac{N_{1}^{1}}{n_{1}^{1}}-\overline{N}^{1}\right)\right)+o\left(\frac{1}{s}\right),$$

$$n_{1}^{0}=\sum_{i=1}^{q_{0}}\overline{n}_{i}^{0}e^{(\mathtt{I}_{0}\tilde{\beta})_{i}},\quad n_{1}^{1}=\sum_{i=1}^{q_{1}}\overline{n}_{i}^{1}e^{(\mathtt{I}_{1}\tilde{\beta})_{i}},\quad n_{2}^{0}=\sum_{i=1}^{q_{0}}\overline{n}_{i}^{0}e^{2(\mathtt{I}_{0}\tilde{\beta})_{i}},\quad n_{2}^{1}=\sum_{i=1}^{q_{1}}\overline{n}_{i}^{1}e^{2(\mathtt{I}_{1}\tilde{\beta})_{i}},$$

$$N_{1}^{0}=\mathtt{I}_{0}^{T}\left(\overline{n}^{0}e^{\mathtt{I}_{0}\tilde{\beta}}\right),\quad N_{1}^{1}=\mathtt{I}_{1}^{T}\left(\overline{n}^{1}e^{\mathtt{I}_{1}\tilde{\beta}}\right),\quad N_{2}^{0}=\mathtt{I}_{0}^{T}\left(\overline{n}^{0}e^{2\mathtt{I}_{0}\tilde{\beta}}\right),\quad N_{2}^{1}=\mathtt{I}_{1}^{T}\left(\overline{n}^{1}e^{2\mathtt{I}_{1}\tilde{\beta}}\right).$$

$$\mathtt{I}_{0}^{T}\left(\frac{n^{0}e^{\mathtt{I}_{0}\tilde{\beta}}}{\sum_{i}n_{i}^{0}e^{(\mathtt{I}_{0}\tilde{\beta})_{i}}}\right)=\overline{N}^{1}.$$

$$\tanh\left(\frac{x}{2}\right)=-1+2e^{x}+o(e^{x}).$$

$$I_{1}^{T}n^{1}=I_{1}^{T}\left(n^{1}e^{I_{1}\beta}\right)+I_{0}^{T}\left(n^{0}e^{I_{0}\beta}\right),$$

$$\exp(\beta_{0})=\frac{|n^{1}|}{\sum_{i=1}^{q_{1}}n_{i}^{1}e^{(\mathtt{I}_{1}\tilde{\beta})_{i}}+\sum_{i=1}^{q_{0}}n_{i}^{0}e^{(\mathtt{I}_{0}\tilde{\beta})_{i}}}\approx\frac{|n^{1}|}{\sum_{i=1}^{q_{0}}n_{i}^{0}e^{(\mathtt{I}_{0}\tilde{\beta})_{i}}},$$

$$\mathtt{I}_{1}^{T}\overline{n}^{1}=\mathtt{I}_{1}^{T}\left(\frac{n^{1}e^{\mathtt{I}_{1}\tilde{\beta}}}{\sum_{i=1}^{q_{0}}n_{i}^{0}e^{(\mathtt{I}_{0}\tilde{\beta})_{i}}}\right)+\mathtt{I}_{0}^{T}\left(\frac{n^{0}e^{\mathtt{I}_{0}\tilde{\beta}}}{\sum_{i=1}^{q_{0}}n_{i}^{0}e^{(\mathtt{I}_{0}\tilde{\beta})_{i}}}\right)$$

$$\frac{n^{1}e^{\mathtt{I}_{1}\tilde{\beta}}}{\sum_{i=1}^{q_{0}}n_{i}^{0}e^{(\mathtt{I}_{0}\tilde{\beta})_{i}}}=\frac{|n^{1}|}{|n^{0}|}\,\frac{\overline{n}^{1}e^{\mathtt{I}_{1}\tilde{\beta}}}{\sum_{i=1}^{q_{0}}\overline{n}_{i}^{0}e^{(\mathtt{I}_{0}\tilde{\beta})_{i}}}\to 0$$

$$H_{ij}=\delta_{ij}\frac{e^{z_{i}}}{\sum_{k=1}^{q}e^{z_{k}}}-\frac{e^{z_{i}}}{\sum_{k=1}^{q}e^{z_{k}}}\,\frac{e^{z_{j}}}{\sum_{k=1}^{q}e^{z_{k}}},\quad i,j\in\{1,...,q\}.$$

$$\sum_{i,j=1}^{q}v_{i}H_{ij}v_{j}=\frac{\left(\sum_{k=1}^{q}e^{z_{k}}v_{k}^{2}\right)\left(\sum_{k=1}^{q}e^{z_{k}}\right)-\left(\sum_{k=1}^{q}e^{z_{k}}v_{k}\right)^{2}}{\left(\sum_{k=1}^{q}e^{z_{k}}\right)^{2}},$$

$$[h(g_{1},...,g_{q})]^{*}(m)=\min_{\substack{m_{1}+...+m_{q}=m\\ \alpha_{1}\geq 0,...,\alpha_{q}\geq 0}}\left(h^{*}(\alpha_{1},...,\alpha_{q})+\sum_{i=1}^{q}\alpha_{i}g_{i}^{*}\left(\frac{m_{i}}{\alpha_{i}}\right)\right),$$

$$F_{m}:\left\{\begin{array}{cl}\mathbb{R}^{p}&\to\mathbb{R}\,,\\ \tilde{\beta}&\mapsto m\cdot\tilde{\beta}-\log\left(\sum_{i}\overline{n}^{0}_{i}e^{(\mathtt{I}\tilde{\beta})_{i}}\right)\,.\end{array}\right.$$

$$\frac{\partial}{\partial\tilde{\beta}_{j}}\left(\sum_{i=1}^{p}m_{i}\cdot\tilde{\beta}_{i}-f(\tilde{\beta})\right)=0,\quad j\in\{1,...,p\},$$

$$f^{*}:\left\{\begin{array}{cl}(\mathbb{R}^{p})^{T}&\to\mathbb{R}\,,\\ m&\mapsto\sup\limits_{\tilde{\beta}\in\mathbb{R}^{p}}\left(m\cdot\tilde{\beta}-\log\left(\sum_{i}\overline{n}^{0}_{i}e^{(\mathtt{I}\tilde{\beta})_{i}}\right)\right)=\sup\limits_{\tilde{\beta}\in\mathbb{R}^{p}}\left(F_{m}(\tilde{\beta})\right)\,.\end{array}\right.$$

$$A=\bigg\{m\in(\mathbb{R}^{p})^{T}\,|\,\exists\,\tilde{\beta}\in\mathbb{R}^{p}\,,\,\mathtt{I}^{T}\left(\frac{\overline{n}^{0}e^{\mathtt{I}\tilde{\beta}}}{\sum_{i}\overline{n}^{0}_{i}e^{(\mathtt{I}\tilde{\beta})_{i}}}\right)=m^{T}\bigg\}\,,$$

$$B=\bigg\{m\in(\mathbb{R}^{p})^{T}\,|\,f^{*}(m)<+\infty\bigg\}\,,$$

$$C=\bigg\{m\in(\mathbb{R}^{p})^{T}\,|\,\exists\,\lambda\in(\mathbb{R}^{*}_{+})^{q}\,,\,m=\lambda^{T}\mathtt{I}\,,\,\sum_{i=1}^{q}\lambda_{i}=1\bigg\}\,.$$

$$h^{*}(\alpha_{1},...,\alpha_{q})=\left\{\begin{array}{cl}\sum_{i=1}^{q}\alpha_{i}\log(\alpha_{i})&\text{if}\ \alpha_{1}\geq 0,...,\alpha_{q}\geq 0\,,\ \alpha_{1}+...+\alpha_{q}=1\,,\\ +\infty&\text{otherwise}\,,\end{array}\right.$$

$$f^{*}(m)=[h(g_{1},...,g_{q})]^{*}(m)=\min_{\substack{\alpha_{1}\mathtt{I}_{1}+...+\alpha_{q}\mathtt{I}_{q}=m\\ \alpha_{1}+...+\alpha_{q}=1\\ \alpha_{1}\geq 0,...,\alpha_{q}\geq 0}}\left(\sum_{i=1}^{q}\alpha_{i}\log(\alpha_{i})+\sum_{i=1}^{q}\alpha_{i}(-b_{i})\right)$$

Taxonomy

Topics: Imbalanced Data Classification Techniques · Statistical Methods and Inference · Advanced Statistical Process Monitoring

Full text

The Limit Imbalanced Logistic Regression by Binary Predictors and its fast Lasso computation

Vincent Runge (E-mail: [email protected])

LaMME - Laboratoire de Mathématiques et Modélisation d’Evry.

UEVE - Université d’Evry-Val-d’Essonne.

Abstract

In this work, we introduce a modified (rescaled) likelihood for imbalanced logistic regression. This new approach facilitates the use of exponential priors and the computation of the lasso regularization path. More precisely, we study a limiting behavior in which class imbalance is artificially increased by replicating the majority class observations. If some strong overlap conditions are satisfied, the maximum likelihood estimate converges towards a finite value close to the initial one (intercept excluded), as shown by simulations with binary predictors. This solution corresponds to the extremum of a strictly concave function that we refer to as the "rescaled" likelihood. In this context, the use of exponential priors has a clear interpretation as a shift on the predictor means for the minority class. Thanks to the simple binary structure, some random designs give analytic path estimators for the lasso regularization problem. An effective approximate path algorithm by piecewise logarithmic functions based on matrix inversions is also presented. This work was motivated by its potential application to spontaneous reports databases in a pharmacovigilance context.

Keywords: path estimator, pharmacovigilance model, piecewise logarithmic approximate path, limit class imbalance, rescaled likelihood, spontaneous reports database, square exact solution.

MS classification : Primary 62J12, 62F12, 62F15; secondary 34E05, 49M29, 62P10.

1 Introduction

If the response $y=1$ is very rare compared with the response $y=0$, we are in the presence of a rare event configuration, also called class imbalance. This problem has recently attracted the attention of computer scientists, who aimed at reducing computational costs by bypassing the class imbalance with resampling methods [12] [21] [6]. With these methods, the variance in estimating model parameters increases. Statisticians are aware of this problem, and complex procedures such as local case-control sampling have been proposed [8] (a method initiated in epidemiology [19]).

In a recent work (2007) by Art B. Owen [22], the opposite approach is considered: the class imbalance is infinitely increased in order to reach the theoretical distribution of the majority class observations. Owen proved that under some overlap conditions the model parameters are finite (apart from the intercept) and built a limit system of equations related to exponential tilting, whose solution is the new estimate. The resulting equations include the distribution of the infinite class expressed through integrals, which are not easy to infer. This may explain why this work has been largely overlooked (the author found it when Sections 2 and 3 were already completed).

In our approach, the observations of the majority class are infinitely replicated and Owen's limit distribution becomes the observed distribution. This situation is a kind of degenerate case between resampling (we repeat observations) and infinite class imbalance (the observed distribution is chosen as the theoretical one). Unlike Owen's result, our limit normal equations can be interpreted as the first order conditions of a new likelihood.

The idea of this work comes from the analysis of highly imbalanced binary spontaneous reports databases. Such databases are gathered by many countries and institutions (FDA, MHRA, WHO,…). Imbalanced logistic regression with binary predictors gives a maximum likelihood estimate (MLE) very close to its limit imbalanced counterpart. This result makes it possible to study lasso-type regularization problems and to develop effective algorithms for model selection.

So far, only disproportionality methods are routinely used [18] for spontaneous report databases: predictors are analysed one by one, leading to a great number of false positive signals [13]. Mathematical tools for regression adapted to binary data have, surprisingly, barely been developed (only boolean matrices have been studied by some authors [16]). This has resulted in a proliferation of empirical methods using lasso regularization in recent years (from [3] to [1]). This is a worrying trend because the recommendations made by these experts shift towards more complicated experimental methods and time-consuming algorithms, not towards a deeper mathematical understanding. This work is motivated by the need to better analyse this kind of applied problem.

The paper contains three main sections in which we present the following results:

• In Section 2, we investigate the properties of the logistic normal equations with binary predictors. Simple existence and uniqueness conditions of Silvapulle's type are found and some exact solutions are presented. An invariance property in the presence of an intercept links this particular solution (called the "square solution") to the limit imbalanced problem. We then examine the issue of variance inflation in the imbalanced problem by computing the Fisher information.

• In Section 3, we derive Owen-type equations with a first order term evaluating the convergence rate. For the limit system of equations, the existence and uniqueness of the solution is proved with a new method leading to the minimization of a Kullback-Leibler divergence under linear constraints. A rescaling procedure on the initial likelihood and the previously found divergence justify the introduction of a rescaled likelihood corresponding to our limit imbalanced logistic regression problem. In a Bayesian framework, the Jeffreys penalty does not significantly decrease the variance of the estimator, but other, more appropriate priors, such as exponential ones (chosen according to the situation), could help to reduce it. The closeness in simulation between limit estimates and classical estimates prompts us to go one step further with the study of regularization paths, in particular if the model is known to be sparse.

• In Section 4, we look at a lasso regularization problem for the rescaled likelihood, which has a clear interpretation as a shift on the predictor means for the class of interest. We succeed in finding some path estimators in a few particular cases (independence and orthogonal design). In the presence of correlation, we present an effective path following algorithm by piecewise logarithmic functions giving precise estimates. We conclude by explaining the need for an analysis of the correlation structure between predictors. This leads to simple algorithmic procedures with small computational costs, for which many different prior penalties could easily be tested. Two examples are given using the French spontaneous reports database.

The expressions "infinitely imbalanced" and "limit imbalanced" are considered synonymous, although we recommend the second one in our context because of the simple unique limit we impose and by analogy with hydrodynamic limits (in fluid dynamics), while the first expression is related to the underlying distribution introduced by Owen.

We conclude this article by discussing the many opportunities that arise with the introduction of a rescaled likelihood in a Bayesian context and of the path following algorithm by logarithmic functions.

2 The logistic regression by binary predictors

2.1 Logistic normal equations

The binary logistic regression (BLR) problem consists in the determination of coefficients $\hat{\beta}$ maximizing a smooth and concave likelihood function given by the relation

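$$L(\beta\mid I_{0},I_{1},n^{0},n^{1})=\prod_{i=1}^{q_{1}}\left(\frac{e^{(I_{1}\beta)_{i}}}{1+e^{(I_{1}\beta)_{i}}}\right)^{n_{i}^{1}}\prod_{i=1}^{q_{0}}\left(1+e^{(I_{0}\beta)_{i}}\right)^{-n_{i}^{0}},$$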

where $\beta=(\beta_{i})\in\mathbb{R}^{p+1}$ is indexed from zero with $\beta_{0}$ corresponding to the intercept. Binary design matrices $I_{1}\in\mathcal{M}_{q_{1}\times(p+1)}(\mathbb{B})$ and $I_{0}\in\mathcal{M}_{q_{0}\times(p+1)}(\mathbb{B})$ with $\mathbb{B}=\{0,1\}$ are of full rank: they aggregate the $p$ binary predictors. Vectors of weights $n^{0}=(n_{1}^{0},...,n_{q_{0}}^{0})^{T}\in(\mathbb{N}^{*})^{q_{0}}$ and $n^{1}=(n_{1}^{1},...,n_{q_{1}}^{1})^{T}\in(\mathbb{N}^{*})^{q_{1}}$ record the number of repetitions of each distinct observation in response classes $0$ and $1$ separately. The binary structure favours repetitions in the sequence of observations, which justifies these notations. Moreover $(I_{0}\beta)_{i}$ is the $i$-th component of the vector $I_{0}\beta\in\mathbb{R}^{q_{0}}$ (and similarly for $(I_{1}\beta)_{i}$).

We introduce other notations thereafter used within this article. The modulus of a vector denotes its $l^{1}$ norm, while the overline sign on lower cases stands for $l^{1}$ normalization. For example $|n^{1}|=\sum_{i=1}^{q_{1}}n_{i}^{1}$ and $\overline{n}^{1}_{i}=n^{1}_{i}/|n^{1}|$ gives the vector $\overline{n}^{1}$ . $A_{i}$ is the i-th row of the matrix $A$ and its roman upper case equivalent $\mathtt{A}$ is the matrix $A$ in which the first column filled by ones (associated to the intercept) was removed. We also need $N^{1}=\mathtt{I}_{1}^{T}n^{1}\in\mathbb{R}^{p}$ with $T$ standing for the matrix transpose operator. An important feature in our study is the predictor means vector $\overline{N}^{1}$ for class $1$ obtained by the relation $\mathtt{I}^{T}_{1}\overline{n}^{1}=\overline{N}^{1}$ . For vectors of same size $u,\,v\in\mathbb{R}^{q}$ , $uv$ (resp. $\frac{u}{v}$ ) is the vector with components $u_{k}v_{k}$ (resp. $\frac{u_{k}}{v_{k}}$ ), $k\in\{1,...,q\}$ . $\tilde{\beta}$ is the vector $\beta$ without the intercept coefficient $\beta_{0}$ . From Subsection 3.2, the notations $I$ and $\mathtt{I}$ for matrices $I_{0}$ and $\mathtt{I}_{0}$ respectively are often used (as well as $q$ for integer $q_{0}$ ).

For ease of calculation, we consider the opposite of the log-likelihood. If $I_{1}=I_{0}=\mathcal{I}$ , we have $q_{1}=q_{0}=q$ and we can introduce vectors $n=n^{1}+n^{0}$ and $\Delta n=n^{1}-n^{0}$ . In this latter case, we write

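$$l(\beta)=-\log(L(\beta))=|n|\log 2+\sum_{i=1}^{q}\left(-\Delta n_{i}\left(\frac{1}{2}(\mathcal{I}\beta)_{i}\right)+n_{i}\log\cosh\left(\frac{1}{2}(\mathcal{I}\beta)_{i}\right)\right),$$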

and first order conditions are computed, differentiating $l$ with respect to each $\beta_{j}$ coefficient. We obtain

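$$0=\frac{\partial l(\beta)}{\partial\beta_{j}}=\sum_{i=1}^{q}\left(-\Delta n_{i}\left(\frac{1}{2}\mathcal{I}_{ij}\right)+\frac{1}{2}n_{i}\mathcal{I}_{ij}\tanh\left(\frac{1}{2}(\mathcal{I}\beta)_{i}\right)\right),\quad j\in\{0,...,p\},$$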

or in matrix form

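$$\mathcal{I}^{T}\Delta n=\mathcal{I}^{T}\left(n\tanh\left(\frac{1}{2}\mathcal{I}\beta\right)\right).\tag{2.1}$$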

In a general framework with non-identical matrices $I_{0}$ and $I_{1}$ , we likewise derive

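$$I_{1}^{T}n^{1}-I_{0}^{T}n^{0}=I_{1}^{T}\left(n^{1}\tanh\left(\frac{1}{2}I_{1}\beta\right)\right)+I_{0}^{T}\left(n^{0}\tanh\left(\frac{1}{2}I_{0}\beta\right)\right).\tag{2.2}$$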

This system of equations (2.2) gathers the so-called logistic normal equations and will be widely used within this article.

Remark 2.1.

These equations are usually presented with a logistic function but we chose another expression to highlight the link with existence and uniqueness conditions.
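As a numerical illustration of Remark 2.1, the following sketch (Python with NumPy) compares the residual of the tanh form (2.2) with the gradient of the negative log-likelihood written with the usual logistic function $\sigma$. Since $\tanh(x/2)=2\sigma(x)-1$, the two quantities should differ only by the factor $-2$. The toy design matrices, the weights and the function names are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def sigmoid(x):
    """Logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def tanh_residual(beta, I1, n1, I0, n0):
    """Left-hand side minus right-hand side of the normal equations (2.2)."""
    lhs = I1.T @ n1 - I0.T @ n0
    rhs = (I1.T @ (n1 * np.tanh(0.5 * (I1 @ beta)))
           + I0.T @ (n0 * np.tanh(0.5 * (I0 @ beta))))
    return lhs - rhs

def logistic_gradient(beta, I1, n1, I0, n0):
    """Gradient of the negative log-likelihood in the usual logistic form."""
    return (-I1.T @ (n1 * (1.0 - sigmoid(I1 @ beta)))
            + I0.T @ (n0 * sigmoid(I0 @ beta)))

# Hypothetical toy data: intercept column first, then p = 2 binary predictors.
I1 = np.array([[1., 0., 1.], [1., 1., 0.], [1., 1., 1.]])
I0 = np.array([[1., 0., 0.], [1., 0., 1.], [1., 1., 0.], [1., 1., 1.]])
n1 = np.array([3., 2., 4.])        # repetitions of each row in class 1
n0 = np.array([10., 7., 5., 6.])   # repetitions of each row in class 0
beta = np.array([-0.5, 0.3, -0.2])  # arbitrary evaluation point

residual = tanh_residual(beta, I1, n1, I0, n0)
gradient = logistic_gradient(beta, I1, n1, I0, n0)
print(np.allclose(residual, -2.0 * gradient))
```

Both forms therefore vanish at the same $\beta$, so either one can be used to characterize the MLE.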

2.2 Existence and uniqueness

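Necessary and sufficient conditions ensuring existence and uniqueness of the MLE are well known: they were established by Silvapulle in 1981 [27]. They consist in satisfying an overlap condition $C_{1}\cap C_{0}\neq\emptyset$ between the cones

$$C_{1}=\{I_{1}^{T}u_{1}\mid u_{1}\in(\mathbb{R}_{+}^{*})^{q_{1}}\}\quad\text{and}\quad C_{0}=\{I_{0}^{T}u_{0}\mid u_{0}\in(\mathbb{R}_{+}^{*})^{q_{0}}\}.$$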

For the BLR problem, a more convenient description is possible:

Theorem 2.1.

The BLR problem admits a unique solution if and only if there exist $n^{+}\in(\mathbb{N}^{*})^{q_{1}}$ and $n^{*}\in(\mathbb{N}^{*})^{q_{0}}$ , such that $I_{1}^{T}n^{+}=I_{0}^{T}n^{*}$ .

Looking at equations (2.2), this theorem means that an MLE exists and is unique if one can find a pair $(n^{+},n^{*})$ of weights on the rows of $I_{1}$ and $I_{0}$, with $|n^{+}|=|n^{*}|$, for which all the regression coefficients vanish (intercept included). An easy necessary condition to check is that at least one $0$ and one $1$ are present in each column of $I_{0}$ and $I_{1}$ (with the exception of the first column of ones corresponding to the intercept).

Proof.

If $I_{1}^{T}n^{+}=I_{0}^{T}n^{*}$ , the Silvapulle’s condition is immediately verified. Reciprocally, $C_{1}\cap C_{0}$ is an open subset of $\mathbb{R}^{p+1}$ with positive measure because $I_{0}$ and $I_{1}$ are full rank matrices. By a density argument, there exist $q\in(\mathbb{Q}\cap]0,1[)^{p+1}$ , $\lambda\in(\mathbb{R}^{*}_{+})^{q_{0}}$ and $\mu\in(\mathbb{R}^{*}_{+})^{q_{1}}$ satisfying $I^{T}_{0}\lambda=I^{T}_{1}\mu=q$ . We reorder the rows in $I_{0}$ and $I_{1}$ such that the first $p+1$ rows are linearly independent. Let $H_{0}$ in $\mathcal{M}_{q_{0}\times(q_{0}-p-1)}(\mathbb{R})$ and $H_{1}$ in $\mathcal{M}_{q_{1}\times{(q_{1}-p-1)}}(\mathbb{R})$ be orthogonal matrices to $I_{0}$ and $I_{1}$ respectively. Because of the reorganization of the rows in $I_{i}$ ( $i\in\{0,1\}$ ) we can choose a $H_{i}$ where its last $q_{i}-p-1$ rows form an identity matrix $\mathbb{I}_{q_{i}-p-1}$ . For all $\alpha_{0}\in\mathbb{R}^{q_{0}-p-1}$ and $\alpha_{1}\in\mathbb{R}^{q_{1}-p-1}$ we have the relation $I^{T}_{0}(\lambda+H_{0}\alpha_{0})=I^{T}_{1}(\mu+H_{1}\alpha_{1})=q$ . Again with a density argument, we find $\alpha_{0}$ such that $\lambda_{i}+(\alpha_{0})_{i}\in\mathbb{Q}^{*}_{+}$ for all $i\in\{p+2,...,q_{0}\}$ and satisfying the constraint $\lambda+H_{0}\alpha_{0}\in(\mathbb{R}_{+}^{*})^{q_{0}}$ . For a matrix $A\in\mathcal{M}_{n\times m}(\mathbb{R})$ , a vector $v\in\mathbb{R}^{m}$ and $J\subset\{1,...,m\}$ , let $[Av]_{J}$ denote the vector $A_{J}v_{J}$ , where $A_{J}$ (resp. $v_{J}$ ) corresponds to the submatrix of $A$ (resp. subvector of $v$ ) obtained by removing from $A$ (resp. from $v$ ) the columns (resp. rows) that do not correspond to the indices in $J$ . With this notation, we have $[I^{T}_{0}(\lambda+H_{0}\alpha_{0})]_{\{1,...,p+1\}}=q-[I^{T}_{0}(\lambda+H_{0}\alpha_{0})]_{\{p+2,...,q_{0}\}}\in\mathbb{Q}^{p+1}$ . 
The binary matrix $(I^{T}_{0})_{\{1,...,p+1\}}$ is then nonsingular and using its inverse in $\mathcal{M}_{(p+1)\times(p+1)}(\mathbb{Q})$ we obtain $(\lambda+H_{0}\alpha_{0})_{\{1,...,p+1\}}\in(\mathbb{Q}_{+}^{*})^{p+1}$ . Finally $\alpha_{0}^{*}=\lambda+H_{0}\alpha_{0}\in(\mathbb{Q}_{+}^{*})^{q_{0}}$ . The same arguments lead to a set of coefficients $\alpha_{1}^{*}=\mu+H_{1}\alpha_{1}\in(\mathbb{Q}_{+}^{*})^{q_{1}}$ . Multiplying the vector $(\alpha_{0}^{*},\alpha_{1}^{*})$ by the least common multiple of all its denominators proves the result. ∎
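The two checks discussed around Theorem 2.1 can be sketched in a few lines of Python with NumPy. The first function tests the easy necessary condition (each non-intercept column of $I_{0}$ and $I_{1}$ contains at least one $0$ and one $1$); the second verifies that a candidate pair $(n^{+},n^{*})$ of positive integer weights satisfies $I_{1}^{T}n^{+}=I_{0}^{T}n^{*}$. The function names and the toy matrices are assumptions made for illustration, and the sketch does not search for a certificate, it only checks a given one.

```python
import numpy as np

def has_zero_and_one_in_each_column(design):
    """True if every non-intercept column contains at least one 0 and one 1."""
    predictors = np.asarray(design)[:, 1:]  # drop the first column of ones
    return bool(np.all(predictors.min(axis=0) == 0)
                and np.all(predictors.max(axis=0) == 1))

def necessary_condition(I1, I0):
    """Necessary (not sufficient) condition stated after Theorem 2.1."""
    return (has_zero_and_one_in_each_column(I1)
            and has_zero_and_one_in_each_column(I0))

def is_certificate(I1, I0, n_plus, n_star):
    """Check that positive integer weights satisfy I1^T n+ = I0^T n*."""
    n_plus, n_star = np.asarray(n_plus), np.asarray(n_star)
    positive = bool(np.all(n_plus >= 1) and np.all(n_star >= 1))
    return positive and np.array_equal(np.asarray(I1).T @ n_plus,
                                       np.asarray(I0).T @ n_star)

# Hypothetical toy designs: intercept column first, then p = 2 predictors.
I1 = np.array([[1, 0, 1], [1, 1, 0], [1, 1, 1]])
I0 = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]])

print(necessary_condition(I1, I0))                      # both columns mix 0 and 1
print(is_certificate(I1, I0, [2, 2, 2], [1, 1, 1, 3]))  # both sides equal (6, 4, 4)

# A design whose first predictor is always 1 in class 1 fails the quick check.
I1_bad = np.array([[1, 1, 0], [1, 1, 1]])
print(necessary_condition(I1_bad, I0))
```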

2.3 The square case

The situation with identical square design matrices $I_{0}$ and $I_{1}$ is worthwhile in itself because it leads to explicit analytic formulae for the MLE and their variance (in the asymptotic case). In particular, we focus on the introduction of imbalance between $n^{1}$ and $n^{0}$ to emphasize the simple solution for MLE and the problem of variance inflation.

Theorem 2.2.

If $I_{0}=I_{1}=\mathrm{I}$ is a square matrix, we have the following closed form for the maximum likelihood estimator:

$$\hat{\beta}=\mathrm{I}^{-1}\log\left(\frac{n^{1}}{n^{0}}\right).\tag{2.3}$$

Proof.

The matrix $\mathrm{I}$ satisfies $q=p+1$ and is nonsingular, with inverse $\mathrm{I}^{-1}$ (because $\mathrm{I}$ is of full rank). The vector $T$ is defined as $T=\tanh\left(\frac{1}{2}\mathrm{I}\beta\right)\in\mathbb{R}^{p+1}$, i.e. $\mathrm{I}\beta=\log\left(\frac{1+T}{1-T}\right)$. Multiplying (2.1) by $(\mathrm{I}^{-1})^{T}=(\mathrm{I}^{T})^{-1}$, we get $T=\frac{\Delta n}{n}$. Hence, $\beta=\mathrm{I}^{-1}\log\left(\frac{n+\Delta n}{n-\Delta n}\right)=\mathrm{I}^{-1}\log\left(\frac{n^{1}}{n^{0}}\right)$, since $n+\Delta n=2n^{1}$ and $n-\Delta n=2n^{0}$, which completes the proof. ∎
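The closed form of Theorem 2.2 is easy to check numerically. The sketch below (Python with NumPy) computes $\hat{\beta}=\mathrm{I}^{-1}\log(n^{1}/n^{0})$ for a small square design and verifies that the normal equations in matrix form, $\mathrm{I}^{T}\Delta n=\mathrm{I}^{T}(n\tanh(\frac{1}{2}\mathrm{I}\beta))$, are satisfied at this point. The design matrix, the weights and the function name `square_mle` are hypothetical choices for illustration.

```python
import numpy as np

def square_mle(design, n1, n0):
    """Closed form for a square nonsingular design: solve I beta = log(n1 / n0)."""
    return np.linalg.solve(design, np.log(n1 / n0))

# Hypothetical square case: q = p + 1 = 3 distinct rows, intercept column first.
design = np.array([[1., 0., 1.],
                   [1., 1., 0.],
                   [1., 1., 1.]])
n1 = np.array([5., 8., 3.])     # counts of y = 1 for each row
n0 = np.array([50., 40., 60.])  # counts of y = 0 for each row

beta_hat = square_mle(design, n1, n0)

# Residual of the normal equations in matrix form, evaluated at beta_hat.
n, delta_n = n1 + n0, n1 - n0
residual = design.T @ delta_n - design.T @ (n * np.tanh(0.5 * (design @ beta_hat)))

print(beta_hat)
print(np.allclose(residual, 0.0))
```

Using `np.linalg.solve` instead of forming the inverse explicitly is a numerical convenience only; it returns the same vector as the formula.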

Remark 2.2.

If one of the components in the vectors of weights $n^{1}$ or $n^{0}$ vanishes, some of the regression coefficients become infinite (but not necessarily all of them).

To our knowledge, this is the first general closed form found in the resolution of a logistic regression. There exist partial results for a unique categorical predictor, presented by Lipovetsky in 2014 [17]. An explanation for the lack of such a simple result lies in the poorly studied finite observation structure made possible through binary predictors with repetitions. In Appendix A, some particular solutions to equations (2.3) are presented.

2.3.1 Invariance if intercept

We establish an invariance property making a link with the imbalanced problem.

Proposition 2.1.

In the square case with intercept, multiplying all the components of $n^{1}$ or $n^{0}$ by the same integer does not change the value of the MLE apart from the intercept.

Proof.

The inverse of a matrix with an intercept term verifies the relation

[TABLE]

which means that we can rewrite equations (2.3) as

[TABLE]

Substituting $n_{j}^{0}$ by $s\times n_{j}^{0}$ (or $n_{j}^{1}$ by $s\times n_{j}^{1}$ ) with $s\in\mathbb{N}^{*}$ gives the same result for all $\beta_{i}\,,\,i\in\{1,...,p\}$ . ∎
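The invariance can be illustrated numerically. The sketch below again assumes the closed form $\beta=\mathrm{I}^{-1}\log(n^{1}/n^{0})$ (our reading of equations (2.3)); the matrix, the counts and the replication factor `s` are hypothetical.

```python
import numpy as np

# Hypothetical square binary design whose first column is the intercept.
I = np.array([[1.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0]])
n1 = np.array([15.0, 8.0, 20.0])
n0 = np.array([60.0, 45.0, 30.0])
s = 50  # replication factor applied to the majority class

beta = np.linalg.solve(I, np.log(n1 / n0))
beta_scaled = np.linalg.solve(I, np.log(n1 / (s * n0)))

# Only the intercept moves, by -log(s); the other coefficients are unchanged.
shift = beta_scaled - beta
```

The shift equals $-\log(s)\,\mathrm{I}^{-1}\mathbf{1}$, and $\mathrm{I}^{-1}\mathbf{1}$ is the first canonical basis vector because the first column of $\mathrm{I}$ is the vector of ones.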

2.3.2 Asymptotic variance

To conclude this section, we study the asymptotic behavior of the estimator for large $|n^{1}|$ and $|n^{0}|$ . Since the MLE (intercept excluded) remains the same with or without a class imbalance (see the invariance property), we have a glimpse of a general property in class imbalance.

Proposition 2.2.

In the square case BLR problem, the variance of the maximum likelihood estimator is approximately given by relations

[TABLE]

Proof.

We compute the observed Fisher information $\mathcal{I}(\hat{\beta})=\mathrm{I}^{T}D\mathrm{I}$ with $D$ a diagonal matrix with elements $n_{i}\hat{p}_{i}(1-\hat{p}_{i})$ and $\hat{p}_{i}=1/(1+e^{-(I\hat{\beta})_{i}})$ . Its inverse gives the desired result, knowing that $n_{i}\hat{p}_{i}=n_{i}^{1}$ and $n_{i}(1-\hat{p}_{i})=n_{i}^{0}$ . ∎
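The inversion step can be checked numerically. In the sketch below we assume, with the notation $a_{ij}=(\mathrm{I}^{-1})_{ij}$ of Remark 2.3, that the diagonal of the inverse Fisher information reduces to $\sum_{j}a_{ij}^{2}(\frac{1}{n_{j}^{1}}+\frac{1}{n_{j}^{0}})$; this reading of the proposition and all numerical values are assumptions of the example.

```python
import numpy as np

I = np.array([[1.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [1.0, 1.0, 1.0]])
n1 = np.array([12.0, 30.0, 7.0])
n0 = np.array([40.0, 25.0, 50.0])
n = n1 + n0

# In the square (saturated) case the fitted probabilities are n1 / n.
p_hat = n1 / n
D = np.diag(n * p_hat * (1.0 - p_hat))

# Variance from the observed Fisher information I^T D I.
cov_fisher = np.linalg.inv(I.T @ D @ I)

# Same diagonal written with a_ij = (I^{-1})_ij and D^{-1}_jj = 1/n1_j + 1/n0_j.
A = np.linalg.inv(I)
var_closed = (A ** 2) @ (1.0 / n1 + 1.0 / n0)
```

Both expressions agree because $(\mathrm{I}^{T}D\mathrm{I})^{-1}=\mathrm{I}^{-1}D^{-1}(\mathrm{I}^{-1})^{T}$ and $D_{jj}=n_{j}^{1}n_{j}^{0}/n_{j}$.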

Remark 2.3.

Another method uses the closed form (2.3) to perform variance and bias estimations by Taylor expansions with the multinomial random vector $(n^{1},n^{0})$ . We obtain $V(\hat{\beta}_{i})\approx\sum_{j=0}^{p}a_{ij}^{2}(\frac{1}{n_{j}^{1}}+\frac{1}{n_{j}^{0}}-\frac{2}{|n|})$ and $Bias(\hat{\beta}_{i})\approx\sum_{j=0}^{p}\frac{a_{ij}}{2}(\frac{1}{n_{j}^{0}}-\frac{1}{n_{j}^{1}})\,,i\in\{0,...,p\}\,$ . However, simulations give inaccurate results and only the Fisher information method should be retained.

We investigate the variation of the variance with respect to the sample size $|n|$ and the value of the intercept $\beta_{0}$ for a simple fixed model $(\beta_{1},...,\beta_{5})=(-0.5,-0.25,0,0.25,0.5)$ . With these two parameters given, we simulate $10^{4}$ data sets with a different random binary square matrix $\mathrm{I}$ and different random vectors $n^{1}$ and $n^{0}$ for each of them (but $|n^{1}|+|n^{0}|$ is fixed). In table 1, we compare the estimated standard deviation (sd.) with the Fisher standard deviation given in Proposition 2.2 (F.sd.) accompanied by an estimation of the bias (bias) for coefficient $\beta_{4}=0.25$ .

These simulations highlight the accuracy of the "Fisher variance" in all configurations, which is very close to the estimated one. Bias is negligible compared with variance. For a constant number of observations $|n|$ , the variance increases when the imbalance between classes strengthens. This variance inflation is a key issue in class imbalance; we explain later how one can easily add prior information to a rescaled likelihood to deal with this problem (see Subsection 3.4).

3 Limit imbalanced study

3.1 Owen-type equations

The limit case consists in infinitely replicating the majority class observations as if the theoretical distribution of this class was the observed one. This is a degenerate case of Owen's study, which is why we know that the intercept coefficient tends to minus infinity whereas the other regression coefficients are finite if a stronger overlap condition is satisfied [22]. For the limit equations, an information reduction for the majority class occurs: only the means of the predictors matter, and the correlation structure in this class of interest "disappears".

The following proposition presents the logistic normal equations (2.2) in a new form with a remainder term arising in case of class imbalance.

Proposition 3.1.

For an imbalanced binary logistic regression whose class size for response $y=0$ is $s$ times greater than the one for response $y=1$ , we obtain the system of equations

[TABLE]

with $s=\frac{|n^{0}|}{|n^{1}|}\gg 1$ . We used notations:

[TABLE]

and for vectors in $\mathbb{R}^{p}$ :

[TABLE]

The technical proof of this result is exposed in Appendix B.

As shown by simulations (see table 2), the first order and remainder terms are negligible quantities with binary predictors, even if there is no imbalance! This suggests the introduction of the following limit imbalanced equations, obtained with $s=+\infty$ in Proposition 3.1.

Theorem 3.1.

*For infinitely imbalanced binary logistic regression verifying a strong overlap condition (see Theorem 3.2), the following system of $p$ limit imbalanced equations holds (footnote 2: with non-binary design matrices $X_{1}$ and $X_{0}$ and no vectors of weights, we obtain $\mathtt{X}_{0}^{T}\left(\frac{e^{\mathtt{X}_{0}\tilde{\beta}}}{\sum_{i}e^{(\mathtt{X}_{0}\tilde{\beta})_{i}}}\right)=\overline{N}^{1}\,.$ These equations also differ from Owen’s [22]):*

[TABLE]

Notice that the $\tilde{\beta}$ coefficients do not depend on the structure in rows of the design matrix associated to response $y=1$ but only on the means of ones for each predictor: $\overline{N}^{1}$ .

We give a direct simple proof, avoiding the complicated previous proof of Appendix B.

Proof.

For $x$ near minus infinity, the hyperbolic tangent has the following first order expansion:

[TABLE]

From [22] we know that the intercept term tends to minus infinity, so with $x=I_{0}\beta$ or $x=I_{1}\beta$ , we use the previous expansion neglecting the remainder term. Thus, equations (2.2) become

[TABLE]

and factoring by $\exp(\beta_{0})$ in the first equation of this system we have

[TABLE]

because $\frac{|n^{0}|}{|n^{1}|}\to+\infty$ . Looking back at (3.2) without the first equation, we have

[TABLE]

but

[TABLE]

because $\frac{|n^{1}|}{|n^{0}|}\to 0$ and we obtain the desired result. ∎

In table 2, we present simulation results based on limit imbalanced equations (3.1) compared with classical logistic regression (2.2). The sampling procedure is the same as the one used for table 1 except that we fix the sample size at $|n|=|n^{1}|+|n^{0}|=10^{4}$ and vary the dimension of the matrix $I_{0}$ (we chose $q_{0}=10,21,32$ ).

The two estimates $\hat{\beta}_{4}$ for standard and imbalanced regressions are very close to each other as shown by the mean of the $l^{1}$ norm – even if the problem is not imbalanced – so that standard deviation and bias are almost the same. This means that, if interesting properties can be established with the limit equations, this context will be appropriate to highlight new features in classical logistic regression.

The $1/s$ first order term in Proposition 3.1 should be estimated to understand how good the limit imbalanced approximation is, without having to estimate the standard regression coefficients. Simulations show that this term is very small and we choose not to dwell on this intermediate situation, but it could be a more important result if non-binary design matrices are involved.

3.2 Strong overlap condition and rescaled likelihood

Existence and uniqueness conditions to solve (3.1) are well known [22]: they consist in an overlap condition slightly stronger than the one given by Silvapulle. In fact, we need the point $\overline{N}^{1}$ to be surrounded by the rows of $\mathtt{I}_{0}$ (hereafter denoted by the letter $\mathtt{I}$ ). We give this result in the framework of the binary problem (simpler than Owen’s general case) and establish a new proof leading to a minimum relative entropy problem. From there and using duality, we build the corresponding rescaled likelihood, also justified by a rescaling of the initial likelihood.

Theorem 3.2.

There exists a unique finite solution to the limit imbalanced BLR problem if and only if there exists $\lambda\in(\mathbb{R}_{+}^{*})^{q}$ such that $\mathtt{I}^{T}\lambda=\overline{N}^{1}$ and $\sum_{i=1}^{q}\lambda_{i}=1$ . (If present, the null row (such that $\mathtt{I}_{i}=(0,...,0)$ ) is removed; footnote 3: in order to have non-zero coefficients $\lambda$ as for the overlap condition in Theorem 2.1.)

Remark 3.1.

The condition $\mathtt{I}^{T}\lambda=\overline{N}^{1}$ means that we have $I^{T}_{0}\lambda=I^{T}_{1}\overline{n}^{1}$ with $\lambda\in(\mathbb{R}_{+}^{*})^{q_{0}}$ and $\overline{n}^{1}\in(\mathbb{R}_{+}^{*})^{q_{1}}$ so that $C_{1}\cap C_{0}\neq\emptyset$ . In other words, the existence and uniqueness of a solution for the limit problem implies existence and uniqueness for its associated BLR problem.

Our proof of this theorem is based on the following three lemmas.

Lemma 3.1.

The log-sum-exp function $h:\mathbb{R}^{q}\mapsto\mathbb{R}$ , defined by $h(z)=\log(\sum_{i=1}^{q}e^{z_{i}})$ is a convex, continuous, increasing function on $\mathbb{R}^{q}$ . The function $f:\mathbb{R}^{p}\mapsto\mathbb{R}$ , $f(\tilde{\beta})=\log(\sum_{i=1}^{q}\overline{n}^{0}_{i}e^{(\mathtt{I}\tilde{\beta})_{i}})$ is continuous and convex on $\mathbb{R}^{p}$ .

Proof.

The function $h$ has a positive semi-definite Hessian and is therefore convex. Furthermore for all $y,z\in\mathbb{R}^{q}$ such that $y_{i}\leq z_{i}$ , $i\in\{1,...,q\}$ , we have $h(y)\leq h(z)$ and the function is increasing on $\mathbb{R}^{q}$ . The composition with an affine mapping preserves continuity and convexity. Thus, with $z=\mathtt{I}\tilde{\beta}+b$ and $\overline{n}^{0}=e^{b}$ we obtain a convex continuous $f(\tilde{\beta})=h(\mathtt{I}\tilde{\beta}+b)$ and $dom\,f=\mathbb{R}^{p}$ . ∎

Lemma 3.2.

The function $f:\mathbb{R}^{p}\mapsto\mathbb{R}$ , $f(\tilde{\beta})=\log(\sum_{i=1}^{q}\overline{n}^{0}_{i}e^{(\mathtt{I}\tilde{\beta})_{i}})$ is strictly convex on $\mathbb{R}^{p}$ .

Proof.

The Hessian $H$ of $h:\mathbb{R}^{q}\mapsto\mathbb{R}$ , $h(z)=\log(\sum_{i=1}^{q}e^{z_{i}})$ , is the following:

[TABLE]

For all $v=(v_{1},...,v_{q})^{T}\in\mathbb{R}^{q}$ , we have

[TABLE]

which is non-negative due to the Cauchy-Schwarz inequality. This expression is equal to zero if and only if there exists $\lambda\in\mathbb{R}$ such that $e^{z_{k}/2}v_{k}=\lambda e^{z_{k}/2}$ , that is $v_{k}=\lambda\,,\forall k\in\{1,...,q\}$ . Thus, the function $h$ is affine only in the constant direction $z_{k}(t)=t+z_{k}(0)\,,k\in\{1,...,q\},t\in\mathbb{R}$ ; in any other direction, it is strictly convex.

Suppose that there exists a family of parameters $F_{a}=\{\tilde{\beta}(t)\in\mathbb{R}^{p},t\in[0,a],a>0\}$ such that $z(t)=\mathtt{I}\tilde{\beta}(t)+b=t+z(0)$ and $e^{b}=\overline{n}^{0}$ . This means that along the path described by $\tilde{\beta}(t)$ the function $f$ is affine. We obtain $\mathtt{I}(\tilde{\beta}(t)-\tilde{\beta}(0))=t$ and with $t\neq 0$ , we have $\gamma=(-t,\tilde{\beta}(t)-\tilde{\beta}(0))^{T}\in\mathbb{R}^{p+1}\setminus\{0\}^{p+1}$ such that $I\gamma=0$ . This is impossible because the matrix $I$ is of full rank, which proves the lemma. ∎
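The two facts used in this proof (a positive semi-definite Hessian whose only flat direction is the constant one) can be checked numerically. The sketch below uses the standard expression of the log-sum-exp Hessian, $\mathrm{diag}(w)-ww^{T}$ with $w$ the softmax of $z$; the test point and the direction $v$ are arbitrary.

```python
import numpy as np

def logsumexp_hessian(z):
    """Hessian of h(z) = log(sum_i exp(z_i)): diag(w) - w w^T, w = softmax(z)."""
    w = np.exp(z - z.max())   # subtract the max for numerical stability
    w /= w.sum()
    return np.diag(w) - np.outer(w, w), w

rng = np.random.default_rng(0)
z = rng.normal(size=5)
H, w = logsumexp_hessian(z)

eigvals = np.linalg.eigvalsh(H)           # sorted in ascending order
flat_direction = H @ np.ones_like(z)      # the constant direction is in the kernel

# v^T H v is the variance of v under the weights w (Cauchy-Schwarz form).
v = rng.normal(size=5)
quad = v @ H @ v
weighted_var = w @ v**2 - (w @ v) ** 2
```

Exactly one eigenvalue is (numerically) zero, which matches the statement that $h$ is affine only along the constant direction.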

We present a corollary to a theorem on the Legendre-Fenchel transform of convex composite functions exposed in [14].

Lemma 3.3.

If functions $g_{i}:\mathbb{R}^{p}\mapsto\mathbb{R}$ , $i\in\{1,...,q\}$ are convex and continuous with $dom\,g_{i}=\mathbb{R}^{p}$ and $h:\mathbb{R}^{q}\mapsto\mathbb{R}$ is convex, continuous and increasing with $dom\,h=\mathbb{R}^{q}$ , then the convex conjugate of $h(g_{1},...,g_{q})$ is given by

[TABLE]

with $m\in(\mathbb{R}^{p})^{T}$ .

Proof of the theorem.

Let us define the function $F_{m}$ such that

[TABLE]

$F_{m}$ is differentiable on $\mathbb{R}^{p}$ and the first order equations

[TABLE]

are equal to the system (3.1) with $\overline{N}^{1}=m^{T}$ . Function $F_{m}$ is strictly concave as the sum of a concave function and a strictly concave function (see Lemma 3.2). Consequently, the solution $\gamma$ to $\nabla F_{\overline{N}^{1}}(\gamma)=0$ is unique.

We now introduce the convex conjugate of the function $f$ :

[TABLE]

We will prove that the three following sets are identical

[TABLE]

i) $A\subset B$ . If $m_{0}\in A$ there exists $\tilde{\beta}\in\mathbb{R}^{p}$ solution to (3.1), that is $\nabla F_{m_{0}}(\tilde{\beta})=0$ . Moreover $f^{*}(m_{0})=F_{m_{0}}(\tilde{\beta})$ because of the strict concavity of $F_{m_{0}}$ . Thus $m_{0}\in B$ .

ii) $B\subset C$ . We use Lemma 3.3 with $g_{i}(\tilde{\beta})=(\mathtt{I}\tilde{\beta})_{i}+b_{i}$ and $h$ the log-sum-exp function verifying the necessary conditions (Lemma 3.1). We have the convex conjugate $g_{i}^{*}(u_{i})=-b_{i}$ if $u_{i}=\mathtt{I}_{i}$ and $+\infty$ elsewhere (we do not consider the presence of a null row $\mathtt{I}_{i}=(0,...,0)$ ). The only way to obtain a finite result is to impose the constraint $u_{i}=\frac{m_{i}}{\alpha_{i}}=\mathtt{I}_{i}$ for all $i\in\{1,...,q\}$ . Therefore, knowing that

[TABLE]

we have

[TABLE]

We minimize a Kullback–Leibler divergence between two distributions under linear constraints. If one of the $\alpha_{i}$ is zero, $0g_{i}^{*}(\frac{m_{i}}{0})=\sigma_{dom\,g_{i}}(m_{i})=0$ if $m_{i}=0$ and $+\infty$ otherwise (see [14]), and the previous equalities remain true with $m_{i}=\alpha_{i}\mathtt{I}_{i}$ . The KKT conditions of this problem impose the constraint $\alpha_{i}>0$ for all $i\in\{1,...,q\}$ . Thus,

[TABLE]

This minimum exists: this is a linear restriction to a convex and continuous function in a simplex and therefore $B=dom\,f^{*}\subset C$ .

iii) $C\subset B\subset A$ . If $m_{0}\in C$ , then there exists $\lambda\in(\mathbb{R}^{*}_{+})^{q}$ such that $\sum_{i=1}^{q}\lambda_{i}=1$ and $\lambda^{T}\mathtt{I}=m_{0}$ so that

[TABLE]

If the supremum is reached, there is a maximizing element $\gamma\in\mathbb{R}^{p}$ and $m_{0}\in B$ ; this element is the solution to the system (3.1) and thus $m_{0}\in A$ . To establish this result, it is enough to have $-F_{m_{0}}$ coercive. Let $\epsilon\in\mathbb{R}^{p}\setminus\{0\}^{p}$ be an arbitrary vector and $\tilde{\beta}=x\epsilon$ with $x\in\mathbb{R}$ . Then,

[TABLE]

with $\omega=\mathtt{I}\epsilon\in\mathbb{R}^{q}$ . Notice that the vector $\omega$ cannot satisfy the relations $\omega_{1}=...=\omega_{q}$ because $I$ is of full rank. Thus, if $W=\max\limits_{i\in\{1,...,q\}}(\omega_{i})$ , we have

[TABLE]

and $\sum_{i}(\lambda_{i}\omega_{i})-W<0$ because $\sum\lambda_{i}=1$ , all the $\lambda_{i}$ are positive and $\omega$ is not constant. Therefore, with $\Omega=\{i\in\{1,...,q\}\,|\,\omega_{i}=W\}$

[TABLE]

which proves that the function $-F_{m_{0}}$ is coercive when $m_{0}\in C$ and completes the proof. ∎
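Strict concavity and coercivity suggest a direct numerical scheme for the limit equations. The sketch below is a damped Newton iteration under two assumptions of ours: that $F_{m}(\tilde{\beta})=m\tilde{\beta}-f(\tilde{\beta})$ with $f$ as in Lemma 3.1, and that system (3.1) states that the mean of the rows of $\mathtt{I}$ under the weights $\overline{n}^{0}_{i}e^{(\mathtt{I}\tilde{\beta})_{i}}$ (normalized) equals $\overline{N}^{1}$. The function name `solve_limit_equations` and the data are hypothetical; the target $m$ is built from a positive $\lambda$ so that the strong overlap condition holds by construction.

```python
import numpy as np

def solve_limit_equations(I, n0_bar, m, tol=1e-12, max_iter=100):
    """Minimise f(beta) - m.beta with f(beta) = log(sum_i n0_bar_i exp((I beta)_i))."""
    log_w0 = np.log(n0_bar)

    def objective(b):
        z = I @ b + log_w0
        return z.max() + np.log(np.exp(z - z.max()).sum()) - m @ b

    def tilted(b):
        z = I @ b + log_w0
        w = np.exp(z - z.max())
        return w / w.sum()

    beta = np.zeros(I.shape[1])
    for _ in range(max_iter):
        w = tilted(beta)
        mean = I.T @ w
        grad = mean - m                      # gradient of the convex objective
        if np.linalg.norm(grad) < tol:
            break
        cov = I.T @ (w[:, None] * I) - np.outer(mean, mean)   # Hessian
        step = np.linalg.solve(cov, grad)
        # Backtracking (Armijo) line search, with a small slack for rounding.
        lam, f_old, slope = 1.0, objective(beta), grad @ step
        while objective(beta - lam * step) > f_old - 1e-4 * lam * slope + 1e-14 and lam > 1e-10:
            lam *= 0.5
        beta = beta - lam * step
    return beta, tilted(beta)

# Hypothetical design without intercept column and without null row (q = 6, p = 3).
I = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1],
              [1, 1, 0], [0, 1, 1], [1, 1, 1]], dtype=float)
n0_bar = np.array([0.3, 0.2, 0.2, 0.1, 0.1, 0.1])      # class 0 distribution
lam_pos = np.array([0.1, 0.2, 0.1, 0.25, 0.15, 0.2])   # positive weights summing to 1
m = I.T @ lam_pos                                      # plays the role of N^1 bar

beta_tilde, w_star = solve_limit_equations(I, n0_bar, m)
residual = I.T @ w_star - m
```

The damping is a design choice of the sketch: a plain Newton step would probably suffice here, but the line search keeps the iteration safe when the starting point is far from the solution.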

The expression $f^{*}(m)$ in (3.4) is the minimization of a relative entropy between the class 0 distribution and a kind of ghost class 1 distribution (built on the $I=I_{0}$ design matrix). With the duality property, we can introduce a new likelihood. The following proposition leads to the same "limit" likelihood and justifies the use of the adjective "rescaled". Indeed:

Proposition 3.2.

The limit imbalanced equations arise from the following rescaled likelihood:

[TABLE]

Proof.

With the initial likelihood

[TABLE]

and the relation $\exp(\beta_{0})=\frac{|n^{1}|}{\sum_{i=1}^{q_{0}}n_{i}^{0}e^{(\mathtt{I}_{0}\tilde{\beta})_{i}}+\sum_{i=1}^{q_{1}}n_{i}^{1}e^{(\mathtt{I}_{1}\tilde{\beta})_{i}}}=\frac{|n^{1}|}{|n^{0}|}C(\frac{|n^{1}|}{|n^{0}|})$ (see 3.3), we obtain the following expression for the likelihood, using notation $x=\frac{|n^{1}|}{|n^{0}|}$ :

[TABLE]

We assume that $|n^{0}|$ is large enough to take the limit (with $|n^{1}|$ fixed) $"x\to 0"$ and to make the approximations

[TABLE]

and

[TABLE]

thus, using a rescaling term,

[TABLE]

as $\beta_{0}$ tends to minus infinity because $\frac{|n^{0}|}{|n^{1}|}\to+\infty$ . ∎

The reader can see an analogy in physics with the existence of different modeling scales. For example, the discrete microscopic N-body problem becomes the mesoscopic Boltzmann equation in the Boltzmann-Grad limit. See the book [25] for further information on hydrodynamic limits.

This new likelihood now makes it possible to consider a wide range of problems, related to variance reduction using simple prior penalties (Subsection 3.4) or regularization (Section 4).

3.3 The relative entropy dual problem

With a likelihood and an entropy, we benefit from two points of view in order to numerically estimate the regression coefficients. The classical approach using a Newton-Raphson algorithm associated to the likelihood can be challenged by other algorithms on the primal or dual problems as described in [20] and [30] for classical logistic regression. We present here the dual problem and its link with initial regression coefficients. We leave the numerical analysis to another study.

Proposition 3.3.

The regression coefficients of the limit imbalanced regression are given by the formulae

[TABLE]

where $n^{*}$ is the probability distribution solving a relative entropy problem with linear constraints

[TABLE]

$A=\sum_{i=1}^{q}n^{*}_{i}\log(\frac{n^{*}_{i}}{\overline{n}_{i}^{0}})$ and $P=\mathtt{I}-M$ with $P_{ij}=\mathtt{I}_{ij}-\overline{N}^{1}_{j}$ .

Proof.

With the existence of a unique solution (see Subsection 3.2), there exists a solution $n^{*}\in(\mathbb{R}^{*}_{+})^{q}$ such that $\mathtt{I}^{T}n^{*}=\overline{N}^{1}$ , $\sum_{i}n^{*}_{i}=1$ , and

[TABLE]

Then, using equations (3.1) we obtain

[TABLE]

Let $H$ in $\mathcal{M}_{q\times(q-p-1)}(\mathbb{R})$ be an orthogonal matrix to $I$ (the previous relation remains true with $I$ instead of $\mathtt{I}$ ) and $\gamma\in\mathbb{R}^{q-p-1}$ , such that we can remove $\mathtt{I}$ to obtain the relation

[TABLE]

hence,

[TABLE]

Summing all these relations with weights $n^{*}_{k}+(H\gamma)_{k}$ , using the fact that $\sum_{k=1}^{q}(H\gamma)_{k}=0$ , gives

[TABLE]

Due to convexity of the Kullback-Leibler divergence, we have a unique minimum obtained (by definition of $n^{*}$ ) at $\gamma=0$ . Therefore

[TABLE]

and the result is proved if $P$ is of full rank. Suppose that this is not the case. Then, there exists $\gamma\in\mathbb{R}^{p}\setminus\{0\}^{p}$ such that $P\gamma=(\mathtt{I}-M)\gamma=0$ , therefore $\mathtt{I}\gamma=C$ with $C$ a vector with identical components all equal to $\sum_{i=1}^{p}\overline{N}^{1}_{i}\gamma_{i}$ . Consequently, the matrix $I$ (that is $\mathtt{I}$ with the intercept column of ones) is no longer of full rank, which is, by definition of $I=I_{0}$ , impossible. ∎
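The primal-dual link can be illustrated without any solver by starting from arbitrary coefficients. The sketch below assumes that $n^{*}_{i}\propto\overline{n}^{0}_{i}e^{(\mathtt{I}\tilde{\beta})_{i}}$ and that the formulae of the proposition amount to solving $P\tilde{\beta}=\log(n^{*}/\overline{n}^{0})-A$ in the least-squares sense; both are our reading of the statement, and the data are hypothetical.

```python
import numpy as np

rows = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1],
                 [1, 1, 0], [0, 1, 1], [1, 1, 1]], dtype=float)
n0_bar = np.array([0.3, 0.2, 0.2, 0.1, 0.1, 0.1])
beta = np.array([0.4, -0.7, 0.2])            # arbitrary illustrative coefficients

z = rows @ beta
f_val = np.log(np.sum(n0_bar * np.exp(z)))   # f(beta)
n_star = n0_bar * np.exp(z - f_val)          # tilted distribution, sums to one
m = rows.T @ n_star                          # plays the role of N^1 bar

def kl(a, b):
    """Kullback-Leibler divergence between two positive distributions."""
    return np.sum(a * np.log(a / b))

A = kl(n_star, n0_bar)
primal = m @ beta - f_val                    # value of F_m at beta

# Recover beta from n_star through P = rows - M.
P = rows - m
beta_rec = np.linalg.lstsq(P, np.log(n_star / n0_bar) - A, rcond=None)[0]

# Feasible perturbation n_star + H gamma, H orthogonal to the columns of [1 | rows].
full = np.column_stack([np.ones(len(rows)), rows])
u, _, _ = np.linalg.svd(full)
H = u[:, full.shape[1]:]                     # q x (q - p - 1) left null-space basis
perturbed = n_star + H @ (0.01 * np.array([1.0, -0.5]))
```

Any feasible perturbation keeps the constraints satisfied but increases the divergence, which is the uniqueness argument of the proof.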

3.4 Priors for variance reduction and a priori information

The rare events structure of class imbalance goes hand in hand with the problem of precision for estimates. A classical solution consists in introducing an a priori distribution in a Bayesian context. This can be done using a Jeffreys non-informative prior [15] allowing both first order bias removal and variance shrinkage [7]. Thus, we have to maximize the expression

[TABLE]

with $|\mathcal{I}|$ the determinant of the Fisher information matrix. This approach is implemented in the R package logistf for logistic regressions. In the imbalanced case, we search for a method conserving the shape of the limit equations and achieving at the same time variance reduction: we choose the following approximation

[TABLE]

supposing an absence of correlation between predictors in a random design framework (see Section 4). With this hypothesis, we derive first order equations

[TABLE]

thus,

[TABLE]

In table 3, we simulate data sets as previously done with the length for $n^{0}=(n^{0}_{1},...,n^{0}_{10})^{T}$ fixed (to $10$ ) and we compare estimated bias and variance for coefficient $\beta_{4}=0.25$ with three different methods: a classical logistic regression (bias and sd.), the imbalanced case with means $\overline{N}^{1}_{J}$ (im. bias and im. sd.) and the Jeffreys exact penalty (J. bias and J. sd.).

Variance reduction is about 2 percent with the Jeffreys prior and about half as much with its easily computable approximation in class imbalance. Bias was already small and gets a little smaller. The shrinkage of the variance is limited by the Cramér-Rao bound (see Fisher variance in table 1) and no miraculous reduction was conceivable.

In the next section, we consider path following methods to complete regularization and highlight its "simplicity" with binary data. The initial parameters being the maximum a posteriori estimate (MAP), this estimation is a central problem of the limit imbalanced study. The benefit of the rescaled likelihood compared with the standard one is in the easy use of exponential a priori penalties. Indeed, with the penalty (footnote 4: P could be written as a probability distribution with a normalization term, the support of regression coefficients being finite)

[TABLE]

where $\epsilon\in\mathbb{R}^{p}$ , we maintain the shape of the likelihood by only perturbing the predictor means vector $\overline{N}^{1}$ by $\frac{\epsilon}{|n^{1}|}$ (the MAP exists if and only if $\overline{N}^{1}+\frac{\epsilon}{|n^{1}|}$ is surrounded by the rows of $I$ , see Theorem 3.2).

4 Path estimators for Lasso-type regularization

In this section, we consider that each observation $\mathtt{I}_{i}$ ( $i\in\{1,...,q\}$ ) is generated by a random binary vector $X_{i}^{T}=(X_{i1},...,X_{ip})^{T}\in\{0,1\}^{p}$ with $\mathbb{E}[X_{ij}]=b_{j}\in\,]0,1[$ , $j\in\{1,...,p\}$ . With this model, we find many path estimators depending on the underlying correlation structure of the random design.

4.1 Limit lasso properties

The well-known lasso regularization consists in introducing a positive parameter $\lambda$ defining the strength of a Laplace prior distribution [29]. We search for the maximum of the expression

[TABLE]

which verifies the following simple first order conditions. Notice that we use, from now on, the notation $\beta$ instead of $\tilde{\beta}$ to facilitate the reading.

Proposition 4.1.

The limit imbalanced BLR problem with lasso penalty leads to the system of equations

[TABLE]

with $t=\frac{\lambda}{|n^{1}|}$ and $\nu_{j}(\beta)={\rm sign}(\beta_{j})$ , if $\beta_{j}\neq 0$ , $\nu_{j}(\beta)\in[-1,1]$ , if $\beta_{j}=0$ for all $j\in\{1,...,p\}$ ( $\nu$ is the subgradient of the $l^{1}$ norm).

Thus, the lasso has a clear interpretation as a shift operating on the observed proportions $\overline{N}^{1}$ . Thereafter, we often use the vector $p(t)\in\mathbb{R}^{p}$ defined as $p(t)=\overline{N}^{1}-t\,\nu(\beta)$ .

Proposition 4.2.

If the strong overlap condition in Theorem 3.2 is satisfied, then the function

[TABLE]

is continuous for all $t\geq 0$ and there exists $T\in[0,1]$ , such that $\beta(t)=0\,,\,\forall t\geq T$ .

Proof.

With the positivity of $\cal L$ , we have,

[TABLE]

and for all $t\geq 0$ , $-\log({\cal L})$ is a strictly convex and coercive function in $\beta$ if the strong overlap condition is satisfied (see proof of Theorem 3.2). Therefore, the function $\hat{\beta}$ is well defined for all $t\geq 0$ . Furthermore, this function is continuous because of the continuity in $(\beta,t)$ of $\log(\cal L)$ and its strict concavity in $\beta$ . The equations (4.1) with $t=1$ have no solution if one of the components of $\nu(\beta)$ is equal to $-1$ or $+1$ , therefore $\hat{\beta}_{j}(1)=0$ for all $j\in\{1,...,p\}$ . ∎

Remark 4.1.

Using the law of large numbers, the family of model parameters $\{\beta(t)\}_{t\geq 0}$ solves the system of equations

[TABLE]

with $\mathbb{E}$ being the expectation operator. This previous system of equations takes the same form as in (4.1) because $X_{1}$ is a discrete random vector and therefore the path estimator $\{\beta(t)\}_{t\geq 0}$ is continuous. Notice that the function $\nu\circ\beta$ : $\mathbb{R}^{+}\to\mathbb{R}$ is also continuous in $t$ .

4.2 Path estimators

Thanks to this previous remark, we are able to find precise analytic estimators of the path in the case of independent and orthogonal random designs. Notice that such solutions already exist in the framework of linear regression (see [29]). From now on, the strong overlap condition is considered to be always satisfied at $t=0$ .

Theorem 4.1.

If the random vector $X$ generating the observations $\mathtt{I}$ has independent components, a precise path estimator $\{\hat{\beta}(t)\}_{t\geq 0}$ is given by the formulae

[TABLE]

The coefficients $\hat{\beta}(0)$ are given by the classical MLE (solution of equations (2.2) without intercept) if we want to estimate the path obtained by an (imbalanced) logistic regression. If we use the limit equations, we need the MLE of the rescaled likelihood (Proposition 3.2 and equations (3.1)) and in this case:

[TABLE]

Proof.

For all $j\in\{1,...,p\}$ we use the hypothesis of independence:

[TABLE]

and the solution is

[TABLE]

and $\beta_{j}(t)=0$ if $t>|\overline{N}_{j}^{1}-b_{j}|$ . Indeed, $\dot{\beta_{j}}$ is negative in region $\beta_{j}>0$ and positive in region $\beta_{j}<0$ . $\beta_{j}(0)=\log\left(\frac{\overline{N}_{j}^{1}}{1-\overline{N}_{j}^{1}}\frac{1-b_{j}}{b_{j}}\right)$ for a random design with independent predictors (see Appendix A.2 with $p=1$ ). We replace all the $b_{j}$ by the frequencies of observations $\overline{N}^{0}_{j}$ to obtain the estimator. ∎
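The independent case can be sketched in a few lines. We assume here that, coordinate by coordinate, the first order conditions (4.1) reduce to $\frac{b_{j}e^{\beta_{j}}}{1-b_{j}+b_{j}e^{\beta_{j}}}=\overline{N}^{1}_{j}-t\,{\rm sign}(\beta_{j})$, which gives $\beta_{j}(t)=\log\left(\frac{p_{j}(t)}{1-p_{j}(t)}\frac{1-b_{j}}{b_{j}}\right)$ up to the vanishing time $|\overline{N}_{j}^{1}-b_{j}|$. This closed form is our reconstruction from the proof; the function name and the numerical values are illustrative.

```python
import itertools
import numpy as np

def independent_lasso_path(N1, b, t):
    """Assumed closed-form lasso path for independent binary predictors."""
    N1 = np.asarray(N1, dtype=float)
    b = np.asarray(b, dtype=float)
    sign0 = np.sign(N1 - b)                      # sign of beta_j(0)
    active = t < np.abs(N1 - b)                  # coefficient not yet vanished
    p_t = np.where(active, N1 - t * sign0, b)    # shifted proportions p_j(t)
    beta = np.log(p_t / (1.0 - p_t) * (1.0 - b) / b)
    return np.where(active, beta, 0.0)

b = np.array([0.3, 0.5, 0.2])       # hypothetical class 0 frequencies
N1 = np.array([0.45, 0.42, 0.25])   # hypothetical class 1 proportions
t = 0.06
beta_t = independent_lasso_path(N1, b, t)

# Check the first order conditions on the full product design {0,1}^3.
rows = np.array(list(itertools.product([0, 1], repeat=3)), dtype=float)
weights = np.prod(np.where(rows == 1.0, b, 1.0 - b), axis=1)
tilted = weights * np.exp(rows @ beta_t)
tilted /= tilted.sum()
nu = (N1 - rows.T @ tilted) / t     # subgradient recovered from the equations
```

With these values the third coefficient has already vanished, and the recovered $\nu$ equals the sign of the active coefficients while staying in $[-1,1]$ for the inactive one.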

The orthogonal case, when the inner product between columns of the design matrix vanishes ( $X_{1j}X_{1k}=0$ , $j\neq k$ ), is also tractable.

Theorem 4.2.

If the random design is orthogonal, we have $\mathtt{I}\in\mathcal{M}_{(p+1)\times p}(\mathbb{B})$ filled by zeros except at positions $(i+1,i)$ , $i=1,...,p$ and the derivative of the path estimator takes the form

[TABLE]

with $\{S_{t}\}_{t\geq 0}$ a family of subsets of $\{1,...,p\}$ containing the indexes of non-zero coefficients of vector $\beta$ at time $t\geq 0$ . The algorithm that describes the positions of the change-points in $S_{t}$ is described in the proof.

Proof.

With the hypothesis of orthogonality, equations (4.2) reduce to

[TABLE]

and we obtain

[TABLE]

Let $\overline{S}_{t}=\{0,1,...,p\}\setminus S_{t}$ and $\overline{S}^{*}_{t}=\overline{S}_{t}\setminus\{0\}$ , then

[TABLE]

After computation, we have explicit formulae for the continuous functions $\beta_{i}$ and $\nu_{i}$ ( $i\in\{1,...,p\}$ ):

[TABLE]

with $b^{S}=\sum_{s\in\overline{S}_{t}}b_{s}$ , $\overline{N}^{S}=\sum_{s\in\overline{S}_{t}}\overline{N}^{1}_{s}$ and $R^{S}=\sum_{s\in S_{t}}{\rm sign}(\beta_{s})$ . These functions are monotonic; to draw the path we need the change-points, that is the finite sequence of different models $\{S_{t}\}_{t\geq 0}=\{S_{t_{0}},S_{t_{1}},...,S_{t_{m}}\}$ , $m\in\mathbb{N}^{*}$ . For all $i\in\{0,...,m-1\}$ , $\{S_{t}\}_{t\in[t_{i},t_{i+1}[}$ is a unique subset. If $t\in[t_{i},t_{i+1}[$ and we know $S_{t}$ we determine $S_{t_{i-1}}$ , $S_{t_{i+1}}$ , $t_{i}$ and $t_{i+1}$ by solving

[TABLE]

We define $W=\{w_{i}\}=\{u_{i},v_{j}^{+},v_{j}^{-}\,,\,i\in S_{t}\,,\,j\in\overline{S}^{*}_{t}\}$ and the two adjacent change-points are given by

[TABLE]

Therefore,

[TABLE]

with $U_{i}=\{j\in\{1,...,p\}\,|\,u_{j}=t_{i}\}\,,V_{i}=\{j\in\{1,...,p\}\,|\,v_{j}^{+}=t_{i}\,\,{\rm or}\,\,v_{j}^{-}=t_{i}\}$ .

The path can be built forward or backward. If we choose the path following approach (forward), $S_{t_{0}}$ is found using the MLE of the rescaled likelihood (see Section 3) and $t_{0}=0$ . In the other configuration (backward), we have $S_{t_{m}}=\emptyset$ and for $t>t_{m}$ , $b^{S}=\overline{N}^{S}=1$ and $R^{S}=0$ , so that $t_{m}=\max\limits_{i\in\{1,...,p\}}|\overline{N}^{1}_{i}-b_{i}|$ . ∎

Simulations with this type of design show that each path usually vanishes only once (and does not reappear) and thus $m>p$ is a very rare (impossible?) configuration.

The opposite situation to orthogonality is inclusion. For example, if $X_{12}$ is included in $X_{11}$ meaning that for the observed data $\mathtt{I}_{i1}=1$ if $\mathtt{I}_{i2}=1$ , we find an analytic description of the estimator given by the formulae

[TABLE]

This solution is likely generalizable (with a design in stairs as presented in Appendix A.4), however, this case is meaningless in the analysis of spontaneous reports databases and then left aside.

We give examples of plots of path estimates compared with a standard (using $L$ not $L^{*}$ ) lasso path for different imbalance strengths in appendix C. The results highlight the high quality of the analytic path estimators, even in absence of class imbalance.

Remark 4.2.

Another regularization method is called the elastic net penalization and uses, in addition to the lasso, a second penalized term of ridge (or Tikhonov) kind [31]:

[TABLE]

with $\alpha\in]0,1]$ . In the case of independence in random vector $X$ , we have an explicit formula for $t$ with respect to $\beta$ :

[TABLE]

for $\hat{\beta}$ between [math] and $\hat{\beta}(0)$ . The coefficients vanish when $t_{0}^{en}=\frac{1}{\alpha}t_{0}^{lasso}\,.$ The proof of this result is a simple adaptation of the proof for the lasso in Theorem 4.1.

4.3 Negative correlation structure

If the random design verifies the relations $\mathbb{E}[X_{1j}X_{1k}e^{X_{1}\beta}]\mathbb{E}[e^{X_{1}\beta}]\leq\mathbb{E}[X_{1j}e^{X_{1}\beta}]\mathbb{E}[X_{1k}e^{X_{1}\beta}]$ , $\forall j\neq k$ , $\forall t\geq 0$ , this in-between situation of a $\beta$ -dependent negative correlation between variables $X_{j}$ ( $j=1,...,p$ ) is also tractable and particularly interesting in the sparse context of near-zero components for vector $\overline{N}^{1}$ (footnote 5: spontaneous reports databases are an example of such sparsity with negative correlation). We find two estimators that surround the true path.

Theorem 4.3.

The path estimator in the $\beta$ -dependent negative correlation case is surrounded by estimators, whose derivatives are given by

[TABLE]

with $S_{t}^{+}=\{j\in S_{t}\,|\,{\rm sign}(\beta_{j})>0\}$ and $S_{t}^{-}=\{j\in S_{t}\,|\,{\rm sign}(\beta_{j})<0\}$ . With the rare occurrence of resurgence of a coefficient after vanishing, we neglect this possibility and we easily find the $p$ vanishing points and thus the family of subsets $\{S_{t}\}_{t\geq 0}$ .

Proof.

We differentiate equations (4.1) with respect to $t$ considering only the equations verifying the condition $\beta_{j}(t)\neq 0$ , i.e. $j\in S_{t}$ . We obtain at time $t$ ,

[TABLE]

or written differently,

[TABLE]

where $R_{jk}(t)=\frac{\sum_{i}\mathtt{I}_{ij}\mathtt{I}_{ik}\overline{n}^{0}_{i}e^{(\mathtt{I}\beta)_{i}}}{\sum_{i}\overline{n}^{0}_{i}e^{(\mathtt{I}\beta)_{i}}}$ is a t-dependent proportion of rows with a one on the columns $j$ and $k$ . With only negative correlations or independence between components of $X$ , we define the matrix $F(t)\in\mathcal{M}_{r\times r}([0,1])$ with $r=\#S_{t}$ as long as $R_{jk}(t)\leq p_{j}(t)p_{k}(t)$ ,

[TABLE]

if observations $\mathtt{I}$ give such a matrix $F$ . We obtain $(\mathbb{I}_{r}-(D-F)P)\dot{\beta}=P^{-1}\dot{P}(1)$ with $P$ a diagonal matrix filled with the elements $\{p_{i}(t)\,,\,i\in S_{t}\}$ . Matrix $D$ is the correlation-track matrix containing ones at positions $(j,k)$ if $F_{jk}(0)<1$ and we have (footnote 6: the non-singularity of the matrix $C(t)$ in (4.7), $C(t)\dot{\beta}=\dot{\overbrace{\log p(t)}}$ , will be proven with Proposition 5.1)

[TABLE]

so that, using the positivity of all the elements in matrix $D-F$ :

[TABLE]

with $P^{+}$ the diagonal matrix filled with vector $p^{+}(t)=(\max({\rm sign}(\beta_{i}),0)\,p_{i}(t))_{i\in S_{t}}$ and $P^{-}$ with vector $p^{-}(t)=(\max(-{\rm sign}(\beta_{i}),0)\,p_{i}(t))_{i\in S_{t}}$ . Finally,

[TABLE]

∎

In the presence of sparsity (small components in $\overline{N}^{1}$ ), $0<p_{j}(t)\ll 1-(Dp^{-}(t))_{j}(t)$ and $0<p_{j}(t)\ll 1-(Dp^{+}(t))_{j}(t)$ , which makes the previous upper and lower bounds good path estimators. The $p$ (or more) change-points are determined step by step as in the previous subsection and the estimated path $\beta(t)$ is bounded between a lower path and an upper path.

5 Efficient algorithms for Lasso regularization

In this last section, we propose two new algorithms drawing piecewise logarithmic approximate paths derived from a small number of matrix inversions ( $p$ or more). The logarithmic function naturally arose in the expression of all previously found path estimators; consequently, we build approximations involving this function. The main benefit of our algorithms is the direct computation of the sequence $\{t_{i}\}$ as done by the LARS [5] for linear regression. Our first algorithm follows the path ( $t$ increases) and is a simplified procedure adapted to data with a low correlation structure. The second algorithm is a backward procedure ( $t$ decreases toward zero) and can challenge the classic coordinate descent approach [9]. The efficiency of the algorithms is finally illustrated on pharmacovigilance data.

5.1 Cauchy problem

The derivative of the first order equations for the Lasso with respect to $t$ leads to a Cauchy problem.

Proposition 5.1.

The Lasso regularization path is described by the following system of differential equations

[TABLE]

with $C(t)\in\mathcal{M}_{r_{t}\times r_{t}}(\mathbb{R})$ ( $r_{t}=\#S_{t}$ ), $\beta\in\mathbb{R}^{r_{t}}$ , $\log p(t)\in\mathbb{R}^{r_{t}}$ and

[TABLE]

Proof.

Equations (4.7) are divided by vector $p(t)$ and we obtain the desired equations. It remains to prove the non-singularity of matrix $C(t)$ for all $t>0$ .

With diagonal matrix $P\in\mathcal{M}_{r_{t}\times r_{t}}(\mathbb{R})$ filled by elements $(p_{j})_{j\in S_{t}}$ we build a matrix $\tilde{C}=PC$ whose elements are:

[TABLE]

Suppose that this matrix $\tilde{C}(t)$ is singular, then there exists a non-identically null vector $\gamma\in\mathbb{R}^{r_{t}}$ such that $\tilde{C}(t)\gamma=0$ or written component-by-component

[TABLE]

We compute the linear combination $\sum_{j}\gamma_{j}(\tilde{C}\gamma)_{j}=0$ to obtain after computations

[TABLE]

with $J_{u}=\sum_{l}\mathtt{I}_{ul}\gamma_{l}$ . This relation is expanded and simplified into

[TABLE]

This is a sum of nonnegative terms equal to zero, meaning that each term vanishes and we get $J_{u}=const$ for all $u=1,...,n$ . Thus $\mathtt{I}\gamma=const$ , which is impossible because matrix $\mathtt{I}$ is a full rank matrix. ∎

5.2 The piecewise logarithmic approximate path : a first simple algorithm

Path following algorithms [23] compete with the more widely used coordinate descent algorithms [9] [10]. We present here a simple algorithm for an increasing regularization parameter $t$ . Within this procedure, we are able to estimate at each step the value of $t$ at which the next component of vector $\beta(t)$ vanishes, thus speeding up the classical Newton-Raphson step [23]. We assume that the correlation between predictors is "low", so that the re-emergence of a coefficient along the path after it has vanished is not taken into account (this case is handled by the second algorithm).

Proposition 5.2.

*The path following algorithm for limit imbalanced logistic regression by binary predictors (with low correlation) is the following:

$i=0$ , $t_{0}=0$ , $\beta(t_{0})=\beta(0)$ given. $S_{0}=\{j\,|\,\beta_{j}(0)\neq 0\,,\,j=1,...,p\}$ .

WHILE $r_{i}=\#S_{t_{i}}\neq 0$ DO*

[TABLE]

with $C_{i}\in\mathcal{M}_{r_{i}\times r_{i}}(\mathbb{R})$ such that

[TABLE]

The path, on the segment $[t_{i},t_{i+1}]$ , is given by

[TABLE]

$i$ * becomes $i+1$ .

END DO.*

Proof.

Equations (4.7) take the form $C(t)\dot{\beta}=\dot{\overbrace{\log\left(p(t)\right)}}$ with $C(t)$ called correction matrix.

[TABLE]

Between two consecutive vanishing times of regression coefficients along the path ( $t_{i}$ and $t_{i+1}$ ), we consider this matrix to be constant ( $C(t_{i})=C_{i}$ ). In this case,

[TABLE]

We have $t_{0}=0$ , but the sequence of values $\{t_{i}\}$ is unknown. However, we iteratively approximate them as follows. With

[TABLE]

because $|C_{i}^{-1}\left(\frac{{\rm sign}(\beta)}{p(t_{i})}\right)(t_{i+1}-t_{i})|$ is small for a relatively small step $t_{i+1}-t_{i}$ . We obtain the piecewise logarithmic path:

[TABLE]

with

[TABLE]

$\Delta T_{i}$ is the set of values for $t_{i+1}-t_{i}$ solving (5.2) with $\beta_{j}(t_{i+1})=0$ (for each $j\in S_{t_{i}}$ ). The set $U_{i}$ gives at each step the indexes of regression coefficients to remove from $S_{t_{i}}$ . ∎
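The step selection described in this proof can be sketched in Python. This is only an illustration under stated assumptions: on a segment, each active coefficient is taken to follow $\beta_j(t)=\beta_j(t_i)+\log(1+a_j(t-t_i))$, where the slope $a_j$ is a hypothetical stand-in for the entry built from $C_i^{-1}({\rm sign}(\beta)/p(t_i))$ (the exact expression and sign convention are those of (5.2)), and the next change-point is taken as the smallest positive candidate in $\Delta T_i$, which is our reading of the role of that set. The function names are ours.

```python
import numpy as np

def vanishing_steps(beta, slope):
    """Step lengths dt_j solving beta_j + log(1 + slope_j * dt_j) = 0.

    Assumed segment form (not taken verbatim from the paper):
    beta_j(t_i + dt) = beta_j(t_i) + log(1 + slope_j * dt).
    """
    beta = np.asarray(beta, dtype=float)
    slope = np.asarray(slope, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        return (np.exp(-beta) - 1.0) / slope

def next_change_point(candidates, tol=1e-12):
    """Return the smallest positive step and the indexes attaining it (the set U_i)."""
    c = np.asarray(candidates, dtype=float)
    # Non-positive, infinite or undefined candidates are discarded.
    positive = np.where(np.isfinite(c) & (c > tol), c, np.inf)
    dt = positive.min()
    if not np.isfinite(dt):
        raise ValueError("no coefficient vanishes for a positive step")
    return float(dt), np.flatnonzero(np.isclose(positive, dt))

beta_i = np.array([0.5, -0.3, 0.2])
slope_i = np.array([-1.0, 2.0, -0.5])  # illustrative values only
dt, removed = next_change_point(vanishing_steps(beta_i, slope_i))
```

With these illustrative values the second coefficient is the first to reach zero, so `removed` contains its index and the active set shrinks by one before the next matrix $C_{i+1}$ is formed.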

Other approximations could be performed, for example using a second order term in the previous approximation (5.2). Simulation tests show that our choice seems to give better results. We notice that the size of the matrix $C_{i}$ decreases during this procedure, speeding up the computation at each new step $t_{i}$ .

Remark 5.1.

This algorithm has two main computational advantages. First, the sequence $\{t_{i}\}_{i=1,...,p}$ is directly determined, whereas other algorithms use a regular discretization on a logarithmic scale (coordinate descent) or Newton-Raphson steps (path following). Second, the sum $\sum_{i}\overline{n}^{0}_{i}e^{(\mathtt{I}\beta)_{i}}$ does not appear in the $C_{i}$ matrices, which can greatly reduce the computational cost, especially if the matrix $\mathtt{I}$ is sparse ( $0.03\%$ of ones in the French spontaneous reports database): this algorithm handles sparsity!

To explore the efficiency of the algorithm, we simulate data sets with different correlation structures. Model selection is often performed with the BIC [26], which requires knowing the different models arising along the path. Hence, we decide to evaluate the algorithm accuracy using a simple indicator: a comparison of the sequence of coefficients in the order of vanishing along the path. The indicator is $p^{\prime}/p$ if a simulation with our algorithm gives $p^{\prime}$ coefficients at the same index as in the sequence obtained by a classical lasso algorithm (coordinate descent in R package glmnet). The correlation coefficient (from $r=0$ to $r=0.9$ ) means that we chose initial $R_{jk}=(1-r)b_{j}b_{k}+r\min(b_{j},b_{k})$ .
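The indicator $p^{\prime}/p$ can be written down in a few lines. The sketch below is a minimal Python illustration; the function name `order_agreement` and the example sequences are ours, and each sequence is assumed to list coefficient indexes in their order of vanishing along a path.

```python
def order_agreement(order_algo, order_ref):
    """Fraction p'/p of positions holding the same coefficient index in both sequences."""
    if len(order_algo) != len(order_ref):
        raise ValueError("both sequences must list the same number of coefficients")
    if len(order_ref) == 0:
        raise ValueError("empty sequences")
    matches = sum(a == b for a, b in zip(order_algo, order_ref))
    return matches / len(order_ref)

# Hypothetical vanishing orders for p = 10 coefficients.
reference = list(range(1, 11))
approx = [1, 2, 3, 5, 4, 6, 7, 8, 9, 10]  # two neighbouring coefficients swapped
score = order_agreement(approx, reference)
```

A single swap of two neighbouring coefficients among ten already lowers the score to $0.8$, so the indicator is strict: it penalizes small differences in the ordering of close change-points.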

We simulate $10^{3}$ paths for each number $nb$ and $r$ , $nb$ being the number of predictors in correlation. For each path, $\beta_{0}=-5$ and the $10$ regression coefficients $(\beta_{1},...,\beta_{10})$ are always the same and chosen on a regular scale between $-0.5$ and $0.5$ .
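The simulation design can be set up as in the following Python sketch. Several points are assumptions made for illustration: $b_j$ is read as the marginal frequency of predictor $j$ (its definition lies outside this section), the value $0.1$ used for it is arbitrary, the formula for $R_{jk}$ is applied only among the correlated predictors while the others keep the independent value $b_jb_k$, the diagonal is set to $b_j$, and the regular scale is taken to include both endpoints.

```python
import numpy as np

def initial_R(b, r, correlated):
    """Initial matrix R with R_jk = (1 - r) b_j b_k + r min(b_j, b_k) on correlated pairs."""
    b = np.asarray(b, dtype=float)
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must lie in [0, 1]")
    R = np.outer(b, b)  # independent pairs: the r = 0 case of the formula
    idx = np.asarray(correlated, dtype=int)
    block = np.ix_(idx, idx)
    R[block] = (1.0 - r) * np.outer(b[idx], b[idx]) + r * np.minimum.outer(b[idx], b[idx])
    np.fill_diagonal(R, b)  # assumption: R_jj equals the marginal frequency b_j
    return R

beta0 = -5.0
beta = np.linspace(-0.5, 0.5, 10)  # ten coefficients on a regular scale
b = np.full(10, 0.1)               # hypothetical marginal frequencies
R = initial_R(b, r=0.5, correlated=[0, 1, 2])  # nb = 3 correlated predictors
```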

With $nb=3,5$ or $8$ correlated predictors over the 10 used, the exact solution with assumption of independence (i) (see Theorem 4.1) deteriorates with the increase in correlation (r), which is (almost) not the case if we use our algorithm (a). Notice that, with a result around $0.8$ , the approximate path is often very close to the exact one: such a score results from the inversion of two close $t_{i}$ terms in the sequence (see (5.1)).

5.3 A new algorithm

The second algorithm presented in this section computes forward selection. It is more suitable for problems with a large number of predictors (when we are looking for a sparse model) and/or in the presence of a strong correlation structure.

The standard approach for computing the regularization path by decreasing $t$ with logistic regression consists in using a first order quadratic approximation of the first derivative of the likelihood between two consecutive close solutions (that is in practice, two parameters $t_{i}$ and $t_{i+1}$ such that $t_{i+1}-t_{i}<0$ is small). Using small steps for the parameter sequence $\{t_{i}\}$ to ensure a good approximation, the path is drawn by the cyclical coordinate method (see [9] and the R package glmnet). Our new algorithm is a kind of equivalent of the LARS algorithm for the logistic regression: we compute large steps in $t$ . Furthermore, in comparison with the cyclic coordinate descent algorithm, there is no loop at a fixed parameter $t$ . After presenting the algorithm, we challenge the glmnet package with our approach.

Proposition 5.3.

*The backward algorithm for limit imbalanced logistic regression by binary predictors is the following:

$i=0$ , $t_{0}=\max_{i}\{|\overline{N^{1}_{i}}-\overline{N^{0}_{i}}|\}=|\overline{N^{1}_{k}}-\overline{N^{0}_{k}}|$ , $\beta(t_{0})=(0,...,0)^{T}\in\mathbb{R}^{p}$ and $\epsilon>0$ given. $S_{t_{0}}=\{\beta_{k}\}$ .

WHILE ( $t_{i}>\epsilon$ or $\#S_{t_{i}}<p$ ) DO*

[TABLE]

with

[TABLE]

and

[TABLE]

Definitions for matrices $\Phi$ and $\Psi$ are given in the proof. Notice that $\Phi=\Phi(t_{i},\beta(t_{i}))$ (as for $\Psi$ ). The path, on the segment $[t_{i+1},t_{i}]$ , is given by

[TABLE]

and for the subgradients

[TABLE]

The new set $S_{t_{i+1}}$ is given by

[TABLE]

with

[TABLE]

$i$ * becomes $i+1$ .

END DO*

Proof.

We differentiate equations (4.1) for all $j$ in $\{1,...,p\}$ (see also (4.7)):

[TABLE]

or in matrix form with $C(t)\in\mathcal{M}_{r\times r}(\mathbb{R})$ , $D(t)\in\mathcal{M}_{(p-r)\times r}(\mathbb{R})$ , $r=\#S_{t}$ and vectors $p^{\neq}(t)=(p_{j}(t))_{j\in S_{t}}^{T}$ , $p^{=}(t)=(p_{j}(t))_{j\in\overline{S}_{t}}^{T}$ and $\beta^{\neq}(t)=(\beta_{j}(t))_{j\in S_{t}}^{T}$ we get

[TABLE]

$C(t)$ is a square non-singular matrix for all $t$ in $[0,t_{0}[$ (see Proposition 5.1). Between two consecutive values $t_{i}$ and $t_{i+1}$ ( $t_{i+1}<t_{i}$ ) of the $t$ sequence, we consider that $C(t)\approx C(t_{i})$ and $D(t)\approx D(t_{i})$ , thus

[TABLE]

with $E(t_{i})\in\mathcal{M}_{(p-r_{i})\times r_{i}}(\mathbb{R})$ and $r_{i}=\#S_{t_{i}}$ . The system of equations involving matrix $C^{-1}$ is solved as in the proof of Proposition 5.2 and we get

[TABLE]

with $\Phi_{ij}=\left(C_{i}^{-1}(\frac{{\rm sign}(\beta^{\neq})}{p^{\neq}(t_{i})})\right)_{j}$ . The second set of equations gives

[TABLE]

and using the usual approximation

[TABLE]

with $\Psi_{ij}=\left(E_{i}(\frac{{\rm sign}(\beta^{\neq})}{p^{\neq}(t_{i})})\right)_{j}$ and we find

[TABLE]

We solve $2r_{i}+(p-r_{i})=p+r_{i}$ equations ( $\nu_{j}(t_{i+1})=\pm 1$ , $j\in\overline{S}^{*}_{t_{i}}$ and $\beta_{j}(t_{i+1})=0$ , $j\in S_{t_{i}}$ ) to find the possible values for $t_{i+1}-t_{i}$ . The maximum of the negative values obtained among the $p+r_{i}$ results is used to build the $t$ sequence. ∎
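The last selection rule of the proof (keep the largest negative candidate for $t_{i+1}-t_{i}$) can be sketched as follows. This Python fragment only illustrates that rule: the candidate values are hypothetical, and the function name and the tolerance `tol` are ours. How the candidates are computed from $\Phi$ and $\Psi$ is left to the equations of the proposition.

```python
import numpy as np

def backward_step(candidates, tol=1e-12):
    """Return the largest negative candidate step and its position in the list."""
    c = np.asarray(candidates, dtype=float)
    # Non-negative, infinite or undefined candidates are discarded.
    negative = np.where(np.isfinite(c) & (c < -tol), c, -np.inf)
    k = int(np.argmax(negative))
    if not np.isfinite(negative[k]):
        raise ValueError("no negative candidate step")
    return float(negative[k]), k

steps = [-0.5, 0.2, -0.1, float("nan"), -0.3]  # hypothetical candidates
dt, k = backward_step(steps)
t_i = 1.0
t_next = t_i + dt  # t decreases along the backward procedure
```

The position `k` tells which equation was binding, that is, whether a coefficient leaves the active set or a subgradient reaches $\pm 1$ and a new coefficient enters.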

To visualize what is happening during the algorithm, we define linear functions $B^{=}_{j}:t\mapsto t\nu_{j}(t)$ and $B^{\neq}_{j}:t\mapsto e^{\beta_{j}(t)}-1+{\rm sign}(\beta_{j})t$ leading to the $p$ functions $B_{j}$ ( $j=1,...,p$ ) such that

[TABLE]

Functions $B_{j}$ are all piecewise linear and can be drawn in the plane shown in Figure 1.
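The piecewise linearity of $B^{\neq}_{j}$ can be checked directly. The worked step below assumes that, on a segment starting at $t_{i}$, the path has the logarithmic form $\beta_{j}(t)=\beta_{j}(t_{i})+\log(1+a_{j}(t-t_{i}))$, with $a_{j}$ a constant slope introduced here for illustration (it stands for the corresponding entry of $\Phi$, up to the sign convention of the proposition) and with ${\rm sign}(\beta_{j})$ constant on the segment.

```latex
\begin{aligned}
e^{\beta_j(t)} &= e^{\beta_j(t_i)}\bigl(1+a_j(t-t_i)\bigr),\\
B^{\neq}_j(t) &= e^{\beta_j(t)}-1+{\rm sign}(\beta_j)\,t\\
 &= \bigl(e^{\beta_j(t_i)}(1-a_j t_i)-1\bigr)
   +\bigl(a_j e^{\beta_j(t_i)}+{\rm sign}(\beta_j)\bigr)\,t .
\end{aligned}
```

Under this assumption $B^{\neq}_{j}$ is affine in $t$ on each segment, and its slope changes only at the change-points $t_{i}$.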

5.4 Path reconstruction with the French spontaneous reports database

We illustrate the efficiency of the limit path construction by piecewise logarithmic functions on the French spontaneous reports database. We look at two examples, a first one with no evidence of correlation and a second with strong correlations. The database contains about 330000 reports in 2016 and the imbalance is high or very high for all the adverse effects [2]. In the following graphs, the dotted lines represent results obtained by our algorithm, the solid ones result from the classical glmnet package.

Figure 3 shows features commonly encountered in other examples. The paths of the exponentials of the coefficients form a set of piecewise linear functions, and the algorithm remains efficient even if the number of predictors is high (150, for example). There seems to be no case of a path with a curve reappearing after a first cancellation (which would be due to a strong correlation between predictors with opposite signs of initial coefficients). Thus, the sets $A_{i}$ in the algorithm do not have to be determined.

We notice that the accuracy of this path following algorithm can easily be increased by adding intermediate steps (in variable $t$ ). Since the main computing limitation is the matrix inversion, one could study the inner product (Gram) matrix for class 0 and reorder rows and columns to reveal patterns and form a block diagonal matrix. These blocks could result from a statistical study of the Gram matrix (footnote 7: to that end, see the literature on the block clustering problem [11]), that is, finding the pairwise independent predictors, as well as from pharmacological assumptions (medical treatments also shape patterns). Thus, computational costs become a marginal problem and one can concentrate on the bias correction by adding priors related to temporal bias, under-reporting or the introduction of similarity modifying the $R$ matrix (footnote 8: with a similarity matrix $S\in\mathcal{M}_{p\times p}([0,1])$ , the similarity is defined as follows: coefficient $R_{jk}$ becomes $(1-S_{jk})R_{jk}+S_{jk}\min(p_{j},p_{k})$ ).
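The similarity correction stated in the footnote on the $R$ matrix is an entrywise interpolation and can be sketched in Python. The function name `apply_similarity` and the numerical values are ours; the vector `p` is assumed to hold the quantities $p_{j}$ appearing in $\min(p_{j},p_{k})$.

```python
import numpy as np

def apply_similarity(R, S, p):
    """Entrywise update R_jk -> (1 - S_jk) R_jk + S_jk min(p_j, p_k)."""
    R = np.asarray(R, dtype=float)
    S = np.asarray(S, dtype=float)
    p = np.asarray(p, dtype=float)
    if R.shape != S.shape or R.shape != (p.size, p.size):
        raise ValueError("R and S must be p x p matrices")
    if np.any(S < 0.0) or np.any(S > 1.0):
        raise ValueError("similarities must lie in [0, 1]")
    return (1.0 - S) * R + S * np.minimum.outer(p, p)

# Hypothetical example: three predictors, the first two declared fully similar.
p = np.array([0.2, 0.1, 0.05])
R = np.outer(p, p)
S = np.zeros((3, 3))
S[0, 1] = S[1, 0] = 1.0
R_sim = apply_similarity(R, S, p)
```

With $S_{jk}=0$ the entry is left unchanged, and with $S_{jk}=1$ it is replaced by $\min(p_{j},p_{k})$, so $S$ moves each pair between its current value and that bound.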

6 Conclusion and perspectives

The central novelty of this work is the introduction of a rescaled likelihood for the limit imbalanced logistic regression problem. The expression of this likelihood could have some connections with the well-known likelihoods of the self-controlled case series method [28] and of the proportional hazards model [4] used in epidemiology.

Most results presented for binary data can be extended to other data types. However, simulations have been done only with binary data, having in mind the underlying applied problem of pharmacovigilance. The new estimate is always very close to the initial MLE because data are located on the vertices of the hypercube and are therefore "close" to one another. A convergence study of all possible existing algorithms for the primal and dual problems could be performed with different class imbalances and an evaluation of the first order term.

The variance reduction is a central issue that has to be treated in a Bayesian framework. Whereas the prior to add in the standard logistic regression is unclear, the rescaled likelihood takes a well-adapted form for exponential priors. We considered model selection using the BIC and the lasso to answer this question. Due to binary data, the lasso regularization problem became easier to understand in our limit imbalanced case: we found many precise estimators. Piecewise logarithmic approximate paths are built by an effective path following procedure which determines step by step the vanishing time of each path, does not use any loops as in coordinate descent algorithms, and computes expressions involving only non-zero data. Moreover, this algorithm can take into account the correlation structure between predictors to further shrink computational costs. The values for $\overline{N}^{1}$ , $\overline{n}^{0}$ and for matrix $R$ could be shifted in order to incorporate absolute bias, temporal bias, under-reporting and similarity or correlation corrections.

7 A pharmacovigilance project?

Within this paper, we have had in mind the pharmacovigilance context, as this work was carried out in parallel with a one-year engineering job at the French National Institute of Health and Medical Research (footnote 9: B2PHI laboratory UMR 1181, INSERM, UVSQ, Institut Pasteur, Villejuif 94807, France). We hope this article could contribute a little to the development of mathematical tools for pharmacovigilance purposes. The science of drug safety at a postmarketing level is nearly non-existent in France as in many other countries: the reporting process of spontaneous reports is inadequate and resulting databases are badly processed with ill-suited tools. Public health scandals related to medication are steadily increasing and the spotlights are turned towards big pharmaceutical companies, while patient associations should first require public authorities to establish a modern drug safety structure. To that end, the statistical community has a major role to play by proposing trustworthy decision-support tools, opposing science to political and financial influences. Creating a useful tool was the guiding principle of the present work and the author hopes that other mathematicians will embrace the direction initiated by this article.

We would like to conclude by giving our opinion about the work that remains to be done to obtain an operational tool (in five points), hoping that it will inspire epidemiologists.

1) Building priors related to bias (temporal bias, under-reporting…) with the help of pharmacologists.

2) Developing the proposed regularization algorithms, evaluating their complexity and accuracy levels.

3) Introducing simple indicators to control the quality of the limit approximation.

4) Working on path visualization and new indicators (that are not thresholds).

5) Evaluating the obtained tool in the hands of pharmacologists (the use of reference sets is, to our mind, inadequate).

Acknowledgment

I would like to deeply thank Laetitia Comminges from the Paris-Dauphine University for relevant comments that greatly improved the manuscript. I also thank my colleague Mohammed Sedki from the INSERM laboratory of Villejuif for his constant encouragement to complete this work.

Appendix A Exact solutions

We give a collection of examples consisting of simple solutions of equation (2.3).

A.1 No intercept

If there is no intercept and no interaction between the regressors, the matrix $\mathrm{I}$ equals the identity matrix $\mathbb{I}_{p}$ and

[TABLE]

If one row contains other ones, the inverse matrix is the same matrix with the added ones replaced by their opposites.
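This property of the inverse can be checked numerically. If the matrix is $\mathbb{I}_{p}+e_{i}v^{T}$ with $v_{i}=0$, then $(e_{i}v^{T})^{2}=0$ and the inverse is $\mathbb{I}_{p}-e_{i}v^{T}$. The Python sketch below verifies this on a small example; the function name, the size and the positions of the added ones are arbitrary choices of ours.

```python
import numpy as np

def identity_with_extra_ones(p, row, cols):
    """Identity matrix of size p with additional ones in one row, off the diagonal."""
    if row in cols:
        raise ValueError("added ones must lie off the diagonal")
    M = np.eye(p)
    M[row, list(cols)] = 1.0
    return M

M = identity_with_extra_ones(4, row=1, cols=[0, 3])
M_inv = np.linalg.inv(M)

# Expected inverse: same matrix with the added ones turned into -1.
expected = np.eye(4)
expected[1, [0, 3]] = -1.0
```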

A.2 Intercept

If the square matrix $\mathrm{I}_{p+1}$ is the following

[TABLE]

for the inverse matrix, so that the $\beta$ coefficients take the form

[TABLE]

A.3 Intercept with one correlation

The first row of the following matrix

[TABLE]

defines the set $J=\{j\in\{0,...,p\}\,,\,\mathrm{I}_{1j}=1\}$ . The case $\#J=2$ is left out because it does not correspond to a non-singular matrix. The case $\#J=1$ corresponds to the previous example. The easiest way to solve this example is to look at initial equations (2.1). We write down the $p+1$ equations, where only the first one has a different form:

[TABLE]

and for $k\in\{1,...,p\}$ ,

[TABLE]

Subtracting all the $p$ equations from the first one, we obtain

[TABLE]

which can be used to simplify equations (A.1) into

[TABLE]

Finally, we have

[TABLE]

so that we deduce the following closed form for the $\beta$ coefficients

[TABLE]

Notice that the regression coefficients behave in a very unpredictable way. An example with $p=2$ and $\#J=3$ suffices to see this. The matrix $\mathrm{I}$ is

[TABLE]

and we have

[TABLE]

The first intuition is to think that coefficients $\beta_{1}$ and $\beta_{2}$ do depend on the couples $(n_{1}^{0},n_{1}^{1})$ and $(n_{2}^{0},n_{2}^{1})$ respectively, but it is not the case!

A.4 Stairs

With

[TABLE]

and then

[TABLE]

Appendix B Proof of Proposition 3.1.

The result is proved with a succession of Taylor expansions of degree $1$ or $2$ in $1/s$ . We use

[TABLE]

then with (2.2), we have

[TABLE]

The first equation of this system gives

[TABLE]

therefore, using notations introduced in the proposition,

[TABLE]

and

[TABLE]

The coefficient $s$ is defined as $s=\frac{|n^{0}|}{|n^{1}|}$ . Now we have to find the Taylor expansion of $e^{\beta_{0}}$ of degree two in $1/s$ and substitute it into (B.1). We have

[TABLE]

and

[TABLE]

Therefore

[TABLE]

The system of equations (B.1) without its first equation is

[TABLE]

and we use the previous expression for $e^{\beta_{0}}$ :

[TABLE]

or

[TABLE]

Then, using again a Taylor expansion,

[TABLE]

or

[TABLE]

Appendix C Path simulations

In the following graphs, the dotted lines are obtained by the exact path corresponding to Theorem 4.1 for examples 1 and 2, Theorem 4.2 for examples 3 and 4 and the inclusion case for examples 5 and 6. The solid lines are always given by a coordinate descent algorithm for the standard logistic regression. We change the scale for $t$ (by a linear rescaling) in order to have the same $\max(t)$ (see (5.1)) for the exact and algorithmic paths.

Bibliography


[1] Ahmed Ismail, Pariente Antoine, Tubert-Bitter Pascale (2016) Class-imbalanced subsampling lasso algorithm for discovering adverse drug reactions. Statistical Methods in Medical Research.
[2] Beziz et al. (2016) Spontaneous adverse drug reaction reporting in France: A retrospective analysis of reports made to the French medicines agency from 2002 to 2014. Revue d’Épidémiologie et de Santé Publique, 64.
[3] Caster et al. (2010) Large-Scale Regression-Based Pattern Discovery: The Example of Screening the WHO Global Drug Safety Database. Stat. Anal. Data Min., 3, no. 4, 197–208.
[4] Cox David Roxbee (1975) Partial likelihood. Biometrika, 62, no. 2, 269–276.
[5] Efron Bradley, Hastie Trevor, Johnstone Iain, Tibshirani Robert (2004) Least angle regression. Annals of Statistics, 32, no. 2, 407–499.
[6] Elrahman Shaza, Abraham Ajith (2013) A Review of Class Imbalance Problem. Journal of Network and Innovative Computing, 1, 332–340.
[7] Firth David (1993) Bias reduction of maximum likelihood estimates. Biometrika, 80, no. 1, 27–38.
[8] Fithian William, Hastie Trevor (2014) Local case-control sampling: efficient subsampling in imbalanced data sets. Ann. Statist., 42, no. 5, 1693–1724.
