On the Limit Imbalanced Logistic Regression by Binary Predictors
Vincent Runge

TL;DR
This paper proposes a rescaled likelihood approach for imbalanced logistic regression with binary predictors, facilitating regularization and interpretation, especially useful in pharmacovigilance data analysis.
Contribution
It introduces a novel rescaled likelihood that simplifies regularization and interpretation in imbalanced logistic regression with binary predictors.
Findings
Convergence of maximum likelihood estimates under class imbalance with strong overlap conditions.
Analytic solutions for lasso regularization paths in binary predictor models.
An efficient approximate path algorithm based on matrix inversions.
Abstract
In this work, we introduce a modified (rescaled) likelihood for imbalanced logistic regression. This new approach makes it easier to use exponential priors and to compute the lasso regularization path. Precisely, we study a limiting behavior in which class imbalance is artificially increased by replication of the majority class observations. If some strong overlap conditions are satisfied, the maximum likelihood estimate converges towards a finite value close to the initial one (intercept excluded), as shown by simulations with binary predictors. This solution corresponds to the extremum of a concave function that we refer to as the "rescaled" likelihood. In this context, the use of exponential priors has a clear interpretation as a shift on the predictor means for the minority class. Thanks to the simple binary structure, some random designs give analytic path estimators for the lasso…

Table 1

| intercept | -7 | -6 | -5 | -4 | -3 | -2 | -1 | 0 |
|---|---|---|---|---|---|---|---|---|
| | 1052 | 385 | 142 | 52 | 19 | 7.1 | 2.7 | 1.0 |
| sd. | . | . | . | . | 0.768 | 0.467 | 0.333 | 0.286 |
| F.sd. | . | . | . | . | 0.762 | 0.464 | 0.334 | 0.293 |
| bias | . | . | . | . | 0.019 | 0.0037 | 0.0065 | -0.0021 |
| sd. | . | . | 0.633 | 0.369 | 0.224 | 0.145 | 0.105 | 0.0931 |
| F.sd. | . | . | 0.618 | 0.361 | 0.221 | 0.145 | 0.104 | 0.0918 |
| bias | . | . | 0.019 | 9.4e-3 | -2.4e-4 | 1.e-3 | -3.0e-4 | -2.9e-4 |
| sd. | 0.528 | 0.303 | 0.187 | 0.110 | 0.0690 | 0.0450 | 0.0328 | 0.0300 |
| F.sd. | 0.512 | 0.301 | 0.183 | 0.111 | 0.0685 | 0.0451 | 0.0327 | 0.0291 |
| bias | 0.020 | 5.4e-4 | 3.4e-3 | -3.2e-5 | -5.6e-5 | 4.5e-4 | -2.8e-5 | 1.9e-4 |

Table 2

| intercept | -5 | -4 | -3 | -2 | -1 | 0 |
|---|---|---|---|---|---|---|
| | 141 | 51 | 19 | 7.0 | 2.6 | 1.0 |
| sd. | 0.3614 | 0.2154 | 0.1331 | 0.08765 | 0.06384 | 0.05730 |
| sd. imb. | 0.3614 | 0.2154 | 0.1331 | 0.08766 | 0.06402 | 0.05757 |
| bias | 2.533e-3 | 7.572e-4 | -9.553e-4 | 1.407e-3 | 2.797e-4 | 4.954e-4 |
| bias imb. | 2.532e-3 | 7.527e-4 | -9.433e-4 | 1.439e-3 | 3.261e-4 | 5.073e-4 |
| | 4.702e-4 | 6.849e-4 | 1.089e-3 | 1.762e-3 | 2.869e-3 | 4.806e-3 |
| sd. | 0.2643 | 0.1583 | 0.09879 | 0.06468 | 0.04839 | 0.04382 |
| sd. imb. | 0.2643 | 0.1583 | 0.09879 | 0.06474 | 0.04860 | 0.04451 |
| bias | 2.148e-3 | 1.956e-3 | -1.583e-5 | 7.601e-4 | 2.461e-4 | -6.280e-4 |
| bias imb. | 2.130e-3 | 1.967e-3 | -8.450e-6 | 7.752e-4 | 3.000e-4 | -5.337e-4 |
| | 5.950e-4 | 8.864e-4 | 1.399e-3 | 2.281e-3 | 3.698e-3 | 5.952e-3 |
| sd. | 0.2438 | 0.1470 | 0.09346 | 0.06112 | 0.04585 | 0.04090 |
| sd. imb. | 0.2438 | 0.1471 | 0.09349 | 0.06117 | 0.04628 | 0.04171 |
| bias | 4.385e-3 | 1.034e-3 | 8.025e-4 | 3.584e-4 | 1.259e-3 | -1.905e-4 |
| bias imb. | 4.383e-3 | 1.048e-3 | 7.894e-4 | 3.537e-4 | 1.313e-3 | -4.297e-5 |
| | 6.220e-4 | 9.239e-4 | 1.472e-3 | 2.341e-3 | 3.920e-3 | 6.306e-3 |

Table 3

| intercept | -5 | -4 | -3 | -2 | -1 | 0 |
|---|---|---|---|---|---|---|
| | 143 | 52 | 19 | 7.1 | 2.7 | 1.0 |
| sd. | 0.3720 | 0.2173 | 0.1328 | 0.08835 | 0.06388 | 0.05645 |
| im. sd. | 0.3672 | 0.2165 | 0.1326 | 0.08837 | 0.06406 | 0.05677 |
| J. sd. | 0.3575 | 0.2141 | 0.1322 | 0.08814 | 0.06382 | 0.05641 |
| bias | 6.367e-3 | 4.247e-3 | -8.032e-4 | 1.281e-3 | -2.822e-4 | 3.048e-4 |
| im. bias | 3.284e-3 | 3.025e-3 | -1.176e-3 | 1.159e-3 | -3.229e-4 | 2.987e-4 |
| J. bias | -1.631e-3 | 1.419e-3 | -1.858e-3 | 8.354e-4 | -5.028e-4 | 1.345e-4 |

| nb/r | 0 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 3 (i) | 0.977 | 0.909 | 0.862 | 0.809 | 0.740 | 0.706 | 0.678 | 0.642 | 0.620 | 0.564 |
| 3 (a) | 0.777 | 0.710 | 0.736 | 0.730 | 0.735 | 0.720 | 0.727 | 0.726 | 0.725 | 0.755 |
| 5 (i) | 0.970 | 0.859 | 0.771 | 0.671 | 0.613 | 0.552 | 0.513 | 0.488 | 0.410 | 0.361 |
| 5 (a) | 0.785 | 0.752 | 0.725 | 0.715 | 0.700 | 0.760 | 0.708 | 0.693 | 0.699 | 0.705 |
| 8 (i) | 0.969 | 0.868 | 0.750 | 0.609 | 0.550 | 0.486 | 0.433 | 0.363 | 0.341 | 0.309 |
| 8 (a) | 0.773 | 0.742 | 0.717 | 0.756 | 0.722 | 0.698 | 0.714 | 0.675 | 0.684 | 0.622 |
The Limit Imbalanced Logistic Regression by Binary Predictors and its fast Lasso computation
Vincent Runge
LaMME - Laboratoire de Mathématiques et Modélisation d’Evry.
UEVE - Université d’Evry-Val-d’Essonne.
Abstract
In this work, we introduce a modified (rescaled) likelihood for imbalanced logistic regression. This new approach makes it easier to use exponential priors and to compute the lasso regularization path. Precisely, we study a limiting behavior in which class imbalance is artificially increased by replication of the majority class observations. If some strong overlap conditions are satisfied, the maximum likelihood estimate converges towards a finite value close to the initial one (intercept excluded), as shown by simulations with binary predictors. This solution corresponds to the extremum of a strictly concave function that we refer to as the "rescaled" likelihood. In this context, the use of exponential priors has a clear interpretation as a shift on the predictor means for the minority class. Thanks to the simple binary structure, some random designs give analytic path estimators for the lasso regularization problem. An effective approximate path algorithm by piecewise logarithmic functions based on matrix inversions is also presented. This work was motivated by its potential application to spontaneous reports databases in a pharmacovigilance context.
Keywords: path estimator, pharmacovigilance model, piecewise logarithmic approximate path, limit class imbalance, rescaled likelihood, spontaneous reports database, square exact solution.
MS classification : Primary 62J12, 62F12, 62F15; secondary 34E05, 49M29, 62P10.
1 Introduction
If the response 1 is very rare compared with the response 0, we are in the presence of a rare event configuration, also called class imbalance. This problem recently attracted computer scientists’ attention: they aimed at reducing computational costs by bypassing the class imbalance with resampling methods [12] [21] [6]. With these methods, the variance in estimating model parameters increases. Statisticians are aware of this problem, and complex procedures such as local case-control sampling have been proposed [8] (a method initiated in epidemiology [19]).
In a recent work (2007) by Art B. Owen [22], the opposite approach is considered: the class imbalance is infinitely increased in order to reach the theoretical distribution of the majority class observations. Owen proved that under some overlap conditions the model parameters are finite (apart from the intercept) and built a limit system of equations related to exponential tilting, whose solution is the new estimate. The resulting equations include the distribution of the infinite class expressed through integrals, which are not easy to infer. This may explain why this work has been largely overlooked (the author found it when Sections 2 and 3 were already completed).
In our approach, the observations of the majority class are infinitely replicated and Owen’s limit distribution becomes the observed distribution. This situation is a kind of degenerate case between resampling (we repeat observations) and infinite class imbalance (the observed distribution is chosen as the theoretical one). Unlike Owen’s result, our limit normal equations can be interpreted as the first order conditions of a new likelihood.
The idea of this work comes from the analysis of highly imbalanced binary spontaneous reports databases. Such databases are gathered by many countries and institutions (FDA, MHRA, WHO,…). Imbalanced logistic regression with binary predictors gives a maximum likelihood estimate (MLE) very close to its limit imbalanced counterpart. This result makes it possible to study lasso-type regularization problems and to develop effective algorithms for model selection.
So far, only disproportionality methods are routinely used [18] for spontaneous report databases: predictors are analysed one by one, leading to a great number of false positive signals [13]. Mathematical tools adapted to binary data for regression are surprisingly underdeveloped (only boolean matrices have been studied by some authors [16]). This has resulted in an inflation of empirical methods using lasso regularization in recent years (from [3] to [1]). This is a worrying trend because the recommendations made by these experts shift towards more complicated experimental methods and time-consuming algorithms, not towards a deeper mathematical understanding. This work is motivated by the need to better analyse this kind of applied problem.
The paper contains three main sections in which we present the following results:
- In Section 2, we investigate the properties of the logistic normal equations with binary predictors. Simple existence and uniqueness conditions of Silvapulle’s type are found and some exact solutions presented. An invariance property in the presence of an intercept links this particular solution (called "square solution") to the limit imbalanced problem. We then examine the issue of variance inflation in the imbalanced problem by computing the Fisher information.
- In Section 3, we derive Owen-type equations with a first order term evaluating the convergence rate. For the limit system of equations, the existence and uniqueness of the solution is proved with a new method leading to the minimization of a Kullback-Leibler divergence under linear constraints. A rescaling procedure on the initial likelihood and the previously found divergence justify the introduction of a rescaled likelihood corresponding to our limit imbalanced logistic regression problem. In a Bayesian framework, the Jeffreys penalty does not significantly decrease the variance of the estimator, but other more appropriate priors, such as exponential ones, could help to reduce it (chosen according to the situation). The closeness in simulation between limit estimates and classical estimates encourages us to go one step further with the study of regularization paths, in particular if the model is known to be sparse.
- In Section 4, we look at a lasso regularization problem for the rescaled likelihood, which has a clear interpretation as a shift on the predictor means for the class of interest. We succeed in finding some path estimators in a few particular cases (independence and orthogonal design). In the presence of correlation, we present an effective path following algorithm by piecewise logarithmic functions giving precise estimates. We conclude by explaining the need for an analysis of the correlation structure between predictors. This leads to simple algorithmic procedures with small computational costs for which many different prior penalties could be easily tested. Two examples are given using the French spontaneous reports database.
The expressions "infinitely imbalance" and "limit imbalance" are considered synonymous, although we recommend the second one in our context because of the simple unique limit we impose and an analogy with hydrodynamic limits (in fluid dynamics), while the first expression is related to the underlying distribution introduced by Owen.
We conclude this article by discussing the many opportunities that arise with the introduction of a rescaled likelihood in a Bayesian context and of the path following algorithm by logarithmic functions.
2 The logistic regression by binary predictors
2.1 Logistic normal equations
The binary logistic regression (BLR) problem consists in the determination of coefficients maximizing a smooth and concave likelihood function given by the relation
[TABLE]
where is indexed from zero with corresponding to the intercept. Binary design matrices and with are of full rank: they aggregate the binary predictors. Vectors of weights and save repetitions for distinct observations in response classes 0 and 1 separately. The binary structure favours repetitions in the sequence of observations, which justifies these notations. Moreover is the i-th component of vector (the same for ).
We introduce other notations thereafter used within this article. The modulus of a vector denotes its norm, while the overline sign on lower cases stands for normalization. For example and gives the vector . is the i-th row of the matrix and its roman upper case equivalent is the matrix in which the first column filled by ones (associated to the intercept) was removed. We also need with standing for the matrix transpose operator. An important feature in our study is the predictor means vector for class obtained by the relation . For vectors of same size , (resp. ) is the vector with components (resp. ), . is the vector without the intercept coefficient . From Subsection 3.2, the notations and for matrices and respectively are often used (as well as for integer ).
For ease of calculation, we consider the opposite of the log-likelihood. If , we have and we can introduce vectors and . In this latter case, we write
[TABLE]
and first order conditions are computed, differentiating with respect to each coefficient. We obtain
[TABLE]
or in matrix form
[TABLE]
In a general framework with non-identical matrices and , we likewise derive
[TABLE]
This system of equations (2.2) gathers the so-called logistic normal equations and will be widely used within this article.
Remark 2.1.
These equations are usually presented with a logistic function but we chose another expression to highlight the link with existence and uniqueness conditions.
2.2 Existence and uniqueness
Necessary and sufficient conditions to ensure existence and uniqueness of the MLE are well known: they were established by Silvapulle in 1981 [27]. They consist in satisfying an overlap condition between the cones
[TABLE]
For the BLR problem, a more convenient description is possible:
Theorem 2.1.
The BLR problem admits a unique solution if and only if there exist and , such that .
Looking at equations (2.2), this theorem means that a MLE exists and is unique if one can find a couple of observations of the rows in and in such that vanishing all the regression coefficients (intercept included). An easy necessary condition to check is that at least one 0 and one 1 are present in each column of and (with the exception of the first column of ones corresponding to the intercept).
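The easy necessary condition above can be checked mechanically. The following is a minimal sketch, assuming each design matrix is stored as a list of binary rows whose first entry is the intercept, and reading the condition as applying to each of the two matrices separately; the function name and the small matrices `X0` and `X1` are hypothetical illustrations, not taken from the paper.

```python
def columns_mix_zeros_and_ones(rows):
    """Return True if every non-intercept column holds at least one 0 and one 1.

    `rows` is a list of binary rows; column 0 is the intercept (all ones).
    """
    n_cols = len(rows[0])
    for j in range(1, n_cols):  # skip the intercept column
        if {row[j] for row in rows} != {0, 1}:
            return False
    return True


# Hypothetical design matrices for the two response classes.
X0 = [[1, 0, 1], [1, 1, 0], [1, 1, 1]]
X1 = [[1, 0, 0], [1, 1, 1]]
ok = columns_mix_zeros_and_ones(X0) and columns_mix_zeros_and_ones(X1)
```

This check is only necessary, not sufficient: a design can pass it and still fail the overlap condition of the theorem.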
Proof.
If , Silvapulle’s condition is immediately verified. Conversely, is an open subset of with positive measure because and are full rank matrices. By a density argument, there exist , and satisfying . We reorder the rows in and such that the first rows are linearly independent. Let in and in be orthogonal matrices to and respectively. Because of the reorganization of the rows in () we can choose a where its last rows form an identity matrix . For all and we have the relation . Again with a density argument, we find such that for all and satisfying the constraint . For a matrix , a vector and , let denote the vector , where (resp. ) corresponds to the submatrix of (resp. subvector of ) obtained by removing from (resp. from ) the columns (resp. rows) that do not correspond to the indices in . With this notation, we have . The binary matrix is then nonsingular and using its inverse in we obtain . Finally . The same arguments lead to a set of coefficients . Multiplying the vector by the least common multiple of all its denominators proves the result. ∎
2.3 The square case
The situation with identical square design matrices and is worthwhile in itself because it leads to explicit analytic formulae for the MLE and their variance (in the asymptotic case). In particular, we focus on the introduction of imbalance between and to emphasize the simple solution for MLE and the problem of variance inflation.
Theorem 2.2.
If is a square matrix, we have the following closed form for the maximum likelihood estimator:
[TABLE]
Proof.
The matrix satisfies the condition and is nonsingular with its inverse (because is of full rank). The vector is defined as i.e. . Multiplying (2.1) by , we get . Hence, , which completes the proof. ∎
Remark 2.2.
If one of the components in the vectors of weights or vanishes, some of the regression coefficients become infinite (but not necessarily all of them).
To our knowledge, this is the first general closed form found for the solution of a logistic regression. There exist partial results for a unique categorical predictor presented by Lipovetsky in 2014 [17]. An explanation for the lack of such a simple result lies in the poorly studied finite observation structure made possible by binary predictors with repetitions. In Appendix A, some particular solutions to equations (2.3) are presented.
2.3.1 Invariance if intercept
We establish an invariance property making a link with the imbalanced problem.
Proposition 2.1.
In the square case with intercept, multiplying all the components of or by a same integer does not change the value of the MLE apart from the intercept.
Proof.
The inverse of a matrix with an intercept term verifies the relation
[TABLE]
which means that we can rewrite equations (2.3) as
[TABLE]
Substituting by (or by ) with gives the same result for all . ∎
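The closed form of Theorem 2.2 and the invariance of Proposition 2.1 can be illustrated numerically. The sketch below does not reproduce formula (2.3) itself; it relies on the standard fact that a square, full-rank design gives a saturated model, so the fitted log-odds of each row equal the observed log-odds and the coefficients follow from one linear solve. The names `X`, `w0`, `w1` and the numbers are hypothetical, and `solve` is a small helper written for this example.

```python
import math


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [[float(v) for v in A[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def square_mle(X, w0, w1):
    """Saturated-model MLE: x_i . beta = log(w1_i / w0_i) for every row i."""
    log_odds = [math.log(b / a) for a, b in zip(w0, w1)]
    return solve(X, log_odds)


X = [[1, 0, 0], [1, 1, 0], [1, 1, 1]]  # first column is the intercept
w0 = [40, 25, 10]                      # counts of response 0 for each row
w1 = [5, 8, 6]                         # counts of response 1 for each row

beta = square_mle(X, w0, w1)
# Replicate the majority class 100 times: only the intercept should move.
beta_imbalanced = square_mle(X, [100 * a for a in w0], w1)
```

Multiplying every class 0 weight by 100 subtracts log(100) from each observed log-odds; because the first column of `X` is a column of ones, this constant shift is absorbed entirely by the intercept.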
2.3.2 Asymptotic variance
To conclude this section, we study the asymptotic behavior of the estimator for large and . Since the MLE (intercept excluded) remains the same with or without a class imbalance (see the invariance property), we have a glimpse of a general property in class imbalance.
Proposition 2.2.
In the square case BLR problem, the variance of the maximum likelihood estimator is approximately given by relations
[TABLE]
Proof.
We compute the observed Fisher information with a diagonal matrix with elements and . Its inverse gives the desired result, knowing that and . ∎
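The computation in this proof can be written out explicitly for the saturated square case. The following is a sketch in assumed notation ($X$ for the square design matrix, $n_{0,i}$ and $n_{1,i}$ for the weights of row $i$ in classes 0 and 1), which may differ from the notation of Proposition 2.2; it uses only the standard Fisher information of logistic regression evaluated at the fitted proportions.

```latex
\begin{aligned}
\hat p_i &= \frac{n_{1,i}}{n_{0,i}+n_{1,i}}, \qquad
D = \operatorname{diag}\bigl((n_{0,i}+n_{1,i})\,\hat p_i(1-\hat p_i)\bigr)
  = \operatorname{diag}\!\left(\frac{n_{0,i}\,n_{1,i}}{n_{0,i}+n_{1,i}}\right),\\
I(\hat\beta) &= X^{\mathsf T} D X, \qquad
I(\hat\beta)^{-1} = X^{-1} D^{-1} X^{-\mathsf T}, \qquad
D^{-1} = \operatorname{diag}\!\left(\frac{1}{n_{0,i}}+\frac{1}{n_{1,i}}\right),\\
\operatorname{Var}(\hat\beta_j) &\approx \sum_i \bigl[(X^{-1})_{ji}\bigr]^2
  \left(\frac{1}{n_{0,i}}+\frac{1}{n_{1,i}}\right).
\end{aligned}
```

In this form, the term $1/n_{1,i}$ dominates when class 1 is rare, which is one way to read the variance inflation under class imbalance.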
Remark 2.3.
Another method uses the closed form (2.3) to perform variance and bias estimations by Taylor expansions with the multinomial random vector . We obtain and . However, simulations give inaccurate results and only the Fisher information method should be retained.
We investigate the variation of the variance with respect to the sample size and the value of the intercept for a simple fixed model . With these two parameters given, we simulate data sets with a different random binary square matrix and different random vectors and for each of them (but is fixed). In table 1, we compare the estimated standard deviation (sd.) with the Fisher standard deviation given in Proposition 2.2 (F.sd.) accompanied by an estimation of the bias (bias) for coefficient .
These simulations highlight the accuracy of the "Fisher variance" in all configurations, which is very close to the estimated one. Bias is negligible compared with variance. For a constant number of observations , the variance increases when the imbalance between classes strengthens. This variance inflation is a key issue in class imbalance; we explain later how one can easily add prior information to a rescaled likelihood to deal with this problem (see Subsection 3.4).
3 Limit imbalanced study
3.1 Owen-type equations
The limit case consists in infinitely replicating the majority class observations, as if the theoretical distribution of this class were the observed one. This is a degenerate case of Owen’s study, which is why we know that the intercept coefficient tends to minus infinity whereas the other regression coefficients are finite if a stronger overlap condition is satisfied [22]. For the limit equations, an information reduction for the minority class occurs: only the means of the predictors matter, and the correlation structure in this class of interest "disappears".
The following proposition presents the logistic normal equations (2.2) in a new form with a remainder term arising in case of class imbalance.
Proposition 3.1.
For an imbalanced binary logistic regression with a class size for response ’’ times greater than the one for response , we obtain the system of equations
[TABLE]
with . We used notations:
[TABLE]
and for vectors in :
[TABLE]
The technical proof of this result is exposed in Appendix B.
As shown by simulations (see table 2), the first order and remainder terms are negligible quantities with binary predictors, even if there is no imbalance! This suggests the introduction of the following limit imbalanced equations, obtained with in Proposition 3.1.
Theorem 3.1.
For infinitely imbalanced binary logistic regression verifying a strong overlap condition (see Theorem 3.2), the following system of limit imbalanced equations holds. (Footnote 2: With non-binary design matrices and and no vectors of weights, we obtain These equations also differ from Owen’s [22].)
[TABLE]
Notice that the coefficients do not depend on the structure in rows of the design matrix associated to response but only on the means of ones for each predictor: .
We give a direct simple proof, avoiding the complicated previous proof of Appendix B.
Proof.
For near minus infinity, the hyperbolic tangent has the following first order expansion:
[TABLE]
From [22] we know that the intercept term tends to minus infinity, then with or , we use the previous expansion neglecting the remainder term. Thus, equations (2.2) become
[TABLE]
and factoring by in the first equation of this system we have
[TABLE]
because . Looking back at (3.2) without the first equation, we have
[TABLE]
but
[TABLE]
because and we obtain the desired result. ∎
In table 2, we present simulation results based on limit imbalanced equations (3.1) compared with classical logistic regression (2.2). The sampling procedure is the same as the one used for table 1, except that we fix the sample size at and vary the dimension for the matrix (we chose ).
The two estimates for standard and imbalanced regressions are very close to each other as shown by the mean of the norm – even if the problem is not imbalanced – so that standard deviation and bias are almost the same. This means that, if interesting properties can be established with the limit equations, this context will be appropriate to highlight new features in classical logistic regression.
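The convergence of the replicated logistic regression towards the limit imbalanced solution can be reproduced on a toy example. The sketch below assumes the Owen-type reading of the limit equations: the non-intercept coefficients make the exponentially tilted class 0 distribution reproduce the class 1 predictor means. The data, the replication factor of 10,000, and the helper names (`solve`, `fit_logistic`, `fit_limit`) are hypothetical; plain gradient ascent and Newton-Raphson stand in for whatever solver the author used.

```python
import math


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [[float(v) for v in A[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def fit_logistic(rows, w0, w1, iters=100):
    """Newton-Raphson MLE of a weighted logistic regression; beta[0] is the intercept."""
    X = [[1.0] + list(r) for r in rows]
    d = len(X[0])
    # Start from the intercept-only fit so that Newton steps stay moderate.
    beta = [math.log(sum(w1) / sum(w0))] + [0.0] * (d - 1)
    for _ in range(iters):
        grad = [0.0] * d
        hess = [[0.0] * d for _ in range(d)]
        for x, a, b in zip(X, w0, w1):
            p = 1.0 / (1.0 + math.exp(-dot(x, beta)))
            for j in range(d):
                grad[j] += x[j] * (b - (a + b) * p)
                for k in range(d):
                    hess[j][k] += x[j] * x[k] * (a + b) * p * (1.0 - p)
        beta = [bj + sj for bj, sj in zip(beta, solve(hess, grad))]
    return beta


def fit_limit(rows, w0, mean1, iters=5000):
    """Gradient ascent on mean1.beta - log(sum_i w0_i exp(x_i.beta)), no intercept."""
    d = len(rows[0])
    step = 1.0 / max(dot(r, r) for r in rows)  # 1 / (bound on the curvature)
    beta = [0.0] * d
    for _ in range(iters):
        t = [a * math.exp(dot(r, beta)) for r, a in zip(rows, w0)]
        total = sum(t)
        tilted = [sum(ti * r[j] for ti, r in zip(t, rows)) / total for j in range(d)]
        beta = [bj + step * (m - tj) for bj, m, tj in zip(beta, mean1, tilted)]
    return beta


rows = [[0, 0], [1, 0], [0, 1], [1, 1]]  # distinct observations, intercept excluded
w0 = [50, 20, 20, 10]                    # class 0 (majority) counts
w1 = [2, 3, 4, 6]                        # class 1 (minority) counts
n1 = sum(w1)
mean1 = [sum(b * r[j] for b, r in zip(w1, rows)) / n1 for j in range(2)]

beta_limit = fit_limit(rows, w0, mean1)
beta_rep = fit_logistic(rows, [10_000 * a for a in w0], w1)
gap = max(abs(u - v) for u, v in zip(beta_limit, beta_rep[1:]))
```

Note that `fit_limit` only receives the class 1 means `mean1`, not the class 1 rows, which mirrors the information reduction described in the text.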
The first order term in Proposition 3.1 should be estimated to understand how good the limit imbalanced approximation is, without having to estimate the standard regression coefficients. Simulations show that this term is very small and we choose not to dwell on this intermediate situation, but it could be a more important result if non-binary design matrices are involved.
3.2 Strong overlap condition and rescaled likelihood
Existence and uniqueness conditions to solve (3.1) are well known [22]; they consist of an overlap condition slightly stronger than the one given by Silvapulle. In fact, we need the point to be surrounded by the rows of (hereafter denoted by the letter ). We give this result in the framework of the binary problem (simpler than Owen’s general case) and establish a new proof leading to a minimum relative entropy problem. From there and using duality, we build the corresponding rescaled likelihood, also justified by a rescaling of the initial likelihood.
Theorem 3.2.
There exists a unique finite solution to the limit imbalanced BLR problem if and only if there exists such that and . (If present, the null row (such that ) is removed, in order to have non-zero coefficients as for the overlap condition in Theorem 2.1.)
Remark 3.1.
The condition means that we have with and so that . In other words, the existence and uniqueness of a solution for the limit problem implies existence and uniqueness for its associated BLR problem.
Our proof of this theorem is based on the following three lemmas.
Lemma 3.1.
The log-sum-exp function , defined by is a convex, continuous, increasing function on . The function , is continuous and convex on .
Proof.
Function h has a positive semi-definite Hessian and is then convex. Furthermore for all such that , , we have and the function is increasing on . The composition with an affine mapping preserves continuity and convexity. Thus, with and we obtain a convex continuous and . ∎
Lemma 3.2.
The function , is strictly convex on .
Proof.
The Hessian of , , is the following:
[TABLE]
For all , we have
[TABLE]
which is non-negative due to the Cauchy-Schwarz inequality. This expression is equal to zero if and only if there exists such that . Thus, the function is affine only in the constant direction ; in any other direction, it is strictly convex.
Suppose that there exists a family of parameters such that and . This means that along the path described by the function is affine. We obtain and with , we have such that . This is impossible because the matrix is of full rank, which proves the lemma. ∎
We present a corollary to a theorem on the Legendre-Fenchel transform of convex composite functions presented in [14].
Lemma 3.3.
If functions , are convex and continuous with and is convex, continuous and increasing with , then the convex conjugate of is given by
[TABLE]
with .
Proof of the theorem.
Let us define the function such that
[TABLE]
is differentiable on and the first order equations
[TABLE]
are equal to the system (3.1) with . Function is strictly concave as the sum of a concave function and a strictly concave function (see Lemma 3.2). Consequently, the solution to is unique.
We now introduce the convex conjugate of the function :
[TABLE]
We will prove that the three following sets are identical
[TABLE]
[TABLE]
[TABLE]
i) . If there exists solution to (3.1), that is . Moreover because of the strict concavity of . Thus .
ii) . We use the Lemma 3.3 with and the log-sum-exp function verifying the necessary conditions (Lemma 3.1). We have the convex conjugate if and elsewhere (we do not consider the presence of a null row ). The only way to obtain a finite result is to impose the constraint for all . Therefore, knowing that
[TABLE]
we have
[TABLE]
[TABLE]
We minimize a Kullback–Leibler divergence between two distributions under linear constraints. If one of the is zero, if elsewhere (see [14]) and the previous equalities remain true with . The KKT conditions of this problem impose the constraint for all . Thus,
[TABLE]
This minimum exists: this is a linear restriction to a convex and continuous function in a simplex and therefore .
iii) . If , then there exists such that and so that
[TABLE]
If the supremum is reached, there is a maximizing element and , this element is the solution to the system (3.1) and thus . To state this result, it is enough to have coercive. Let be an arbitrary vector and with . Then,
[TABLE]
with . Notice that the vector cannot satisfy the relations because is of full rank. Thus, if , we have
[TABLE]
and because . Therefore, with
[TABLE]
which proves that the function is coercive when and completes the proof. ∎
The expression in (3.4) is the minimization of a relative entropy between the class 0 distribution and a kind of ghost class 1 distribution (built on the design matrix). With the duality property, we can introduce a new likelihood. The following proposition leads to the same "limit" likelihood and justifies the use of the adjective "rescaled". Indeed:
Proposition 3.2.
The limit imbalanced equations arise from the following rescaled likelihood:
[TABLE]
Proof.
With the initial likelihood
[TABLE]
and the relation (see 3.3), we obtain the following expression for the likelihood, using notation :
[TABLE]
We consider that is large enough to consider the limit (with fixed) and to make the approximations
[TABLE]
and
[TABLE]
thus, using a rescaling term,
[TABLE]
as tends to minus infinity because . ∎
The reader can see an analogy in physics with the existence of different modeling scales. For example, the discrete microscopic N-body problem turns into the mesoscopic Boltzmann equation using the Boltzmann-Grad limit. See the book [25] for further information on hydrodynamic limits.
This new likelihood now makes it possible to consider a wide range of problems, related to variance reduction using simple prior penalties (Subsection 3.4) or regularization (Section 4).
3.3 The relative entropy dual problem
With a likelihood and an entropy, we benefit from two points of view in order to numerically estimate the regression coefficients. The classical approach using a Newton-Raphson algorithm associated to the likelihood can be challenged by other algorithms on the primal or dual problems as described in [20] and [30] for classical logistic regression. We present here the dual problem and its link with initial regression coefficients. We leave the numerical analysis to another study.
Proposition 3.3.
The regression coefficients of the limit imbalanced regression are given by the formulae
[TABLE]
where is the probability distribution solving a relative entropy problem with linear constraints
[TABLE]
and with .
Proof.
With the existence of a unique solution (see Subsection 3.2), there exists a solution such that , , and
[TABLE]
Then, using equations (3.1) we obtain
[TABLE]
Let in be an orthogonal matrix to (the previous relation remains true with instead of ) and , such that we can remove to obtain the relation
[TABLE]
hence,
[TABLE]
Summing all these relations with weights , using the fact that , gives
[TABLE]
Due to convexity of the Kullback-Leibler divergence, we have a unique minimum obtained (by definition of ) at . Therefore
[TABLE]
and the result is proved if is of full rank. Suppose that this is not the case. Then, there exists such that , therefore with a vector with identical components all equal to . Consequently, the matrix (that is with the intercept column of ones) is no longer of full rank, which is, by definition of , impossible. ∎
3.4 Priors for variance reduction and a priori information
The rare events structure of class imbalance goes hand in hand with the problem of precision for estimates. A classical solution consists in introducing an a priori distribution in a Bayesian context. This can be done using a Jeffreys non-informative prior [15] allowing both first order bias removal and variance shrinkage [7]. Thus, we have to maximize the expression
[TABLE]
with the determinant of the Fisher information matrix. This approach is implemented in the R package logistf for logistic regressions. In the imbalanced case, we search for a method conserving the shape of the limit equations and achieving at the same time variance reduction: we choose the following approximation
[TABLE]
supposing an absence of correlation between predictors in a random design framework (see Section 4). With this hypothesis, we derive first order equations
[TABLE]
thus,
[TABLE]
In table 3, we simulate data sets as previously done with the length for fixed (to ) and we compare estimated bias and variance for coefficient with three different methods: a classical logistic regression (bias and sd.), the imbalanced case with means (im. bias and im. sd.) and the Jeffreys exact penalty (J. bias and J. sd.).
Variance reduction is about 2 percent with the Jeffreys prior and about half as much with its easily computable approximation in class imbalance. The bias was already small and gets a little smaller. The shrinkage of the variance is limited by the Cramér-Rao bound (see Fisher variance in table 1), so no miraculous reduction was conceivable.
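As a point of reference, the Jeffreys-penalized criterion can be sketched in a few lines. The sketch below assumes the standard Firth form, that is the log-likelihood plus half the log-determinant of the Fisher information $X^\top W X$, for an ordinary (not rescaled) logistic model. The function names `firth_penalized_loglik` and `_logdet_spd` are illustrative and are not taken from the logistf package.

```python
import math


def _logdet_spd(a):
    """Log-determinant of a symmetric positive definite matrix via Cholesky."""
    n = len(a)
    chol = [[0.0] * n for _ in range(n)]
    logdet = 0.0
    for j in range(n):
        d = a[j][j] - sum(chol[j][k] ** 2 for k in range(j))
        if d <= 0.0:
            raise ValueError("information matrix is not positive definite")
        chol[j][j] = math.sqrt(d)
        logdet += 2.0 * math.log(chol[j][j])
        for i in range(j + 1, n):
            s = a[i][j] - sum(chol[i][k] * chol[j][k] for k in range(j))
            chol[i][j] = s / chol[j][j]
    return logdet


def firth_penalized_loglik(X, y, beta):
    """Return l(beta) + 0.5 * log det I(beta) for a logistic model.

    X: list of rows (with a leading 1.0 if an intercept is wanted),
    y: list of 0/1 responses, beta: list of coefficients.
    Assumed form: the standard Firth / Jeffreys penalty.
    """
    p = len(beta)
    loglik = 0.0
    info = [[0.0] * p for _ in range(p)]  # Fisher information X^T W X
    for row, yi in zip(X, y):
        eta = sum(b * x for b, x in zip(beta, row))
        # y * eta - log(1 + exp(eta)), written in a numerically stable way
        loglik += yi * eta - (max(eta, 0.0) + math.log1p(math.exp(-abs(eta))))
        if eta >= 0:
            mu = 1.0 / (1.0 + math.exp(-eta))
        else:
            e = math.exp(eta)
            mu = e / (1.0 + e)
        w = mu * (1.0 - mu)
        for j in range(p):
            for k in range(p):
                info[j][k] += w * row[j] * row[k]
    return loglik + 0.5 * _logdet_spd(info)
```

The sketch only evaluates the criterion; maximizing it over `beta` with a generic optimizer would give the Jeffreys-penalized estimate.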
In the next section, we consider path following methods to complete regularization and highlight its "simplicity" with binary data. The initial parameters being the maximum a posteriori estimate (MAP), this estimation is a central problem of the limit imbalanced study. The benefit of the rescaled likelihood compared with the standard one lies in the easy use of exponential a priori penalties. Indeed, with the penalty (footnote 4: P could be written as a probability distribution with a normalization term, since the support of regression coefficients is finite)
[TABLE]
where , we maintain the shape of the likelihood by only perturbing the predictor means vector by (the MAP exists if and only if is surrounded by the rows of , see Theorem 3.2).
4 Path estimators for Lasso-type regularization
In this section, we consider that each observation () is generated by a random binary vector with , . With this model, we find several path estimators depending on the underlying correlation structure of the random design.
4.1 Limit lasso properties
The well-known lasso regularization consists in introducing a positive parameter defining the strength of a Laplace prior distribution [29]. We search for the maximum of the expression
[TABLE]
which verifies the following simple first order conditions. Notice that we use, from now on, the notation instead of to facilitate the reading.
Proposition 4.1.
The limit imbalanced BLR problem with lasso penalty leads to the system of equations
[TABLE]
with and , if , , if for all ( is the subgradient of the norm).
Thus, the lasso has a clear interpretation as a shift operating on the observed proportions . Thereafter, we often use the vector defined as .
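For reference, the correspondence between the lasso penalty and a Laplace prior, together with the subgradient of the $\ell_1$ norm, can be written for a generic log-likelihood $\ell$ rather than the rescaled one. The symbols $\ell$, $p$ and $\pi$ are notation assumed here for illustration, and the scaling of $\lambda$ may differ from the one used in (4.1).

```latex
\hat\beta(\lambda) \;=\; \arg\max_{\beta}\;
  \Big\{ \ell(\beta) - \lambda \sum_{j=1}^{p} |\beta_j| \Big\},
\qquad
\pi(\beta) \;\propto\; \exp\Big( -\lambda \sum_{j=1}^{p} |\beta_j| \Big),

s_j = \operatorname{sign}(\beta_j) \ \text{ if } \beta_j \neq 0,
\qquad
s_j \in [-1, 1] \ \text{ if } \beta_j = 0 .
```

The first-order conditions then read $\partial \ell / \partial \beta_j = \lambda s_j$ for every $j$, which is the generic form behind the shift interpretation.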
Proposition 4.2.
If the strong overlap condition in Theorem 3.2 is satisfied, then the function
[TABLE]
is continuous for all and there exists , such that .
Proof.
With the positivity of , we have,
[TABLE]
and for all , is a strictly convex and coercive function in if the strong overlap condition is satisfied (see proof of Theorem 3.2). Therefore, the function is well defined for all . Furthermore, this function is continuous because of the continuity in of and its strict concavity in . The equations (4.1) with have no solution if one of the components of is equal to or , therefore for all . ∎
Remark 4.1.
Using the law of large numbers, the family of model parameters solves the system of equations
[TABLE]
with being the expectation operator. This previous system of equations takes the same form as in (4.1) because is a discrete random vector and therefore the path estimator is continuous. Notice that the function : is also continuous in .
4.2 Path estimators
Thanks to this previous remark, we are able to find precise analytic estimators of the path in the case of independent and orthogonal random designs. Notice that such solutions already exist in the framework of linear regression (see [29]). From now on, the strong overlap condition is considered to be always satisfied at .
Theorem 4.1.
If the random vector generating the observations has independent components, a precise path estimator is given by the formulae
[TABLE]
[TABLE]
The coefficients are given by the classical MLE (solution of equations (2.2) without intercept) if we want to estimate the path obtained by an (imbalanced) logistic regression. If we use the limit equations, we need the MLE of the rescaled likelihood (Proposition 3.2 and equations (3.1)) and in this case:
[TABLE]
Proof.
For all we use the hypothesis of independence:
[TABLE]
[TABLE]
and the solution is
[TABLE]
and if . Indeed, is negative in region and positive in region . for a random design with independent predictors (see Appendix A.2 with ). We replace all the by the frequencies of observations to obtain the estimator. ∎
The orthogonal case, when the inner product between columns of the design matrix vanishes (, ), is also tractable.
Theorem 4.2.
If the random design is orthogonal, we have filled by zeros except at positions , and the derivative of the path estimator takes the form
[TABLE]
with a family of subsets of containing the indexes of non-zero coefficients of vector at time . The algorithm that describes the positions of the change-points in is described in the proof.
Proof.
With the hypothesis of orthogonality, equations (4.2) reduce to
[TABLE]
and we obtain
[TABLE]
Let and , then
[TABLE]
After computation, we have explicit formulae for the continuous functions and ():
[TABLE]
with , and . These functions are monotonic; to draw the path we need the change-points, that is the finite sequence of different models , . For all , is a unique subset. If and we know we determine ,, and by solving
[TABLE]
We define and the two adjacent change-points are given by
[TABLE]
Therefore,
[TABLE]
with .
The path can be built forward or backward. If we choose the path following approach (forward), is found using the MLE of the rescaled likelihood (see Section 3) and . In the other configuration (backward), we have and for , and , so that . ∎
Simulations with this type of design show that each path usually vanishes only once (and does not reappear) and thus is a very rare (impossible?) configuration.
The opposite situation to orthogonality is inclusion. For example, if is included in meaning that for the observed data if , we find an analytic description of the estimator given by the formulae
[TABLE]
This solution is likely generalizable (with a design in stairs as presented in Appendix A.4); however, this case is meaningless in the analysis of spontaneous reports databases and is therefore left aside.
We give examples of plots of path estimates compared with a standard (using not ) lasso path for different imbalance strengths in appendix C. The results highlight the high quality of the analytic path estimators, even in absence of class imbalance.
Remark 4.2.
Another regularization method is called the elastic net penalization and uses, in addition to the lasso, a second penalized term of ridge (or Tikhonov) kind [31]:
[TABLE]
with . In the case of independence in random vector , we have an explicit formula for with respect to :
[TABLE]
for between [math] and . The coefficients vanish when The proof of this result is a simple adaptation of the proof for the lasso in Theorem 4.1.
4.3 Negative correlation structure
If the random design verifies the relations , , , this in-between situation of a -dependent negative correlation between variables () is also tractable and particularly interesting in the sparse context of near-zero components for vector (footnote 5: spontaneous reports databases are an example of such a sparsity with negative correlation). We find two estimators that surround the real path.
Theorem 4.3.
The path estimator in the -dependent negative correlation case is surrounded by two estimators, whose derivatives are given by
[TABLE]
with and . Since the resurgence of a coefficient after vanishing is rare, we neglect this possibility and we easily find the vanishing points and thus the family of subsets .
Proof.
We differentiate equations (4.1) with respect to considering only the equations verifying the condition , i.e. . We obtain at time ,
[TABLE]
[TABLE]
or written differently,
[TABLE]
where is a t-dependent proportion of rows with a one on the columns and . With only negative correlations or independence between components of , we define the matrix with as long as ,
[TABLE]
if observations give such a matrix . We obtain with a diagonal matrix filled with the elements . Matrix is the correlation-track matrix containing ones at positions if and we have (footnote 6: the non-singularity of the matrix in (4.7), , will be proven with Proposition 5.1)
[TABLE]
so that, using the positivity of all the elements in matrix :
[TABLE]
with the diagonal matrix filled with vector and with vector . Finally,
[TABLE]
∎
In presence of sparsity (small components in ), and , which makes the previous upper and lower bounds good path estimators. The (or more) change-points are determined step by step as in the previous subsection and the estimated path is stuck between a lower path and an upper path.
5 Efficient algorithms for Lasso regularization
In this last section, we propose two new algorithms drawing piecewise logarithmic approximate paths derived from a small number of matrix inversions ( or more). The logarithmic function naturally arose in the expressions of all previously found path estimators; consequently, we build approximations involving this function. The main benefit of our algorithms is the direct computation of the sequence as done by the LARS [5] for linear regression. Our first algorithm follows the path ( increases) and is a simplified procedure adapted to data with a low correlation structure. The second algorithm is a backward procedure ( decreases toward zero) and can challenge the classic coordinate descent approach [9]. The efficiency of the algorithms is finally illustrated on pharmacovigilance data.
5.1 Cauchy problem
The derivative of the first order equations for the Lasso with respect to leads to a Cauchy problem.
Proposition 5.1.
The Lasso regularization path is described by the following system of differential equations
[TABLE]
with (), , and
[TABLE]
Proof.
Equations (4.7) are divided by vector and we obtain the desired equations. It remains to prove the non-singularity of matrix for all .
With diagonal matrix filled by elements we build a matrix whose elements are:
[TABLE]
Suppose that this matrix is singular, then there exists a non-identically null vector such that or written component-by-component
[TABLE]
We compute the linear combination to obtain after computations
[TABLE]
with . This relation is expanded and simplified into
[TABLE]
This is a sum of non-negative terms equal to zero, meaning that each term vanishes and we get for all . Thus which is impossible because matrix is a full rank matrix. ∎
5.2 The piecewise logarithmic approximate path: a first simple algorithm
Path following algorithms [23] compete with the more widely used coordinate descent algorithms [9] [10]. We present here a simple algorithm for an increasing regularization parameter . Within this procedure, we are able to estimate at each step the value of the next vanishing component in vector and thus speed up the classical Newton-Raphson step [23]. We consider that correlation between predictors is "low", so that the emergence of a coefficient along the path after vanishing is not taken into account (but this case is included in the second algorithm).
Proposition 5.2.
*The path following algorithm for limit imbalanced logistic regression by binary predictors (with low correlation) is the following:
, , given. .
WHILE DO*
[TABLE]
with such that
[TABLE]
The path, on the segment , is given by
[TABLE]
[TABLE]
* becomes .
END DO.*
Proof.
Equations (4.7) take the form with called correction matrix.
[TABLE]
Between two annulations of regression coefficients along the path ( and ), we consider this matrix to be constant (). In this case,
[TABLE]
We have , but the sequence of values is unknown. However, we iteratively approximate them as follows. With
[TABLE]
[TABLE]
because is small for relative small step . We obtain the piecewise logarithmic path:
[TABLE]
with
[TABLE]
is the set of values for solving (5.2) with (for each ). The set gives at each step the indexes of regression coefficients to remove from . ∎
Other approximations could be performed, for example using a second order term in the previous approximation (5.2). Simulation tests show that our choice seems to give better results. We notice that the size of the matrix decreases during this procedure, speeding up the computation at each new step .
Remark 5.1.
This algorithm has two main computational advantages. Firstly, the sequence is directly determined, whereas other algorithms use a regular discretization on a logarithmic scale (coordinate descent) or Newton-Raphson steps (path following). Secondly, the sum does not appear in the matrices, which can greatly reduce the computational cost, especially if the matrix is sparse ( of ones in the French spontaneous reports database): this algorithm handles sparsity!
To explore the efficiency of the algorithm, we simulate data sets with different correlation structures. Model selection is often performed with the BIC [26], which requires knowing the different models arising along the path. Hence, we decide to evaluate the algorithm accuracy using a simple indicator: a comparison of the sequence of coefficients in the order of vanishing along the path. The indicator is if a simulation with our algorithm gives coefficients at the same index as in the sequence obtained by a classical lasso algorithm (coordinate descent in R package glmnet). The correlation coefficient (from to ) means that we chose initial .
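A minimal sketch of such an indicator, assuming each path is stored as a list of coefficient vectors ordered by increasing regularization parameter. The helper names `vanishing_order` and `order_agreement` and the tolerance `tol` are hypothetical choices made for this illustration.

```python
def vanishing_order(betas, tol=1e-10):
    """Predictor indexes sorted by the step at which they first hit zero.

    betas: list of coefficient vectors, one per value of the regularization
    parameter, ordered by increasing regularization. Predictors that never
    vanish are placed last; ties are broken by predictor index.
    """
    n_pred = len(betas[0])
    first_zero = []
    for j in range(n_pred):
        step = next(
            (k for k, b in enumerate(betas) if abs(b[j]) <= tol),
            len(betas),
        )
        first_zero.append(step)
    return sorted(range(n_pred), key=lambda j: (first_zero[j], j))


def order_agreement(seq_a, seq_b):
    """Number of positions at which two vanishing orders name the same predictor."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a == b)


# Toy path with three predictors: predictor 2 vanishes first, then 1, then 0.
toy_path = [
    [1.0, 0.5, -0.2],
    [0.6, 0.1, 0.0],
    [0.2, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
toy_order = vanishing_order(toy_path)
# Comparing with an order in which the last two predictors are swapped.
toy_score = order_agreement(toy_order, [2, 0, 1])
```

A single swap of two predictors that vanish at nearly the same point lowers the score by two, so a modest score does not necessarily indicate a poor path.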
We simulate paths for each number and , being the number of predictors in correlation. For each path, and the regression coefficients are always the same and chosen on a regular scale between and .
With or correlated predictors over the 10 used, the exact solution with assumption of independence (i) (see Theorem 4.1) deteriorates with the increase in correlation (r), which is (almost) not the case if we use our algorithm (a). Notice that, even with a result around , the approximate path is often very close to the exact one; this is due to the inversion in the sequence of two close terms (see (5.1)).
5.3 A new algorithm
The second algorithm presented in this section computes forward selection. It is more suitable for problems with a large number of predictors (when we are looking for a sparse model) or/and in presence of a strong correlation structure.
The standard approach for computing regularization path by decreasing with logistic regression consists in using a first order quadratic approximation of the first derivative of the likelihood between two consecutive close solutions (that is, in practice, two parameters and such that is small). Using small steps for the parameter sequence to ensure a good approximation, the path is drawn by the cyclical coordinate method (see [9] and the R package glmnet). Our new algorithm is a kind of equivalent of the LARS algorithm for logistic regression: we compute large steps in . Furthermore, in comparison with the cyclic coordinate descent algorithm, there is no loop at a fixed parameter . After presenting the algorithm, we challenge the glmnet package with our approach.
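For comparison, the grid-based strategy described above can be sketched as follows. This is not the glmnet implementation: proximal gradient steps (soft-thresholding) are swapped in for cyclical coordinate descent to keep the code short, and the names `lasso_path`, `ratio`, `n_lambda` and `n_iter` are assumptions of this sketch. It does show the two features contrasted with the new algorithm: a fixed logarithmic grid for the regularization parameter and an inner loop at each grid value.

```python
import math


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def gradient(X, y, b0, beta):
    """Gradient of the mean negative log-likelihood w.r.t. (intercept, beta)."""
    n, p = len(X), len(beta)
    g0, g = 0.0, [0.0] * p
    for row, yi in zip(X, y):
        r = sigmoid(b0 + sum(b * x for b, x in zip(beta, row))) - yi
        g0 += r / n
        for j in range(p):
            g[j] += r * row[j] / n
    return g0, g


def soft_threshold(z, t):
    return math.copysign(max(abs(z) - t, 0.0), z)


def lasso_path(X, y, n_lambda=10, ratio=0.01, n_iter=500):
    """Lasso-logistic path on a decreasing log-scale grid, with warm starts.

    X holds 0/1 predictors (no intercept column); y must contain both classes.
    """
    p = len(X[0])
    ybar = sum(y) / len(y)
    b0, beta = math.log(ybar / (1.0 - ybar)), [0.0] * p
    # Smallest penalty for which all penalized coefficients stay at zero.
    lam_max = max(abs(gj) for gj in gradient(X, y, b0, beta)[1])
    # Step 1/L: for 0/1 predictors the Hessian norm is at most (p + 1) / 4.
    step = 4.0 / (p + 1)
    path = []
    for k in range(n_lambda):
        lam = lam_max * ratio ** (k / (n_lambda - 1))
        for _ in range(n_iter):  # inner loop at a fixed penalty value
            g0, g = gradient(X, y, b0, beta)
            b0 -= step * g0  # the intercept is not penalized
            beta = [
                soft_threshold(bj - step * gj, step * lam)
                for bj, gj in zip(beta, g)
            ]
        path.append((lam, b0, beta))
    return path


# Small illustrative data set (hypothetical): 10 observations, 3 binary predictors.
X_demo = [
    [1, 0, 0], [1, 0, 1], [1, 1, 0], [0, 0, 0], [0, 1, 0],
    [0, 0, 1], [0, 1, 1], [1, 1, 1], [1, 0, 0], [0, 0, 0],
]
y_demo = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
path_demo = lasso_path(X_demo, y_demo)
```

Each entry of `path_demo` is a triple (penalty, intercept, coefficients); the number of grid points and inner iterations is what a direct computation of the change-points avoids.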
Proposition 5.3.
*The backward algorithm for limit imbalanced logistic regression by binary predictors is the following:
, , and given. .
WHILE ( or ) DO*
[TABLE]
with
[TABLE]
and
[TABLE]
Definitions for matrices and are given in the proof. Notice that (as for ). The path, on the segment , is given by
[TABLE]
and for the subgradients
[TABLE]
The new set is given by
[TABLE]
with
[TABLE]
* becomes .
END DO*
Proof.
We differentiate equations (4.1) for all in (see also (4.7)):
[TABLE]
or in matrix form with , , and vectors , and we get
[TABLE]
is a square non-singular matrix for all in (see Remark 5.1). Between two consecutive values and () of the sequence, we consider that and , thus
[TABLE]
with and . The system of equations involving matrix is solved as in the proof of Proposition 5.2 and we get
[TABLE]
with . The second set of equations gives
[TABLE]
and using the usual approximation
[TABLE]
with and we find
[TABLE]
We solve equations (, and , ) to find the possible values for . The maximum of obtained negative values within the results is used to build the sequence. ∎
To visualize what is happening during the algorithm, we define linear functions and leading to the functions () such that
[TABLE]
Functions are all piecewise linear and can be drawn in the plane shown in Figure 1.
5.4 Path reconstruction with the French spontaneous reports database
We illustrate the efficiency of the limit path construction by piecewise logarithmic functions on the French spontaneous reports database. We look at two examples, a first one with no evidence of correlation and a second with strong correlations. The database contains about 330000 reports in 2016 and the imbalance is high or very high for all the adverse effects [2]. In the following graphs, the dotted lines represent results obtained by our algorithm, the solid ones result from the classical glmnet package.
Figure 3 shows common features encountered with other examples. The path of the exponential of the coefficients forms a set of piecewise linear functions and the algorithm remains efficient even if the number of predictors is high (150 in the examples). It seems that there is no case of a path with a curve reappearing after a first cancellation (which would be due to a strong correlation between predictors with opposite signs of initial coefficients). Thus, the sets in the algorithm do not have to be determined.
We notice that the accuracy of this path following algorithm can easily be increased by adding intermediate steps (in variable t). The main computing limitation being the matrix inversion, one could study the inner product (Gram) matrix for class 0 and reorder rows and columns to reveal patterns and form a block diagonal matrix. These blocks could result from a statistical study of the Gram matrix (footnote 7: to that end, see the literature of the block clustering problem [11]), that is, finding the pairwise independent predictors, as well as from pharmacological assumptions (medical treatments also shape patterns). Thus, computational costs become a marginal problem and one can concentrate on the bias correction by adding priors related to temporal bias, under-reporting or the introduction of similarity modifying the matrix (footnote 8: with a similarity matrix , the similarity is defined as follows: coefficient becomes ).
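The reordering idea can be illustrated with a short sketch. It groups predictors into blocks by following the non-zero pattern of the Gram matrix with a union-find structure. Using a plain threshold `tol` on the Gram entries as the grouping criterion is an assumption of this sketch (the text suggests a statistical study or pharmacological knowledge instead), and the names `gram_matrix`, `gram_blocks` and `block_order` are hypothetical.

```python
def gram_matrix(rows):
    """Inner product matrix X^T X for a list of 0/1 observation rows."""
    p = len(rows[0])
    return [
        [sum(r[j] * r[k] for r in rows) for k in range(p)]
        for j in range(p)
    ]


def gram_blocks(gram, tol=0.0):
    """Group predictors linked by a chain of entries with |gram[j][k]| > tol."""
    n = len(gram)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for j in range(n):
        for k in range(j + 1, n):
            if abs(gram[j][k]) > tol:
                rj, rk = find(j), find(k)
                if rj != rk:
                    parent[rk] = rj
    blocks = {}
    for j in range(n):
        blocks.setdefault(find(j), []).append(j)
    return sorted(blocks.values(), key=lambda b: b[0])


def block_order(blocks):
    """Permutation of predictor indexes that makes the Gram matrix block diagonal."""
    return [j for block in blocks for j in block]


# Toy class-0 rows: predictors 0 and 2 co-occur, as do predictors 1 and 3.
rows_demo = [
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
]
blocks_demo = gram_blocks(gram_matrix(rows_demo))
order_demo = block_order(blocks_demo)
```

Once the blocks are known, each block can be inverted separately, which is where the reduction in computational cost would come from.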
6 Conclusion and perspectives
The central novelty of this work is the introduction of a rescaled likelihood for the limit imbalanced logistic regression problem. The expression of this likelihood could have some connections with the well-known likelihoods of the self-controlled case series method [28] and of the proportional hazards model [4] used in epidemiology.
Most results presented for binary data can be extended to other data types. However, simulations have been done only with binary data, having in mind the underlying applied problem of pharmacovigilance. The new estimate is always very close to the initial MLE because data are located on the vertices of the hypercube and are therefore "close" to one another. A convergence study of all possible existing algorithms for the primal and dual problems could be performed with different class imbalances and an evaluation of the first order term.
The variance reduction is a central issue that has to be treated in a Bayesian framework. Whereas the prior to add in the standard logistic regression is unclear, the rescaled likelihood takes a well-adapted form for exponential priors. We considered model selection using the BIC and the lasso to answer this question. Due to binary data, the lasso regularization problem became easier to understand in our limit imbalanced case: we found many precise estimators. Piecewise logarithmic approximate paths are built by an effective path following procedure which determines step by step the vanishing time of each path, does not use any loops as in coordinate descent algorithms and computes expressions only involving non-zero data. Moreover, this algorithm can take into account the correlation structure between predictors to further shrink computational costs. The values for , and for matrix could be shifted in order to incorporate absolute bias, temporal bias, under-reporting and similarity or correlation corrections.
7 A pharmacovigilance project?
Within this paper, we have had in mind the pharmacovigilance context as this work was carried out in parallel with a one-year engineering job at the French National Institute of Health and Medical Research (footnote 9: B2PHI laboratory UMR 1181, INSERM, UVSQ, Institut Pasteur, Villejuif 94807, France). We hope this article could contribute a little to the development of mathematical tools for pharmacovigilance purposes. The science of drug safety at a postmarketing level is nearly non-existent in France as in many other countries: the reporting process of spontaneous reports is inadequate and resulting databases are badly processed with unsuitable tools. Public health scandals related to medication are steadily increasing and the spotlights are turned towards big pharmaceutical companies while patient associations should firstly require public authorities to establish a modern drug safety structure. To that end, the statistical community has a major role to play by proposing trustworthy decision-support tools, opposing science to political and financial influences. Creating a useful tool was the guideline of this present work and the author hopes that other mathematicians will embrace the direction initiated by this article.
We would like to conclude by giving our opinion about the work that remains to be done to obtain an operational tool (in five points), hoping that it will inspire epidemiologists.
1) Building priors related to bias (temporal bias, under-reporting…) with the help of pharmacologists.
2) Developing the proposed regularization algorithms evaluating their complexity and accuracy levels.
3) Introducing simple indicators to control the quality of the limit approximation.
4) Working on path visualization and new indicators (that are not thresholds).
5) Evaluating the obtained tool in the hands of pharmacologists (the use of reference sets is, to our mind, inadequate).
Acknowledgment
I would like to deeply thank Laetitia Comminges from the Paris-Dauphine University for relevant comments that greatly improved the manuscript. I also thank my colleague Mohammed Sedki from the INSERM laboratory of Villejuif for his constant encouragement to complete this work.
Appendix A Exact solutions
We give a collection of examples consisting in simple solutions of the equation (2.3).
A.1 No intercept
If there is no intercept and no interaction between the regressors, the matrix equals the identity matrix and
[TABLE]
If one row contains other ones, the inverse matrix is the same matrix with the added ones transformed into their opposites.
A.2 Intercept
If the square matrix is the following
[TABLE]
for the inverse matrix, so that the coefficients take the form
[TABLE]
A.3 Intercept with one correlation
The first row of the following matrix
[TABLE]
defines the set . The case is left out because it does not coincide with a non-singular matrix. The case corresponds to the previous example. The easiest way to solve this example is to look at initial equations (2.1). We write down the equations, where only the first one has a different form:
[TABLE]
and for ,
[TABLE]
Subtracting all the equations from the first one, we obtain
[TABLE]
which can be used to simplify equations (A.1) into
[TABLE]
Finally, we have
[TABLE]
so that we deduce the following closed form for the coefficients
[TABLE]
[TABLE]
Notice that the regression coefficients behave in a very unpredictable way. This can be seen on an example with and . The matrix is
[TABLE]
and we have
[TABLE]
The first intuition is to think that coefficients and do depend on the couples and respectively, but it is not the case!
A.4 Stairs
With
[TABLE]
and then
[TABLE]
Appendix B Proof of Proposition 3.1.
The result is proved with a succession of Taylor expansions of degree or in . We use
[TABLE]
then with (2.2), we have
[TABLE]
The first equation of this system gives
[TABLE]
therefore, using notations introduced in the proposition,
[TABLE]
and
[TABLE]
The coefficient is defined as . Now we have to find the Taylor expansion of of degree two in and substitute it into (B.1). We have
[TABLE]
[TABLE]
and
[TABLE]
[TABLE]
[TABLE]
Therefore
[TABLE]
[TABLE]
[TABLE]
[TABLE]
The system of equations (B.1) without its first equation is
[TABLE]
and we use the previous expression for :
[TABLE]
[TABLE]
or
[TABLE]
Then, using again a Taylor expansion,
[TABLE]
or
[TABLE]
Appendix C Path simulations
In the following graphs, the dotted lines are obtained by the exact path corresponding to Theorem 4.1 for examples 1 and 2, Theorem 4.2 for examples 3 and 4 and the inclusion case for examples 5 and 6. The solid lines are always given by a coordinate descent algorithm for the standard logistic regression. We change the scale for (by a linear rescaling) in order to have the same (see (5.1)) for the exact and algorithmic paths.
References
- [1] Ahmed Ismail, Pariente Antoine, Tubert-Bitter Pascale (2016) Class-imbalanced subsampling lasso algorithm for discovering adverse drug reactions. Statistical Methods in Medical Research.
- [2] Beziz et al. (2016) Spontaneous adverse drug reaction reporting in France: A retrospective analysis of reports made to the French medicines agency from 2002 to 2014. Revue d’Épidémiologie et de Santé Publique, 64.
- [3] Caster et al. (2010) Large-Scale Regression-Based Pattern Discovery: The Example of Screening the WHO Global Drug Safety Database. Stat. Anal. Data Min., 3, no. 4, 197–208.
- [4] Cox David Roxbee (1975) Partial likelihood. Biometrika, 62, no. 2, 269–276.
- [5] Efron Bradley, Hastie Trevor, Johnstone Iain, Tibshirani Robert (2004) Least angle regression. Annals of Statistics, 32, no. 2, 407–499.
- [6] Elrahman Shaza, Abraham Ajith (2013) A Review of Class Imbalance Problem. Journal of Network and Innovative Computing, 1, 332–340.
- [7] Firth David (1993) Bias reduction of maximum likelihood estimates. Biometrika, 80, no. 1, 27–38.
- [8] Fithian William, Hastie Trevor (2014) Local case-control sampling: efficient subsampling in imbalanced data sets. Ann. Statist., 42, no. 5, 1693–1724.
