TL;DR
This paper develops a new theoretical framework for high-dimensional logistic regression, revealing that classical inference methods are unreliable when the number of variables grows proportionally with the sample size, and proposes adjustments for accurate inference.
Contribution
It introduces a modern maximum-likelihood theory that accurately predicts the bias, variance, and distribution of the estimators in high-dimensional logistic regression, making inference more reliable.
Findings
The MLE is biased, and its variability is greater than classical theory predicts.
The likelihood-ratio test statistic does not follow a chi-square distribution in high dimensions.
The paper proposes a procedure that adjusts inference using a single scalar parameter.
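The first finding can be checked with a small simulation. The sketch below is only an illustration under assumed settings, not the paper's own experiment: the sample size, the dimension, the signal strength `gamma2`, the Gaussian features with variance 1/n, the seed, and the function names are all hypothetical choices. It fits an unpenalized logistic MLE with Newton's method and regresses the estimated coefficients on the true ones; classical theory would put that slope near 1, while the finding above suggests it should come out larger.

```python
import numpy as np


def neg_log_lik(X, y, beta):
    """Negative logistic log-likelihood, computed stably via logaddexp."""
    eta = X @ beta
    return np.sum(np.logaddexp(0.0, eta) - y * eta)


def fit_logistic_mle(X, y, n_iter=50, tol=1e-8):
    """Unpenalized logistic MLE (no intercept) via Newton's method
    with step halving. Assumes the MLE exists (no separation)."""
    beta = np.zeros(X.shape[1])
    loss = neg_log_lik(X, y, beta)
    for _ in range(n_iter):
        eta = X @ beta
        mu = 0.5 * (1.0 + np.tanh(0.5 * eta))   # numerically stable sigmoid
        w = mu * (1.0 - mu)
        grad = X.T @ (y - mu)                   # score vector
        hess = X.T @ (X * w[:, None])           # observed information
        step = np.linalg.solve(hess, grad)
        if np.max(np.abs(step)) < tol:
            break
        t = 1.0
        # Halve the step until the negative log-likelihood stops increasing.
        while neg_log_lik(X, y, beta + t * step) > loss and t > 1e-4:
            t *= 0.5
        beta = beta + t * step
        loss = neg_log_lik(X, y, beta)
    return beta


def simulate_slope(n=2000, p=400, gamma2=5.0, seed=0):
    """Simulate one high-dimensional logistic fit with p/n fixed.

    Illustrative assumptions: i.i.d. N(0, 1/n) features, half of the
    coefficients zero, the rest equal and scaled so that
    Var(x' beta) = gamma2.
    """
    rng = np.random.default_rng(seed)
    X = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, p))
    beta = np.zeros(p)
    k = p // 2
    beta[:k] = np.sqrt(gamma2 * n / k)
    prob = 0.5 * (1.0 + np.tanh(0.5 * (X @ beta)))
    y = (rng.random(n) < prob).astype(float)
    beta_hat = fit_logistic_mle(X, y)
    # Least-squares slope (through the origin) of beta_hat on beta.
    slope = float(beta_hat @ beta / (beta @ beta))
    return slope, beta_hat, beta


if __name__ == "__main__":
    slope, _, _ = simulate_slope()
    print(f"slope of estimated vs. true coefficients: {slope:.2f}")
```

A slope clearly above 1 means the fitted coefficients systematically overstate the true effects. Rerunning with a smaller `p` at the same `n` should move the slope back toward 1, which is the classical regime.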
Abstract
Every student in statistics or data science learns early on that when the sample size n largely exceeds the number p of variables, fitting a logistic model produces estimates that are approximately unbiased. Every student also learns that there are formulas to predict the variability of these estimates which are used for the purpose of statistical inference; for instance, to produce p-values for testing the significance of regression coefficients. Although these formulas come from large sample asymptotics, we are often told that we are on reasonably safe grounds when n is large in such a way that n ≥ 5p or n ≥ 10p. This paper shows that this is far from the case, and consequently, inferences routinely produced by common software packages are often unreliable. Consider a logistic model with independent features in which n and p become increasingly large in a fixed ratio. Then…
