Surprise sampling: improving and extending the local case-control sampling
Xinwei Shen, Kani Chen, Wen Yu

TL;DR
This paper introduces a generalized sampling scheme based on data 'surprise' that enhances stability and efficiency in classification tasks, especially with large, imbalanced datasets, and extends existing methods with theoretical guarantees.
Contribution
It proposes a new adaptive sampling method based on data 'surprise' that generalizes and improves upon local case-control sampling, with theoretical and practical advantages.
Findings
The new sampling scheme performs at least as well as existing methods.
It is robust to model misspecification and dependent pilot estimators.
Numerical studies support the theoretical claims.
Abstract
Fithian and Hastie (2014) proposed a new sampling scheme called local case-control (LCC) sampling that achieves stability and efficiency by utilizing a clever adjustment pertaining to the logistic model. It is particularly useful for classification with large and imbalanced data. This paper proposes a more general sampling scheme based on a working principle that data points deserve higher sampling probability if they contain more information or appear "surprising" in the sense of, for example, a large error of pilot prediction or a large absolute score. Compared with the relevant existing sampling schemes, as reported in Fithian and Hastie (2014) and Ai et al. (2018), the proposed one has several advantages. It adaptively yields the optimal forms for a variety of objectives, including LCC and the sampling of Ai et al. (2018) as special cases. Under the same model specifications, the…
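For readers unfamiliar with the baseline, the following is a minimal sketch of LCC sampling as described by Fithian and Hastie (2014), not of the surprise sampling scheme proposed in this paper, whose exact form is not given in this summary. Each point is kept with probability equal to the pilot's prediction error |y - p̃(x)|, a logistic model is fit on the kept points, and the pilot coefficients are added back. The function names, the synthetic data, the pilot size, and the plain Newton solver are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, n_iter=50):
    """Newton-Raphson logistic regression; X must already contain an intercept column."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ theta)
        grad = X.T @ (y - p)
        # Small ridge term keeps the Hessian invertible (an illustrative safeguard).
        hess = (X * (p * (1.0 - p))[:, None]).T @ X + 1e-8 * np.eye(X.shape[1])
        theta = theta + np.linalg.solve(hess, grad)
    return theta

def lcc_subsample_fit(X, y, theta_pilot, rng):
    """LCC sketch: accept with probability |y - pilot prediction|, fit, add pilot back."""
    p_pilot = sigmoid(X @ theta_pilot)
    accept_prob = np.abs(y - p_pilot)           # "surprising" points are kept more often
    keep = rng.random(len(y)) < accept_prob
    theta_sub = fit_logistic(X[keep], y[keep])  # estimates (theta - theta_pilot)
    return theta_sub + theta_pilot, keep

# Synthetic imbalanced data (hypothetical parameters chosen for the demo).
rng = np.random.default_rng(0)
n, n_pilot = 100_000, 2_000
theta_true = np.array([-3.0, 1.5])
X = np.column_stack([np.ones(n + n_pilot), rng.normal(size=n + n_pilot)])
y = (rng.random(n + n_pilot) < sigmoid(X @ theta_true)).astype(float)

# Pilot fit on a small held-out portion, then LCC on the remaining rows.
theta_pilot = fit_logistic(X[:n_pilot], y[:n_pilot])
theta_lcc, keep = lcc_subsample_fit(X[n_pilot:], y[n_pilot:], theta_pilot, rng)

print("kept fraction:", keep.mean())
print("LCC estimate:", theta_lcc)
```

In this sketch the pilot is fit on rows that are held out from the subsampling step, so the pilot is independent of the data being subsampled; that is a simplification for the demo rather than a requirement stated in this summary.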
