Surprise sampling: improving and extending the local case-control sampling

Xinwei Shen; Kani Chen; Wen Yu

arXiv:2007.02633·stat.ME·May 7, 2021

TL;DR

This paper introduces a generalized sampling scheme based on data 'surprise' that enhances stability and efficiency in classification tasks, especially with large, imbalanced datasets, and extends existing methods with theoretical guarantees.

Contribution

It proposes a new adaptive sampling method based on data 'surprise' that generalizes and improves upon local case-control sampling, with theoretical and practical advantages.

Findings

01. The new sampling scheme performs at least as well as existing methods.

02. It is robust to model misspecification and dependent pilot estimators.

03. Numerical studies support the theoretical claims.

Abstract

Fithian and Hastie (2014) proposed a sampling scheme called local case-control (LCC) sampling that achieves stability and efficiency by utilizing a clever adjustment pertaining to the logistic model. It is particularly useful for classification with large and imbalanced data. This paper proposes a more general sampling scheme based on the working principle that data points deserve a higher sampling probability if they contain more information or appear "surprising" in the sense of, for example, a large error of pilot prediction or a large absolute score. Compared with the relevant existing sampling schemes, as reported in Fithian and Hastie (2014) and Ai et al. (2018), the proposed one has several advantages. It adaptively yields the optimal forms for a variety of objectives, including LCC sampling and the sampling scheme of Ai et al. (2018) as special cases. Under the same model specifications, the…
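For readers unfamiliar with the baseline that the abstract refers to, the following is a minimal sketch of LCC sampling as commonly described for Fithian and Hastie (2014): each point is kept with probability equal to the absolute error of a pilot prediction, a logistic model is fit on the kept points, and the pilot coefficients are added back as the adjustment. It does not implement the generalized surprise sampling scheme of this paper. The function names (`fit_logistic`, `lcc_subsample`, `lcc_estimate`) and the small Newton solver are illustrative assumptions, not code from either paper.

```python
import numpy as np


def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic(X, y, n_iter=25):
    """Unpenalized logistic regression by Newton's method.

    X must already contain an intercept column. This small solver is a
    stand-in (an assumption of this sketch) for any logistic regression routine.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        grad = X.T @ (y - p)                          # score vector
        hess = (X * (p * (1.0 - p))[:, None]).T @ X   # observed information
        beta = beta + np.linalg.solve(hess, grad)
    return beta


def lcc_subsample(X, y, beta_pilot, rng):
    """Keep each point with probability |y - pilot prediction|.

    Points the pilot predicts badly (the "surprising" ones) are kept more often.
    """
    p_pilot = sigmoid(X @ beta_pilot)
    accept_prob = np.abs(y - p_pilot)
    keep = rng.random(len(y)) < accept_prob
    return X[keep], y[keep]


def lcc_estimate(X, y, beta_pilot, rng):
    """Fit on the LCC subsample, then add the pilot back as the adjustment."""
    X_sub, y_sub = lcc_subsample(X, y, beta_pilot, rng)
    return fit_logistic(X_sub, y_sub) + beta_pilot
```

In this sketch `y` is a float array of 0/1 labels and `rng` is a `numpy.random.Generator`. With a reasonable pilot, the subsample is roughly balanced between the two classes at each value of the covariates, which is why the scheme suits imbalanced data.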
