Nearly Optimal Subdata Selection
Min Yang, Wei Zheng, John Stufken, Ming-Chung Chang, Ting Tian, Xueqin Wang

TL;DR
This paper introduces a new, efficient methodology for selecting subdata that retains maximal information for parameter estimation, approaching the optimal solution despite the NP-hard nature of the problem.
Contribution
It develops a novel algorithm based on optimal design theory that applies broadly, supports multiple criteria, and provides efficiency bounds, outperforming existing methods.
Findings
The new method produces highly efficient subdata selections.
It offers tight bounds for assessing subdata efficiency.
The algorithm converges and is applicable to general models.
Abstract
When, in terms of the number of data points, the size of a dataset exceeds available computing resources, or when labeling is expensive, an attractive solution consists of selecting only some of the data points (subdata) for further consideration. A central question for selecting subdata of size n from N available data points is which points to select. While an answer to this question depends on the objective, one approach for a parametric model and a focus on parameter estimation is to select subdata that retains maximal information. Identifying such subdata is a classical NP-hard problem due to its inherent discreteness. Based on optimal approximate design theory, we develop a new methodology for information-based subdata selection, resulting in subdata that approaches the optimal solution. To achieve this, we develop a novel algorithm that applies to a general model,…
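To make "subdata that retains maximal information" concrete, the following is a minimal sketch for a linear regression model, where the information matrix of a subset is proportional to X_s^T X_s and the D-criterion scores a subset by its log-determinant. It is not the algorithm proposed in the paper: it swaps in a simple greedy heuristic, and the function names, the `ridge` parameter, and the simulated data are all assumptions made for illustration.

```python
import numpy as np

def log_det_information(X, idx):
    """Log-determinant of X_s^T X_s, the (unscaled) information matrix
    of the subdata indexed by idx under a linear regression model."""
    Xs = X[idx]
    sign, logdet = np.linalg.slogdet(Xs.T @ Xs)
    return logdet if sign > 0 else -np.inf

def greedy_d_optimal(X, n, ridge=1e-6):
    """Hypothetical greedy heuristic (not the paper's method): add, one at a
    time, the unused point x that maximizes x^T M^{-1} x, where M is the
    current (ridge-regularized) information matrix."""
    N, p = X.shape
    M_inv = np.eye(p) / ridge          # inverse of the starting matrix ridge * I
    available = np.ones(N, dtype=bool)
    chosen = []
    for _ in range(n):
        # By the matrix determinant lemma,
        # det(M + x x^T) = det(M) * (1 + x^T M^{-1} x).
        gain = np.einsum("ij,jk,ik->i", X, M_inv, X)
        gain[~available] = -np.inf
        i = int(np.argmax(gain))
        chosen.append(i)
        available[i] = False
        # Sherman-Morrison rank-one update of the inverse.
        v = M_inv @ X[i]
        M_inv -= np.outer(v, v) / (1.0 + gain[i])
    return np.array(chosen)

# Simulated full data: intercept plus two standard normal covariates.
rng = np.random.default_rng(0)
N, p, n = 2000, 3, 50
X = np.column_stack([np.ones(N), rng.standard_normal((N, p - 1))])

greedy_idx = greedy_d_optimal(X, n)
random_idx = rng.choice(N, size=n, replace=False)

greedy_logdet = log_det_information(X, greedy_idx)
random_logdet = log_det_information(X, random_idx)
print(f"greedy log-det: {greedy_logdet:.2f}, random log-det: {random_logdet:.2f}")
```

A greedy pass like this gives no guarantee on how close the selected subdata is to the best possible subset, which is the kind of gap that efficiency bounds are meant to quantify.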
