Model-free Subsampling Method Based on Uniform Designs

Mei Zhang; Yongdao Zhou; Zheng Zhou; Aijun Zhang

arXiv:2209.03617·stat.ME·September 9, 2022·IEEE Trans. Knowl. Data Eng.

Open Access

TL;DR

This paper introduces a model-free subsampling method built on uniform designs and a new criterion, the generalized empirical F-discrepancy (GEFD). The method outperforms traditional methods, especially under model misspecification.

Contribution

The paper proposes a model-free subsampling strategy based on uniform designs and the GEFD criterion, supported by a theoretical link to the generalized L2-discrepancy and validated on simulation examples and a real case study.

Findings

01

The method outperforms random sampling in both simulation examples and a real case study.

02

It remains robust under various model specifications.

03

The approach is less sensitive to model misspecification than existing methods.

Abstract

Subsampling or subdata selection is a useful approach in large-scale statistical learning. Most existing studies focus on model-based subsampling methods which significantly depend on the model assumption. In this paper, we consider the model-free subsampling strategy for generating subdata from the original full data. In order to measure the goodness of representation of a subdata with respect to the original data, we propose a criterion, generalized empirical F-discrepancy (GEFD), and study its theoretical properties in connection with the classical generalized L2-discrepancy in the theory of uniform designs. These properties allow us to develop a kind of low-GEFD data-driven subsampling method based on the existing uniform designs. By simulation examples and a real case study, we show that the proposed subsampling method is superior to the random sampling method. Moreover, our method…
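
To make the idea in the abstract concrete, the following is a minimal sketch and not the authors' algorithm. It assumes one simple way of turning a uniform design into subdata: build a good lattice point set on the unit cube, map it to the data scale through marginal empirical quantiles, and keep the nearest distinct data row for each design point. It does not compute GEFD; the largest marginal Kolmogorov-Smirnov gap is used only as a stand-in measure of how well the subdata represents the full data. All function names (`glp_design`, `design_based_subsample`, `max_marginal_ks`), the generator vector, and the simulated data are hypothetical choices made for illustration.

```python
import numpy as np

def glp_design(n, generators):
    """Good lattice point set on (0,1)^d: row i is ((i * h_j mod n) + 0.5) / n."""
    i = np.arange(1, n + 1).reshape(-1, 1)
    h = np.asarray(generators).reshape(1, -1)
    return ((i * h) % n + 0.5) / n

def design_based_subsample(X, design):
    """Pick one distinct row of X near each design point (greedy nearest neighbor)."""
    N, d = X.shape
    # Map each design coordinate to the data scale via marginal empirical quantiles.
    # Assumption: this ignores dependence between variables.
    targets = np.column_stack(
        [np.quantile(X[:, j], design[:, j]) for j in range(d)]
    )
    scale = X.std(axis=0)
    available = np.ones(N, dtype=bool)
    chosen = []
    for t in targets:
        dist = (((X - t) / scale) ** 2).sum(axis=1)
        dist[~available] = np.inf  # never pick the same row twice
        k = int(np.argmin(dist))
        available[k] = False
        chosen.append(k)
    return np.array(chosen)

def max_marginal_ks(X, sub):
    """Largest marginal Kolmogorov-Smirnov gap between subsample and full-data ECDFs."""
    worst = 0.0
    for j in range(X.shape[1]):
        grid = np.sort(X[:, j])
        F_full = np.searchsorted(grid, grid, side="right") / len(grid)
        F_sub = np.searchsorted(np.sort(sub[:, j]), grid, side="right") / len(sub)
        worst = max(worst, float(np.abs(F_full - F_sub).max()))
    return worst

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 2))               # simulated "full data"
n = 89
design = glp_design(n, generators=(1, 55))   # 55 and 89 are coprime

idx_design = design_based_subsample(X, design)
idx_random = rng.choice(len(X), size=n, replace=False)

ks_design = max_marginal_ks(X, X[idx_design])
ks_random = max_marginal_ks(X, X[idx_random])
print(f"design-based subsample: max marginal KS = {ks_design:.4f}")
print(f"random subsample:       max marginal KS = {ks_random:.4f}")
```

A smaller gap indicates that the subdata tracks the marginal distributions of the full data more closely. Because the proxy looks only at marginals, it says nothing about how well joint structure is preserved, which is part of what a multivariate discrepancy such as GEFD is meant to capture.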

Taxonomy

Topics: Machine Learning and Algorithms · Machine Learning and Data Classification · Statistical Methods and Inference