A model-free subdata selection method for classification
Rakhi Singh

TL;DR
This paper introduces a model-free subdata selection method for classification. The resulting PED subdata is chosen by partitioning the data with decision trees and sampling from each component of the partition; random forests are then used to analyze the selected subdata, improving accuracy over existing methods.
Contribution
The paper proposes a novel model-free subdata selection approach for classification that does not rely on underlying model assumptions, enhancing robustness and accuracy.
Findings
PED subdata results in a smaller Gini index than uniform sampling.
PED subdata achieves higher classification accuracy in simulations.
The method is effective for multiple classes and various predictor types.
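For readers unfamiliar with the Gini index mentioned in the findings, the sketch below computes it under the assumption that it refers to the usual Gini impurity from decision trees, 1 minus the sum of squared class proportions, averaged over the components of a partition with weights proportional to their sizes. The function names `gini_impurity` and `weighted_gini` are illustrative and do not come from the paper.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of one group of class labels: 1 - sum_k p_k^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def weighted_gini(groups):
    """Size-weighted average impurity over the groups of a partition."""
    total = sum(len(g) for g in groups)
    if total == 0:
        return 0.0
    return sum(len(g) / total * gini_impurity(g) for g in groups)


# A pure group has impurity 0; an even two-class split has impurity 0.5.
pure = ["a", "a", "a", "a"]
mixed = ["a", "a", "b", "b"]
print(gini_impurity(pure), gini_impurity(mixed))
print(weighted_gini([["a", "a"], ["b", "b"]]))
```

A lower value means the groups are closer to containing a single class each.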
Abstract
Subdata selection is a study of methods that select a small representative sample of the big data, the analysis of which is fast and statistically efficient. The existing subdata selection methods assume that the big data can be reasonably modeled using an underlying model, such as a (multinomial) logistic regression for classification problems. These methods work extremely well when the underlying modeling assumption is correct but often yield poor results otherwise. In this paper, we propose a model-free subdata selection method for classification problems, and the resulting subdata is called PED subdata. The PED subdata uses decision trees to find a partition of the data, followed by selecting an appropriate sample from each component of the partition. Random forests are used for analyzing the selected subdata. Our method can be employed for a general number of classes in the…
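The partition-then-sample idea in the abstract can be illustrated with a minimal scikit-learn sketch. This is not the paper's PED algorithm: the abstract does not say how the sample within each component is chosen, so the code assumes uniform sampling within each leaf with allocation proportional to leaf size. The function name `select_subdata`, the `max_leaf_nodes` setting, and the synthetic data are all hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier


def select_subdata(X, y, n_sub, max_leaf_nodes=20, seed=0):
    """Partition the data with a decision tree, then sample from each leaf.

    Assumption: proportional allocation and uniform sampling within a leaf.
    The paper's actual selection rule is not given in the abstract.
    """
    rng = np.random.default_rng(seed)
    tree = DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes, random_state=seed)
    tree.fit(X, y)
    leaf_ids = tree.apply(X)  # leaf index of every row; leaves form the partition
    chosen = []
    for leaf in np.unique(leaf_ids):
        members = np.flatnonzero(leaf_ids == leaf)
        # At least one point per leaf, never more than the leaf contains.
        k = min(len(members), max(1, round(n_sub * len(members) / len(y))))
        chosen.append(rng.choice(members, size=k, replace=False))
    return np.concatenate(chosen)


# Synthetic three-class data standing in for the "big data".
X, y = make_classification(
    n_samples=5000, n_features=10, n_informative=5, n_classes=3, random_state=0
)
idx = select_subdata(X, y, n_sub=500)

# Analyze the selected subdata with a random forest, as the abstract describes.
forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X[idx], y[idx])
print(len(idx), forest.score(X, y))
```

Because of rounding and the one-point-per-leaf floor, the selected size is only approximately `n_sub`.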
Taxonomy
Topics: Data Management and Algorithms · Machine Learning and Data Classification · Data Mining Algorithms and Applications
Methods: Logistic Regression
