Probabilistic Bounds for Data Storage with Feature Selection and Undersampling
Ghurumuruhan Ganesan

TL;DR
This paper develops probabilistic bounds for efficient data storage under feature selection and undersampling, with applications to correlated sources, nonstationary data, and imbalanced datasets.
Contribution
It introduces new bounds for encoding correlated sources with feature selection and undersampling, extending classical information theory results to practical data science scenarios.
Findings
Derived a Slepian-Wolf type bound for nonstationary data encoding.
Established optimal bounds on the minimum size of a neighbourhood-dominating set in random graphs, where a fraction of each vertex's neighbourhood must be present.
Illustrated the application of bounds to k-nearest neighbors classification.
Abstract
In this paper we consider data storage from a probabilistic point of view and obtain bounds for efficient storage in the presence of feature selection and undersampling, both of which are important from the data science perspective. First, we consider encoding of correlated sources for nonstationary data and obtain a Slepian-Wolf type result for the probability of error. We then reinterpret our result by allowing one source to be the set of features to be discarded and the other source to be the remaining data to be encoded. Next, we consider neighbourhood domination in random graphs, where we impose the condition that a fraction of the neighbourhood must be present for each vertex, and obtain optimal bounds on the minimum size of such a set. We show how such sets are useful for data undersampling in the presence of imbalanced datasets and briefly illustrate our result using \(k\)-nearest neighbours…
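To make the neighbourhood-domination condition concrete, here is a minimal Python sketch. It is not the paper's construction. It assumes one plausible reading of the condition: a vertex set is acceptable if, for every vertex, at least a fraction `alpha` of its neighbours (rounded up) lie in the set. The Erdős–Rényi graph model, the greedy heuristic, and all function names are illustrative assumptions, and the greedy output is only an upper bound on the minimum size, not the optimum.

```python
import math
import random


def random_graph(n, p, seed=0):
    """Erdos-Renyi G(n, p) as a dict mapping each vertex to its neighbour set."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj


def is_fraction_dominating(adj, chosen, alpha):
    """Assumed condition: every vertex has >= ceil(alpha * degree) neighbours in chosen."""
    return all(
        len(adj[v] & chosen) >= math.ceil(alpha * len(adj[v]))
        for v in adj
    )


def greedy_fraction_dominating_set(adj, alpha):
    """Greedy heuristic (hypothetical helper), valid for 0 <= alpha <= 1."""
    chosen = set()
    # deficit[v] = number of additional neighbours of v that still must be chosen
    deficit = {v: math.ceil(alpha * len(adj[v])) for v in adj}
    while any(d > 0 for d in deficit.values()):
        # Pick the unchosen vertex adjacent to the most still-deficient vertices.
        best = max(
            (u for u in adj if u not in chosen),
            key=lambda u: sum(1 for w in adj[u] if deficit[w] > 0),
        )
        chosen.add(best)
        for w in adj[best]:
            if deficit[w] > 0:
                deficit[w] -= 1
    return chosen


if __name__ == "__main__":
    graph = random_graph(n=200, p=0.05, seed=1)
    subset = greedy_fraction_dominating_set(graph, alpha=0.5)
    print(len(subset), is_fraction_dominating(graph, subset, 0.5))
```

In an undersampling setting, the vertices could stand for data points and the edges for a similarity relation, so that the retained subset keeps a fixed share of every point's neighbours; that mapping is likewise an assumption made for illustration.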
Taxonomy
Topics: Machine Learning and Algorithms · Privacy-Preserving Technologies in Data · Complexity and Algorithms in Graphs
