On the Complexity of Labeled Datasets
Rodrigo Fernandes de Mello

TL;DR
This paper addresses the open problem of analytically computing the Shattering coefficient in Statistical Learning Theory, providing methods to estimate dataset complexity, sample sizes, and embedding effects for both binary and multi-class datasets.
Contribution
It introduces a novel approach using Topology, combinatorics, and data separability to compute the Shattering coefficient, advancing understanding of dataset complexity and sample requirements.
Findings
Computed the Shattering coefficient for binary and multi-class datasets.
Estimated the number of hyperplanes needed in classification scenarios.
Determined training sample sizes for reliable supervised learning.
Abstract
The Statistical Learning Theory (SLT) provides the foundation to ensure that a supervised algorithm generalizes the mapping f: X → Y, given that f is selected from its search space bias F. SLT depends on the Shattering coefficient function to upper bound the empirical risk minimization principle, from which one can estimate the training sample size necessary to ensure probabilistic learning convergence and, most importantly, characterize the capacity of F, including its underfitting and overfitting abilities when addressing specific target problems. However, the analytical solution of the Shattering coefficient has remained an open problem since the first studies by Vapnik and Chervonenkis. In this paper, we address it on specific datasets by employing equivalence relations from Topology,…
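To make the link between the Shattering coefficient and the training sample size concrete, here is a minimal sketch. It is not the paper's method. It assumes the classical Vapnik-Chervonenkis bound P(sup |R_emp(f) - R(f)| > eps) <= 2 N(F, 2n) exp(-n eps^2 / 4), and it takes N(F, n) to be Cover's count of the dichotomies that a single hyperplane can realize on n points in general position in R^d. The function names and the choice of a single hyperplane are illustrative assumptions.

```python
import math


def cover_dichotomies(n: int, d: int) -> int:
    """Dichotomies of n points in general position in R^d that one affine
    hyperplane can realize (Cover's counting function). Used here as an
    assumed stand-in for the Shattering coefficient N(F, n)."""
    return 2 * sum(math.comb(n - 1, k) for k in range(d + 1))


def log_vc_bound(n: int, d: int, eps: float) -> float:
    """Natural log of 2 * N(F, 2n) * exp(-n * eps^2 / 4)."""
    return math.log(2) + math.log(cover_dichotomies(2 * n, d)) - n * eps ** 2 / 4


def min_sample_size(d: int, eps: float, delta: float) -> int:
    """Smallest n found by bisection with bound <= delta.
    Assumes the bound only decreases once it has dropped below delta."""
    target = math.log(delta)
    hi = 1
    while log_vc_bound(hi, d, eps) > target:  # double until the bound holds
        hi *= 2
    lo = hi // 2  # the bound still exceeds delta here (or lo == 0)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if log_vc_bound(mid, d, eps) <= target:
            hi = mid
        else:
            lo = mid
    return hi


if __name__ == "__main__":
    # Hypothetical setting: hyperplanes in the plane, eps = 0.1, delta = 0.05.
    print(min_sample_size(d=2, eps=0.1, delta=0.05))
```

Because the count grows only polynomially in n for fixed d while the exponential term decays, the bound eventually falls below any delta > 0, which is why a finite sample size exists in this setting.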
Taxonomy
Topics: Machine Learning and Algorithms · Machine Learning and Data Classification · Domain Adaptation and Few-Shot Learning
