On the Complexity of Labeled Datasets

Rodrigo Fernandes de Mello

arXiv:1911.05461·cs.LG·July 20, 2021·1 cites

TL;DR

This paper tackles the open problem of computing the Shattering coefficient analytically in Statistical Learning Theory. Working on specific datasets, it provides methods to estimate dataset complexity, required training sample sizes, and the effects of embeddings, for both binary and multi-class data.

Contribution

The paper combines equivalence relations from Topology with combinatorics and data separability to compute the Shattering coefficient, which in turn characterizes dataset complexity and training sample requirements.

Findings

01. Computed Shattering coefficient for binary and multi-class datasets.

02. Estimated the number of hyperplanes needed in classification scenarios.

03. Determined training sample sizes for reliable supervised learning.

Abstract

The Statistical Learning Theory (SLT) provides the foundation to ensure that a supervised algorithm generalizes the mapping $f : X \to Y$ given $f$ is selected from its search space bias $F$. SLT depends on the Shattering coefficient function $N(F, n)$ to upper bound the empirical risk minimization principle, from which one can estimate the necessary training sample size to ensure the probabilistic learning convergence and, most importantly, the characterization of the capacity of $F$, including its underfitting and overfitting abilities while addressing specific target problems. However, the analytical solution of the Shattering coefficient is still an open problem since the first studies by Vapnik and Chervonenkis in 1962, which we address on specific datasets, in this paper, by employing equivalence relations from Topology,…
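The abstract describes using the Shattering coefficient $N(F, n)$ to bound empirical risk minimization and, from that bound, estimating a training sample size. A minimal sketch of that chain of reasoning follows. It is not the paper's method: it assumes the classical Vapnik bound $P(\sup_{f \in F} |R_{emp}(f) - R(f)| > \epsilon) \le 2\,N(F, 2n)\,e^{-n\epsilon^2/4}$, and it takes Cover's counting function for affine hyperplanes on points in general position as a stand-in for $N(F, n)$. All function names are hypothetical.

```python
import math


def cover_dichotomies(n: int, d: int) -> int:
    """Cover's count of dichotomies of n points in general position in R^d
    that a single affine hyperplane can realize. Used here as an assumed
    stand-in for the Shattering coefficient N(F, n)."""
    if n < 1 or d < 1:
        raise ValueError("n and d must be positive integers")
    # math.comb(n - 1, k) is 0 when k > n - 1, so small n yields 2**n.
    return 2 * sum(math.comb(n - 1, k) for k in range(d + 1))


def log_vapnik_bound(n: int, d: int, eps: float) -> float:
    """Natural log of 2 * N(F, 2n) * exp(-n * eps^2 / 4).
    Working in log space avoids overflow for large counts."""
    return math.log(2) + math.log(cover_dichotomies(2 * n, d)) - n * eps**2 / 4


def min_sample_size(d: int, eps: float, delta: float) -> int:
    """Find an n at which the bound drops to delta or below while the bound
    at n - 1 is still above delta. This assumes the bound crosses delta
    only once, which is an assumption of this sketch."""
    if eps <= 0 or not 0 < delta < 1:
        raise ValueError("need eps > 0 and 0 < delta < 1")
    target = math.log(delta)
    if log_vapnik_bound(1, d, eps) <= target:
        return 1
    lo, hi = 1, 2
    # Double hi until the bound is satisfied; the exponential term
    # eventually dominates the polynomial count, so this terminates.
    while log_vapnik_bound(hi, d, eps) > target:
        lo, hi = hi, hi * 2
    # Bisect: bound(lo) > delta and bound(hi) <= delta hold throughout.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if log_vapnik_bound(mid, d, eps) <= target:
            hi = mid
        else:
            lo = mid
    return hi


if __name__ == "__main__":
    # Sample size for one hyperplane in the plane, eps = 0.1, delta = 0.05.
    print(min_sample_size(d=2, eps=0.1, delta=0.05))
```

Because the count grows polynomially in $n$ for fixed $d$ while the exponential term decays, the bound eventually falls below any $\delta$; a richer search space (a larger coefficient) pushes the required $n$ upward.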

Taxonomy

Topics: Machine Learning and Algorithms · Machine Learning and Data Classification · Domain Adaptation and Few-Shot Learning