Modeling Generalization in Machine Learning: A Methodological and Computational Study

Pietro Barbiero; Giovanni Squillero; Alberto Tonda

arXiv:2006.15680·cs.LG·June 30, 2020·28 cites

TL;DR

This paper investigates how various data characteristics influence machine learning generalization, emphasizing the importance of the convex hull concept and challenging assumptions about the curse of dimensionality.

Contribution

It provides a comprehensive meta-analysis linking data set features to generalization performance, highlighting the role of convex hulls and questioning dimensionality effects.

Findings

01. Convex hull analysis is an effective way to assess generalization.

02. Dimensionality correlates only weakly with generalization.

03. The results challenge the assumed impact of the curse of dimensionality on generalization.

Abstract

As machine learning becomes more and more available to the general public, theoretical questions are turning into pressing practical issues. Possibly, one of the most relevant concerns is the assessment of our confidence in trusting machine learning predictions. In many real-world cases, it is of utmost importance to estimate the capabilities of a machine learning algorithm to generalize, i.e., to provide accurate predictions on unseen data, depending on the characteristics of the target problem. In this work, we perform a meta-analysis of 109 publicly-available classification data sets, modeling machine learning generalization as a function of a variety of data set characteristics, ranging from number of samples to intrinsic dimensionality, from class-wise feature skewness to $F_1$ evaluated on test samples falling outside the convex hull of the training set. Experimental results…
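The abstract refers to test samples that fall outside the convex hull of the training set. The excerpt here does not say how the authors test hull membership, so the following is only a minimal sketch under our own assumptions: it uses a Frank-Wolfe (Gilbert-style) iteration in pure Python, and the names `in_convex_hull` and `split_by_hull` are hypothetical helpers, not functions from the paper or its repository. A point is reported as inside when a point of the hull is found within `tol` of it, and as outside when a separating hyperplane is found.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def in_convex_hull(point: Vector, train: Sequence[Vector],
                   tol: float = 1e-6, max_iter: int = 10_000) -> bool:
    """Approximate test of whether `point` lies in the convex hull of `train`.

    Illustrative assumption, not the paper's procedure. Points on or very
    close to the hull boundary may be reported as outside.
    """
    # y is always a convex combination of training samples.
    y = [float(c) for c in train[0]]
    for _ in range(max_iter):
        g = [p - q for p, q in zip(point, y)]  # direction from y to the query
        if dot(g, g) <= tol * tol:
            return True  # a hull point lies within tol of the query
        # Training sample that extends furthest along direction g.
        s = max(train, key=lambda v: dot(v, g))
        if dot(s, g) < dot(point, g):
            # The hyperplane with normal g separates the query from every
            # training sample, hence from the whole hull.
            return False
        d = [a - b for a, b in zip(s, y)]
        dd = dot(d, d)
        if dd == 0.0:
            return False  # defensive guard; no further progress is possible
        # Exact line search on the segment from y to s, clamped to [0, 1].
        t = min(1.0, max(0.0, dot(g, d) / dd))
        y = [b + t * c for b, c in zip(y, d)]
    return False  # undecided within max_iter; treated as outside


def split_by_hull(train: Sequence[Vector],
                  test: Sequence[Vector]) -> Tuple[List[int], List[int]]:
    """Return (inside, outside) index lists of test samples w.r.t. the hull."""
    inside: List[int] = []
    outside: List[int] = []
    for i, x in enumerate(test):
        (inside if in_convex_hull(x, train) else outside).append(i)
    return inside, outside


# Toy data: the four corners of the unit square as the "training set".
square = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
print(split_by_hull(square, [(0.5, 0.5), (2.0, 2.0), (1.5, 0.5)]))
```

Given such a split, a score like $F_1$ could then be computed separately on the `outside` indices, using predictions from whatever classifier is under study. For larger data sets a linear-programming feasibility check is a common alternative to this iteration.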

Code & Models

Repositories

pietrobarbiero/dataset-characteristics
Official

Taxonomy

Topics: Machine Learning and Data Classification · Machine Learning and Algorithms · Imbalanced Data Classification Techniques