Modeling Generalization in Machine Learning: A Methodological and Computational Study
Pietro Barbiero, Giovanni Squillero, Alberto Tonda

TL;DR
This paper investigates how various data characteristics influence machine learning generalization, emphasizing the importance of the convex hull concept and challenging assumptions about the curse of dimensionality.
Contribution
It provides a comprehensive meta-analysis linking data set features to generalization performance, highlighting the role of convex hulls and questioning dimensionality effects.
Findings
Analysing the convex hull of the training set is an effective way to assess generalization.
Dimensionality correlates only weakly with generalization.
The results challenge the assumed impact of the curse of dimensionality on generalization.
Abstract
As machine learning becomes more and more available to the general public, theoretical questions are turning into pressing practical issues. Possibly, one of the most relevant concerns is the assessment of our confidence in trusting machine learning predictions. In many real-world cases, it is of utmost importance to estimate the capabilities of a machine learning algorithm to generalize, i.e., to provide accurate predictions on unseen data, depending on the characteristics of the target problem. In this work, we perform a meta-analysis of 109 publicly-available classification data sets, modeling machine learning generalization as a function of a variety of data set characteristics, ranging from number of samples to intrinsic dimensionality, from class-wise feature skewness to F1 evaluated on test samples falling outside the convex hull of the training set. Experimental results…
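The abstract separates test samples that fall inside the convex hull of the training set (interpolation) from those that fall outside it (extrapolation). The listing here does not say how the authors compute hull membership, so the following is only a minimal sketch of one possible check, not the paper's procedure. It uses a Frank-Wolfe (Gilbert-style) iteration in pure Python. The names `in_convex_hull` and `split_by_hull`, the tolerance, and the iteration cap are all hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]


def _dot(a: Point, b: Point) -> float:
    return sum(x * y for x, y in zip(a, b))


def in_convex_hull(query: Point, train: Sequence[Point],
                   tol: float = 1e-6, max_iter: int = 1000) -> bool:
    """Approximate test of whether `query` lies in the convex hull of `train`.

    Illustrative assumption, not the paper's method: search for the point of
    the hull closest to `query` with a Frank-Wolfe iteration.
    """
    if not train:
        raise ValueError("train must contain at least one point")
    # Shift coordinates so that the query sits at the origin.
    pts = [[a - q for a, q in zip(p, query)] for p in train]
    x = list(pts[0])  # current iterate, always a point of the shifted hull
    for _ in range(max_iter):
        if _dot(x, x) <= tol * tol:
            return True  # the hull comes within `tol` of the query
        # Training point that lies furthest in the direction of the origin.
        s = min(pts, key=lambda p: _dot(x, p))
        if _dot(x, s) > 0.0:
            # Every hull point y has <x, y> > 0, so a hyperplane separates
            # the origin (the query) from the hull: the query is outside.
            return False
        # Exact line search on the segment from x to s.
        d = [si - xi for xi, si in zip(x, s)]
        t = min(1.0, max(0.0, -_dot(x, d) / _dot(d, d)))
        x = [xi + t * di for xi, di in zip(x, d)]
    # No separating hyperplane found: the query is on or very near the
    # boundary, and this sketch counts it as inside.
    return True


def split_by_hull(train: Sequence[Point],
                  test: Sequence[Point]) -> Tuple[List[int], List[int]]:
    """Return indices of test points inside and outside the training hull."""
    inside: List[int] = []
    outside: List[int] = []
    for i, q in enumerate(test):
        (inside if in_convex_hull(q, train) else outside).append(i)
    return inside, outside


if __name__ == "__main__":
    square = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
    queries = [(0.5, 0.5), (2.0, 2.0), (0.25, 0.6)]
    print(split_by_hull(square, queries))
```

A score such as F1 could then be computed separately on the `outside` indices to look at extrapolated predictions in isolation. A linear-programming feasibility check is a common exact alternative to this iterative approximation.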
Taxonomy
Topics: Machine Learning and Data Classification · Machine Learning and Algorithms · Imbalanced Data Classification Techniques
