X-Factor: Quality Is a Dataset-Intrinsic Property
Josiah Couch, Miao Li, Rima Arnaout, Ramy Arnaout

TL;DR
This paper demonstrates that dataset quality is an intrinsic property independent of model architecture, size, and class balance, and is primarily determined by the quality of individual classes within the dataset.
Contribution
The paper provides empirical evidence that dataset quality is an intrinsic, architecture-independent property that is linked to class-level quality and is not explained by dataset size or class balance alone.
Findings
Dataset quality correlates strongly across different classifiers ($R^2=0.79$).
Quality is an emergent property of class-level characteristics.
Dataset size and class balance do not fully explain performance variations.
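To make the cross-classifier correlation concrete, here is a minimal sketch of how an $R^2$ value between two classifiers could be computed from per-dataset test accuracies. It assumes $R^2$ means the squared Pearson correlation; this summary does not say how the paper computes it. The function name `r_squared` and the synthetic accuracies are hypothetical and are not taken from the paper.

```python
import numpy as np


def r_squared(scores_a, scores_b):
    """Squared Pearson correlation between two classifiers' per-dataset scores.

    Assumption: R^2 is taken as the squared Pearson r; the paper's exact
    definition is not stated in this summary.
    """
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    if a.ndim != 1 or a.shape != b.shape:
        raise ValueError("scores must be 1-D arrays of equal length")
    if a.std() == 0 or b.std() == 0:
        raise ValueError("correlation is undefined for constant scores")
    r = np.corrcoef(a, b)[0, 1]
    return float(r ** 2)


# Synthetic illustration only: a hypothetical latent per-dataset quality
# drives the accuracy of two different classifiers, plus independent noise.
rng = np.random.default_rng(0)
quality = rng.normal(size=200)
acc_a = 0.80 + 0.05 * quality + rng.normal(scale=0.02, size=200)
acc_b = 0.70 + 0.08 * quality + rng.normal(scale=0.03, size=200)

print(f"R^2 between classifiers: {r_squared(acc_a, acc_b):.2f}")
```

If quality were a joint property of dataset and architecture rather than of the dataset alone, one would expect this value to be low for architecturally dissimilar classifiers.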
Abstract
In the universal quest to optimize machine-learning classifiers, three factors -- model architecture, dataset size, and class balance -- have been shown to influence test-time performance but do not fully account for it. Previously, evidence was presented for an additional factor that can be referred to as dataset quality, but it was unclear whether this was actually a joint property of the dataset and the model architecture, or an intrinsic property of the dataset itself. If quality is truly dataset-intrinsic and independent of model architecture, dataset size, and class balance, then the same datasets should perform better (or worse) regardless of these other factors. To test this hypothesis, here we create thousands of datasets, each controlled for size and class balance, and use them to train classifiers with a wide range of architectures, from random forests and support-vector…
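The idea of controlling each dataset for size and class balance can be sketched as drawing the same number of examples from every class in a larger labeled pool. This is only an illustration of that constraint, not the authors' sampling procedure; the function name `sample_balanced_subset` and its parameters are hypothetical.

```python
import numpy as np


def sample_balanced_subset(labels, n_per_class, rng):
    """Return sorted indices of a subset with exactly n_per_class items per class.

    Hypothetical helper: every subset drawn this way has the same total size
    and the same (uniform) class balance, so neither factor varies.
    """
    labels = np.asarray(labels)
    chosen = []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        if idx.size < n_per_class:
            raise ValueError(f"class {cls!r} has only {idx.size} examples")
        # Sample without replacement so no example is repeated.
        chosen.append(rng.choice(idx, size=n_per_class, replace=False))
    return np.sort(np.concatenate(chosen))


# Toy label pool with unequal class counts (illustrative values only).
pool_labels = np.array([0] * 50 + [1] * 30 + [2] * 40)
rng = np.random.default_rng(0)

# Draw several size- and balance-matched datasets from the same pool.
subsets = [sample_balanced_subset(pool_labels, 10, rng) for _ in range(5)]
for subset in subsets:
    print(len(subset), np.bincount(pool_labels[subset]))
```

Because every subset drawn this way has identical size and balance, any remaining performance differences between them cannot be attributed to those two factors.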
