Data organization limits the predictability of binary classification
Fei Jing, Zi-Ke Zhang, Yi-Cheng Zhang, Qingpeng Zhang

TL;DR
This paper establishes that the maximum achievable performance of binary classifiers is fundamentally limited by the data's inherent qualities, with theoretical bounds linked to dataset characteristics and class overlap.
Contribution
It introduces a theoretical framework for understanding data-imposed limits on binary classification performance and computes upper bounds for common evaluation metrics.
Findings
Theoretical upper bounds can be attained on real datasets.
Upper bounds for evaluation metrics are linked to dataset properties.
Class overlap influences the maximum achievable classifier performance.
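To make the class-overlap finding concrete, here is a minimal Python sketch. It is an illustration, not the paper's derivation. It assumes "overlap" means identical feature vectors that carry different labels, and `max_accuracy` is a hypothetical helper name. Under that assumption, any deterministic classifier must give all copies of a feature vector the same label, so the best it can do is predict the majority label for each distinct vector.

```python
from collections import Counter, defaultdict

def max_accuracy(features, labels):
    """Upper bound on accuracy for any deterministic classifier on this dataset.

    Assumption (not from the paper): overlap = identical feature vectors
    that appear with conflicting labels.
    """
    label_counts = defaultdict(Counter)
    for x, y in zip(features, labels):
        label_counts[tuple(x)][y] += 1
    # For each distinct feature vector, at most the majority-label
    # samples can be classified correctly.
    correct = sum(max(c.values()) for c in label_counts.values())
    return correct / len(labels)

# Toy data: (0, 1) and (1, 0) each appear with both labels.
X = [(0, 1), (0, 1), (0, 1), (1, 0), (1, 0), (2, 2)]
y = [1, 1, 0, 0, 1, 0]
bound = max_accuracy(X, y)
print(f"accuracy upper bound: {bound:.3f}")
```

In this toy dataset two of the six samples are misclassified by any labeling rule, so no model can exceed 4/6 accuracy, however expressive it is.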
Abstract
The organization of data is widely recognized as having a substantial influence on the efficacy of machine learning algorithms, particularly in binary classification tasks. Our research provides a theoretical framework suggesting that the maximum potential of binary classifiers on a given dataset is primarily constrained by the inherent qualities of the data. Through both theoretical reasoning and empirical examination, we employed standard objective functions, evaluation metrics, and binary classifiers to arrive at two principal conclusions. First, we show that the theoretical upper bound of binary classification performance can be attained on real datasets. This upper bound represents a calculable equilibrium between the learning loss and the evaluation metric. Second, we have computed the precise upper bounds for three commonly used evaluation…
Taxonomy
Topics: Machine Learning and Data Classification · Face and Expression Recognition · Imbalanced Data Classification Techniques
