Beyond Point Estimates: Distributional Uncertainty in Machine Learning Performance Evaluation
Christoph Lehmann, Yahor Paromau

TL;DR
This paper introduces a distributional approach to evaluating machine learning models: performance metrics are treated as random variables and their variability is analyzed directly, which is especially useful for small sample sizes.
Contribution
The paper proposes methods for analyzing the empirical distribution of performance metrics, enabling statistical inference about variability and uncertainty in model evaluation.
Findings
Statistical inference on the performance distribution is feasible with small samples (10-25)
Standard confidence intervals remain valid for small sample sizes
Distributional evaluation offers more detailed model comparison and risk assessment
Abstract
Machine learning models are often evaluated using point estimates of performance metrics such as accuracy, F1 score, or mean squared error. Such summaries fail to capture the inherent variability induced by stochastic elements of the training process, including data splitting, initialization, and hyperparameter optimization. This work proposes a distributional perspective on model evaluation by treating performance metrics as random quantities rather than fixed values. Instead of focusing solely on aggregate measures, empirical distributions of performance metrics are analyzed using quantiles and corresponding confidence intervals. The study investigates point and interval estimation of quantiles based on real-data use cases for classification and regression tasks, complemented by simulation studies for validation. Special emphasis is placed on small sample sizes, reflecting…
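To make the idea of point and interval estimation of quantiles concrete, here is a minimal Python sketch. It uses the textbook distribution-free construction based on order statistics and the binomial distribution, which may differ from the estimators actually studied in the paper. The function names (`binom_cdf`, `empirical_quantile`, `quantile_ci`) are hypothetical, the scores in the usage example are synthetic placeholders rather than results from the paper, and the coverage formula assumes independent runs with a continuous metric (no ties).

```python
import math
import random


def binom_cdf(k: int, n: int, p: float) -> float:
    """Return P(B <= k) for B ~ Binomial(n, p)."""
    if k < 0:
        return 0.0
    k = min(k, n)
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def empirical_quantile(scores, p):
    """Point estimate: inverse empirical CDF, i.e. the ceil(n*p)-th order statistic."""
    x = sorted(scores)
    rank = max(1, math.ceil(len(x) * p))
    return x[rank - 1]


def quantile_ci(scores, p, alpha=0.05):
    """Distribution-free confidence interval for the p-quantile.

    Searches all pairs of order statistics (x_(l), x_(u)) with l < u and
    keeps the pair with the smallest rank distance whose coverage
    P(l <= B <= u - 1), B ~ Binomial(n, p), is at least 1 - alpha.
    Returns (lower, upper, coverage), or None if no pair reaches the level.
    """
    x = sorted(scores)
    n = len(x)
    best = None  # (rank width, -coverage, l, u); smaller tuple is better
    for l in range(1, n + 1):
        for u in range(l + 1, n + 1):
            coverage = binom_cdf(u - 1, n, p) - binom_cdf(l - 1, n, p)
            if coverage >= 1 - alpha:
                candidate = (u - l, -coverage, l, u)
                if best is None or candidate < best:
                    best = candidate
    if best is None:
        return None
    _, neg_coverage, l, u = best
    return x[l - 1], x[u - 1], -neg_coverage


if __name__ == "__main__":
    # Synthetic placeholder scores standing in for 15 repeated training runs.
    rng = random.Random(0)
    scores = [rng.gauss(0.85, 0.02) for _ in range(15)]
    print("median estimate:", empirical_quantile(scores, 0.5))
    print("95% CI for the median:", quantile_ci(scores, 0.5))
    print("95% CI for the 5% quantile:", quantile_ci(scores, 0.05))
```

In this sketch the last call returns `None`: with 15 observations no pair of order statistics can cover the 5% quantile at the 95% level, because the widest possible interval (minimum to maximum) has coverage of only about 0.54. This illustrates why small sample sizes limit which quantiles can be bounded without distributional assumptions.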
