Does the evaluation stand up to evaluation? A first-principle approach to the evaluation of classifiers
K. Dyrland, A. S. Lundervold, P.G.L. Porta Mana

TL;DR
This paper argues that current evaluation practices for classifiers lack a principled foundation and proposes one based on decision theory: a metric should be a linear combination of confusion-matrix elements, weighted by problem-specific utilities. Under this framework, several popular metrics turn out to be suboptimal.
Contribution
It introduces a decision-theoretic framework for classifier evaluation, showing that optimal metrics are linear combinations of confusion-matrix elements with tailored utilities, and demonstrates the limitations of common metrics.
Findings
Popular metrics such as precision and balanced accuracy are never optimal under the proposed framework.
Evaluation metrics should be tailored to specific problems using utilities.
Many existing metrics can lead to avoidable evaluation errors.
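To make the findings concrete, here is a minimal sketch of a utility-weighted metric, assuming a 2x2 confusion matrix with rows as true classes and columns as predicted classes. The function name `utility_yield`, the two confusion matrices, and the utility values are all hypothetical, chosen only for illustration; they are not taken from the paper.

```python
import numpy as np

def utility_yield(confusion, utilities):
    """Average utility per item: sum_ij U[i, j] * C[i, j] / N."""
    confusion = np.asarray(confusion, dtype=float)
    utilities = np.asarray(utilities, dtype=float)
    return float((utilities * confusion).sum() / confusion.sum())

# Hypothetical confusion matrices for two classifiers on the same
# 100-item test set (rows: true class, columns: predicted class).
conf_a = np.array([[40, 10],
                   [5, 45]])
conf_b = np.array([[48, 2],
                   [15, 35]])

# Accuracy is the special case where the utility matrix is the identity.
accuracy_utilities = np.eye(2)

# Hypothetical problem-specific utilities: predicting class 1 for a
# true class-0 item is penalised heavily (assumed values).
custom_utilities = np.array([[1.0, -5.0],
                             [0.0, 1.0]])

acc_a = utility_yield(conf_a, accuracy_utilities)
acc_b = utility_yield(conf_b, accuracy_utilities)
util_a = utility_yield(conf_a, custom_utilities)
util_b = utility_yield(conf_b, custom_utilities)

print(f"accuracy: A={acc_a:.2f}, B={acc_b:.2f}")
print(f"utility:  A={util_a:.2f}, B={util_b:.2f}")
```

With these made-up numbers, accuracy ranks classifier A above B while the utility-weighted metric ranks B above A. This is the kind of disagreement that tailoring the utilities to the problem is meant to expose.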
Abstract
How can one meaningfully make a measurement, if the meter does not conform to any standard and its scale expands or shrinks depending on what is measured? In the present work it is argued that current evaluation practices for machine-learning classifiers are affected by this kind of problem, leading to negative consequences when classifiers are put to real use; consequences that could have been avoided. It is proposed that evaluation be grounded on Decision Theory, and the implications of such foundation are explored. The main result is that every evaluation metric must be a linear combination of confusion-matrix elements, with coefficients - "utilities" - that depend on the specific classification problem. For binary classification, the space of such possible metrics is effectively two-dimensional. It is shown that popular metrics such as precision, balanced accuracy, Matthews…
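One plausible reading of "effectively two-dimensional" (an assumption here, since the abstract does not spell out the argument) is that a 2x2 utility matrix has four entries, but adding a constant to all of them or multiplying them by a positive factor cannot change which classifier is ranked higher, leaving two free parameters. The sketch below checks that invariance numerically; the confusion matrices, utilities, and rescaling constants are hypothetical.

```python
import numpy as np

def utility_yield(confusion, utilities):
    """Average utility per item: sum_ij U[i, j] * C[i, j] / N."""
    confusion = np.asarray(confusion, dtype=float)
    utilities = np.asarray(utilities, dtype=float)
    return float((utilities * confusion).sum() / confusion.sum())

# Hypothetical confusion matrices (rows: true class, columns: predicted).
conf_a = np.array([[40, 10],
                   [5, 45]])
conf_b = np.array([[48, 2],
                   [15, 35]])

# Hypothetical utilities and a positive affine rescaling of them.
utilities = np.array([[1.0, -5.0],
                      [0.0, 1.0]])
scale, shift = 3.0, 7.0  # assumed values; any scale > 0 works
rescaled = scale * utilities + shift

diff_original = utility_yield(conf_a, utilities) - utility_yield(conf_b, utilities)
diff_rescaled = utility_yield(conf_a, rescaled) - utility_yield(conf_b, rescaled)

# The shift cancels in the difference and the scale only stretches it,
# so the sign (i.e. the ranking of A versus B) is unchanged.
same_ranking = np.sign(diff_original) == np.sign(diff_rescaled)
print(same_ranking)
```

Because the yield is normalised by the number of items, the shift adds the same constant to every classifier's score, which is why it drops out of the comparison.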
Taxonomy
Topics: Neural Networks and Applications · Fault Detection and Control Systems
