A Comprehensive Assessment Benchmark for Rigorously Evaluating Deep Learning Image Classifiers
Michael W. Spratling

TL;DR
This paper introduces a comprehensive benchmark for evaluating deep learning image classifiers across diverse data types, revealing current models' vulnerabilities and emphasizing the need for more robust evaluation methods.
Contribution
It proposes a unified evaluation benchmark using multiple data types and a single metric, highlighting the limitations of current evaluation protocols.
Findings
Current deep neural networks are vulnerable to certain data types.
State-of-the-art models are not reliably robust in diverse scenarios.
Models can be easily fooled into incorrect predictions.
Abstract
Reliable and robust evaluation methods are a necessary first step towards developing machine learning models that are themselves robust and reliable. Unfortunately, current evaluation protocols typically used to assess classifiers fail to comprehensively evaluate performance, as they tend to rely on limited types of test data and ignore others. For example, using the standard test data fails to evaluate the predictions made by the classifier for samples from classes it was not trained on. On the other hand, testing with data containing samples from unknown classes fails to evaluate how well the classifier can predict the labels for known classes. This article advocates benchmarking performance using a wide range of different types of data and using a single metric that can be applied to all such data types to produce a consistent evaluation of performance. Using the proposed benchmark it…
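The idea of scoring every data type with one metric can be illustrated with a minimal sketch. This is not the paper's metric or code: the function names, the convention that `None` marks both an unknown-class target and a rejected prediction, and the example data are all assumptions made for illustration. Under this convention a sample counts as correct if a known-class sample receives its true label, or an unknown-class sample is rejected.

```python
from typing import Dict, Optional, Sequence, Tuple

# Hypothetical convention (an assumption, not taken from the paper):
#   target None     -> the sample belongs to a class the model was not trained on
#   prediction None -> the classifier rejected the sample
Label = Optional[int]


def unified_accuracy(predictions: Sequence[Label], targets: Sequence[Label]) -> float:
    """Fraction of samples handled correctly under one rule for all data types.

    A known-class sample is correct when the predicted label equals the target.
    An unknown-class sample is correct when it is rejected (None == None).
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if len(targets) == 0:
        raise ValueError("cannot score an empty data set")
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)


def benchmark(
    results: Dict[str, Tuple[Sequence[Label], Sequence[Label]]]
) -> Dict[str, float]:
    """Apply the same metric to each named data type."""
    return {name: unified_accuracy(p, t) for name, (p, t) in results.items()}


# Illustrative, made-up predictions for two data types.
example = {
    # Standard test data: one known-class sample is wrongly rejected.
    "standard_test": ([0, 1, None, 2], [0, 1, 1, 2]),
    # Unknown-class data: one sample is wrongly accepted as class 3.
    "unknown_classes": ([None, 3, None], [None, None, None]),
}
scores = benchmark(example)
```

Because the same rule scores both data sets, the per-type numbers are directly comparable, which is the property the abstract argues current protocols lack.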
Taxonomy
Topics: Machine Learning and Data Classification · Anomaly Detection Techniques and Applications · Adversarial Robustness in Machine Learning
