Standardness Clouds Meaning: A Position Regarding the Informed Usage of Standard Datasets
Tim Cech, Ole Wegen, Daniel Atzberger, Rico Richter, Willy Scheibel, Jürgen Döllner

TL;DR
This paper critically examines the use of standard datasets in Machine Learning, demonstrating that their labels may not always match the intended categories, which can impair model trust and effectiveness.
Contribution
It introduces a method combining Grounded Theory and Hypotheses Testing via Visualization to evaluate dataset label quality and applicability.
Findings
20 Newsgroups labels are imprecise, affecting model learning.
MNIST labels are well-defined, supporting effective learning.
Critical assessment of datasets enhances trust and model validity.
Abstract
Standard datasets are frequently used to train and evaluate Machine Learning models. However, the assumed standardness of these datasets leads to a lack of in-depth discussion on how their labels match the derived categories for the respective use case, which we demonstrate by reviewing recent literature that employs standard datasets. We find that the standardness of the datasets seems to cloud their actual coherency and applicability, thus impeding trust in Machine Learning models trained on these datasets. Therefore, we argue against the uncritical use of standard datasets and advocate for their critical examination instead. For this, we suggest using Grounded Theory in combination with Hypotheses Testing through Visualization as methods to evaluate the match between use case, derived categories, and labels. We exemplify this approach by applying it to the 20 Newsgroups dataset…
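The abstract's question of whether labels match the underlying categories can be illustrated with a small numeric check. The sketch below is not the authors' method (they propose Grounded Theory together with Hypotheses Testing through Visualization). It is a simplified stand-in that compares the average text similarity of document pairs sharing a label with that of pairs carrying different labels. The function names (`bag_of_words`, `cosine`, `label_coherence`) and the four toy documents are assumptions made for illustration, not taken from the paper or from 20 Newsgroups.

```python
import math
from collections import Counter
from itertools import combinations


def bag_of_words(text):
    """Lowercase, whitespace-tokenized term counts (a deliberately crude representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def label_coherence(docs, labels):
    """Return (mean same-label similarity, mean cross-label similarity)."""
    vectors = [bag_of_words(d) for d in docs]
    intra, inter = [], []
    for i, j in combinations(range(len(docs)), 2):
        sim = cosine(vectors[i], vectors[j])
        (intra if labels[i] == labels[j] else inter).append(sim)

    def mean(xs):
        return sum(xs) / len(xs) if xs else 0.0

    return mean(intra), mean(inter)


# Hypothetical toy corpus: two documents per topic.
docs = [
    "the gpu renders the image fast",
    "the image renders on the gpu",
    "the team won the hockey game",
    "the hockey team lost the game",
]

# Labels that follow the topics, and labels that cut across them.
matching = ["graphics", "graphics", "sport", "sport"]
mismatched = ["graphics", "sport", "graphics", "sport"]

intra_ok, inter_ok = label_coherence(docs, matching)
intra_bad, inter_bad = label_coherence(docs, mismatched)
print(f"matching labels:   intra={intra_ok:.3f} inter={inter_ok:.3f}")
print(f"mismatched labels: intra={intra_bad:.3f} inter={inter_bad:.3f}")
```

When same-label similarity is not clearly higher than cross-label similarity, that is only a hint that the labels may not follow the content. It says nothing about whether the labels fit a particular use case, which is the qualitative question the paper addresses.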
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI · Artificial Intelligence in Healthcare and Education
