Standardness Clouds Meaning: A Position Regarding the Informed Usage of Standard Datasets

Tim Cech; Ole Wegen; Daniel Atzberger; Rico Richter; Willy Scheibel; Jürgen Döllner

arXiv:2406.13552 · cs.LG · January 8, 2025 · 2 cites

Open Access

TL;DR

This paper critically examines the use of standard datasets in Machine Learning, demonstrating that their labels may not always match the intended categories, which can impair model trust and effectiveness.

Contribution

It introduces a method combining Grounded Theory and Hypotheses Testing via Visualization to evaluate dataset label quality and applicability.

Findings

01. The 20 Newsgroups labels are imprecise, affecting model learning.

02. MNIST labels are well-defined, supporting effective learning.

03. Critical assessment of datasets enhances trust and model validity.

Abstract

Standard datasets are frequently used to train and evaluate Machine Learning models. However, the assumed standardness of these datasets leads to a lack of in-depth discussion on how their labels match the derived categories for the respective use case, which we demonstrate by reviewing recent literature that employs standard datasets. We find that the standardness of the datasets seems to cloud their actual coherency and applicability, thus impeding the trust in Machine Learning models trained on these datasets. Therefore, we argue against the uncritical use of standard datasets and advocate for their critical examination instead. For this, we suggest using Grounded Theory in combination with Hypotheses Testing through Visualization as methods to evaluate the match between use case, derived categories, and labels. We exemplify this approach by applying it to the 20 Newsgroups dataset…

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI · Artificial Intelligence in Healthcare and Education