What Are We Really Measuring? Rethinking Dataset Bias in Web-Scale Natural Image Collections via Unsupervised Semantic Clustering

Amir Hossein Saleknia; Mohammad Sabokrou

arXiv:2604.13610 · cs.CV · April 16, 2026

TL;DR

This paper argues that measured dataset bias in large-scale natural image collections is often driven by superficial resolution artifacts rather than true semantic differences, and proposes an unsupervised clustering method to better measure semantic separability.

Contribution

It introduces an unsupervised framework for assessing semantic similarity between datasets, and uses it to show that supervised dataset classification overestimates bias because it exploits superficial cues.

Findings

01 Supervised dataset classification accuracy is heavily influenced by resolution artifacts.

02 Models can distinguish datasets using non-semantic, procedurally generated images.

03 Unsupervised clustering shows that true semantic separability is much lower than supervised methods suggest.
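
To make finding 03 concrete, here is a minimal sketch of how cluster assignments can be compared against dataset-of-origin labels. It is an illustration only: this page does not describe the authors' actual procedure, and the synthetic 2-D "features", the plain k-means routine, and the purity score are all assumptions chosen for brevity. Two toy datasets share the same two semantic groups, so clusters follow the groups and say nothing about which dataset an image came from.

```python
import numpy as np

def kmeans(x, init_centers, n_iter=20):
    """Plain Lloyd's k-means; returns the cluster index of each row of x."""
    centers = init_centers.copy()
    for _ in range(n_iter):
        # Squared distance from every point to every center.
        d = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d.argmin(axis=1)
        for k in range(len(centers)):
            if np.any(assign == k):
                centers[k] = x[assign == k].mean(axis=0)
    return assign

def purity(assign, labels):
    """Fraction of points whose label is the majority label of their cluster."""
    total = 0
    for k in np.unique(assign):
        total += np.bincount(labels[assign == k]).max()
    return total / len(labels)

# Hypothetical features: two well-separated semantic groups.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=-5.0, scale=1.0, size=(200, 2))
group_b = rng.normal(loc=5.0, scale=1.0, size=(200, 2))
features = np.vstack([group_a, group_b])

# Dataset of origin alternates, so each group is split evenly
# between the two toy datasets (an assumption of this sketch).
dataset_id = np.arange(len(features)) % 2

# Initialise with one point from each group to keep the run deterministic.
assign = kmeans(features, features[[0, -1]])
cluster_purity = purity(assign, dataset_id)
print(f"purity w.r.t. dataset of origin: {cluster_purity:.2f}")
```

With two equally sized datasets, a purity near 0.5 means the clusters carry no information about dataset of origin, while a value near 1.0 would mean the datasets are cleanly separable in feature space.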

Abstract

In computer vision, a prevailing method for quantifying dataset bias is to train a model to distinguish between datasets. High classification accuracy is then interpreted as evidence of meaningful semantic differences. This approach assumes that standard image augmentations successfully suppress low-level, non-semantic cues, and that any remaining performance must therefore reflect true semantic divergence. We demonstrate that this fundamental assumption is flawed within the domain of large-scale natural image collections. High classification accuracy is often driven by resolution-based artifacts, which are structural fingerprints arising from native image resolution distributions and interpolation effects during resizing. These artifacts form robust, dataset-specific signatures that persist despite conventional image corruptions. Through controlled experiments, we show that models…
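
The resolution-artifact argument can be illustrated with a toy experiment. The sketch below is not the authors' experimental setup; the nearest-neighbour upsampling, the adjacent-pixel-difference statistic, and the image sizes are assumptions made for illustration. Both toy datasets contain only uniform noise, so they have no semantic content, yet a single threshold on a low-level statistic separates them because one set was resized from a smaller native resolution.

```python
import numpy as np

def upsample_nearest(img, factor):
    """Nearest-neighbour upsampling by an integer factor."""
    return np.repeat(np.repeat(img, factor, axis=0), factor, axis=1)

def high_freq_energy(img):
    """Mean absolute difference between horizontally and vertically adjacent pixels."""
    dx = np.abs(np.diff(img, axis=1)).mean()
    dy = np.abs(np.diff(img, axis=0)).mean()
    return 0.5 * (dx + dy)

def make_dataset(rng, n, native, target):
    """n noise images generated at `native` resolution, resized to `target`."""
    factor = target // native
    return [upsample_nearest(rng.random((native, native)), factor)
            for _ in range(n)]

rng = np.random.default_rng(0)
set_a = make_dataset(rng, 50, native=64, target=64)  # already at target size
set_b = make_dataset(rng, 50, native=32, target=64)  # upsampled 2x

energy_native = np.array([high_freq_energy(im) for im in set_a])
energy_upsampled = np.array([high_freq_energy(im) for im in set_b])

# A one-feature "dataset classifier": threshold halfway between the means.
threshold = 0.5 * (energy_native.mean() + energy_upsampled.mean())
correct = (energy_native > threshold).sum() + (energy_upsampled <= threshold).sum()
accuracy = correct / (len(set_a) + len(set_b))
print(f"dataset classification accuracy on pure noise: {accuracy:.2f}")
```

Upsampling duplicates neighbouring pixels, so roughly half of the adjacent differences in the resized set are exactly zero. Real pipelines use smoother interpolation and mixed native resolutions, so the fingerprint is subtler than in this sketch, but the mechanism is the same kind of non-semantic cue.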
