What Makes ImageNet Look Unlike LAION
Ali Shirali, Moritz Hardt

TL;DR
This paper investigates how recreating ImageNet from LAION dataset captions results in a dataset, LAIONet, that differs significantly from the original, highlighting the impact of data collection methods on dataset characteristics and model performance.
Contribution
It introduces LAIONet, a new ImageNet-like dataset created from caption-based searches, and provides a causal explanation for differences in dataset quality and model transferability.
Findings
LAIONet has lower intra-class similarity than ImageNet.
Models trained on ImageNet perform worse on LAIONet.
Caption-based search creates an information bottleneck affecting dataset bias.
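The first finding depends on a way to quantify intra-class similarity. The sketch below is one illustrative option, not the authors' exact procedure: it takes the mean pairwise cosine similarity between image embeddings of a single class. The function names and the toy 2-D vectors are assumptions made for illustration; in practice the embeddings would come from some image encoder.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def intra_class_similarity(embeddings):
    """Mean cosine similarity over all unordered pairs within one class."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("need at least two embeddings")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy embeddings (hypothetical): one visually tight class, one spread-out class.
tight_class = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1]]
spread_class = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

tight_score = intra_class_similarity(tight_class)
spread_score = intra_class_similarity(spread_class)
print(tight_score > spread_score)
```

Under this measure, a class whose images cluster tightly in embedding space scores closer to 1, which is the pattern the findings attribute to ImageNet relative to LAIONet.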
Abstract
ImageNet was famously created from Flickr image search results. What if we recreated ImageNet instead by searching the massive LAION dataset based on image captions alone? In this work, we carry out this counterfactual investigation. We find that the resulting ImageNet recreation, which we call LAIONet, looks distinctly unlike the original. Specifically, the intra-class similarity of images in the original ImageNet is dramatically higher than it is for LAIONet. Consequently, models trained on ImageNet perform significantly worse on LAIONet. We propose a rigorous explanation for the discrepancy in terms of a subtle, yet important, difference in two plausible causal data-generating processes for the respective datasets, that we support with systematic experimentation. In a nutshell, searching based on an image caption alone creates an information bottleneck that mitigates the selection…
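To make the caption-only selection idea concrete, here is a minimal sketch under stated assumptions. The selection function receives the caption and never the image, so whatever it selects cannot depend on image content beyond what the caption conveys. The whole-word matching rule, the single-match requirement, and all names below are hypothetical simplifications, not the filtering pipeline used to build LAIONet.

```python
import re


def caption_matches(caption, class_names):
    """Return the class names that appear as whole words in the caption."""
    text = caption.lower()
    return [
        name
        for name in class_names
        if re.search(r"\b" + re.escape(name.lower()) + r"\b", text)
    ]


def select_by_caption(samples, class_names):
    """Assign image ids to classes using captions only.

    `samples` is an iterable of (image_id, caption) pairs. Image pixels are
    never inspected. Captions matching zero or several classes are skipped
    (an assumption made here to avoid ambiguous labels).
    """
    selected = {name: [] for name in class_names}
    for image_id, caption in samples:
        matches = caption_matches(caption, class_names)
        if len(matches) == 1:
            selected[matches[0]].append(image_id)
    return selected


# Toy data (hypothetical ids and captions).
samples = [
    ("img0", "A goldfish swimming in a bowl"),
    ("img1", "Tabby cat chasing a goldfish"),
    ("img2", "Sunset over the harbor"),
    ("img3", "Sleepy tabby on a sofa"),
]
selected = select_by_caption(samples, ["goldfish", "tabby"])
print(selected)
```

Because `select_by_caption` takes no image argument, two images with identical captions are always treated identically, which is the bottleneck the abstract describes.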
Peer Reviews
Decision · Submitted to ICLR 2024
- The viewpoint of connecting and comparing older and newer datasets is interesting.
- The writing is generally clear and easy to follow.
- The only conclusion of this paper is that ImageNet is more of an easy dataset than LAION because the images are curated dependent on image similarities, which makes images of each class less diverse and has smaller intra-class variances. This conclusion is unsurprising since ImageNet is curated very carefully to exclude outlier examples.
- I do not see much value of the findings. Visual datasets should not be curated only using text descriptions, which leads to a higher probability of getting
- Analyzing mainstream datasets helps deepen researchers' understanding of the data. At the same time, it aids the community in designing future datasets with minimal human-induced bias, which in turn helps enhance the generalization performance of models.
- This paper is logically structured, and the conclusions regarding the differences between the ImageNet and LAION datasets are comprehensive. Starting from the inconsistent dataset filtering processes, it further analyzes the differences in
- This paper still lacks a central objective. Although a series of analyses point out the differences between ImageNet and LAIONet, both Figure 1 and Figure 5 seem to indicate that model performance on ImageNet and LAIONet is positively correlated. This suggests that LAIONet doesn't offer additional indicative value for model performance analysis, which is typically the most important for classification datasets.
- Additionally, the ImageNet dataset and the LAION dataset were created at different
- The paper looks into the data creation process and how to mitigate biases in the process, which is important for the community.
- The paper is easy to read and understand, all the experiments are explained very clearly.
- The paper claims in Section 1.1 "Choosing an image reveals nothing more about the image than what can be learned from its textual representation. This powerful conditional independence property limits how much selection can bias the distribution of the image. In contrast, in the case of ImageNet (Figure 2b), there is a link from the image to the selection decision.". This isn't accurate -- choosing an image gives more information than the text representation which is used for LAIONet selection
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · COVID-19 diagnosis using AI
