Where Do Images Come From? Analyzing Captions to Geographically Profile Datasets
Abhipsa Basu, Yugam Bahl, Kirti Bhagat, Preethi Seshadri, R. Venkatesh Babu, Danish Pruthi

TL;DR
This paper analyzes the geographic origins of image-caption datasets used to train text-to-image models. It finds a strong skew toward Western countries and a correlation between a country's GDP and its representation, both of which limit the diversity and representativeness of the data.
Contribution
It introduces a method to geographically profile datasets using LLMs and provides a comprehensive analysis of dataset biases across multiple languages and regions.
Findings
Western countries dominate dataset representation.
Representation correlates strongly with GDP.
Generated images lack real-world diversity.
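The GDP finding can be made concrete with a rank correlation between per-country GDP and per-country image counts. The sketch below is only an illustration: the summary does not say which correlation statistic the authors use, so Spearman's rho is an assumption here, and the `gdp` and `image_counts` values are placeholders, not figures from the paper.

```python
import math

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))

# Placeholder values for five hypothetical countries (NOT from the paper).
gdp = [100.0, 50.0, 20.0, 5.0, 1.0]
image_counts = [9000, 1200, 1500, 300, 40]
rho = spearman(gdp, image_counts)
```

A rank-based statistic is a reasonable choice when both GDP and image counts are heavily skewed, since a few very large countries would otherwise dominate a linear correlation.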
Abstract
Recent studies show that text-to-image models often fail to generate geographically representative images, raising concerns about the representativeness of their training data and motivating the question: which parts of the world do these training examples come from? We geographically profile large-scale multimodal datasets by mapping image-caption pairs to countries based on location information extracted from captions using LLMs. Studying English captions from three widely used datasets (Re-LAION, DataComp1B, and Conceptual Captions) across common entities (e.g., house, flag), we find that the United States, the United Kingdom, and Canada account for of samples, while South American and African countries are severely under-represented with only and of images, respectively. We observe a strong correlation between a country's GDP and its representation in…
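To illustrate what "mapping image-caption pairs to countries" involves, here is a minimal sketch of a string-matching baseline, the kind of simple approach the reviews below say the authors compare against. It is not the paper's LLM-based method. The `GAZETTEER` dictionary, the function names, and the sample captions are all hypothetical and exist only for illustration.

```python
import re
from collections import Counter

# Hypothetical mini-gazetteer mapping place names to countries.
# A real profile would need a far larger list, and string matching
# cannot resolve ambiguous mentions the way an LLM might.
GAZETTEER = {
    "united states": "United States",
    "usa": "United States",
    "new york": "United States",
    "london": "United Kingdom",
    "toronto": "Canada",
    "nairobi": "Kenya",
    "lima": "Peru",
}

def countries_in_caption(caption):
    """Return the set of countries whose gazetteer terms appear as whole words."""
    text = caption.lower()
    found = set()
    for term, country in GAZETTEER.items():
        if re.search(r"\b" + re.escape(term) + r"\b", text):
            found.add(country)
    return found

def country_shares(captions):
    """Fraction of geo-tagged captions that mention each country.

    Captions with no match are skipped; a caption naming two countries
    counts toward both, so shares can sum to more than 1.
    """
    counts = Counter()
    tagged = 0
    for caption in captions:
        matched = countries_in_caption(caption)
        if matched:
            tagged += 1
            counts.update(matched)
    return {c: n / tagged for c, n in counts.items()} if tagged else {}

# Made-up captions for illustration only.
captions = [
    "A red brick house in London",
    "Flag flying over a house in New York, USA",
    "Street market in Nairobi",
    "A wooden house by the lake",
]
shares = country_shares(captions)
```

The word-boundary match keeps "lima" from firing inside "climate", but a caption such as "London, Ontario" would still be assigned to the wrong country, which is the sort of ambiguity that motivates a context-aware extractor.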
Peer Reviews
Decision · Submitted to ICLR 2025
- The proposed method named GeoProfiler leverages the latest large language models to extract geographical information, enabling context-aware profiling. As a tool proposed for analytical purposes, it holds considerable value.
- The study presented in this paper is significant in that it quantifies the geographical biases present in datasets, aiding in the creation of more fair and balanced datasets.
- The analysis is backed by statistical validation, which lends a certain degree of reliabilit
- The investigation is limited to a dataset with English captions, which may result in a biased assessment of geographical representation.
- While GeoProfiler is an interesting tool, it remains primarily a methodological approach that combines existing technologies.
- The potential biases inherent in the LLMs and image classification models used are not deeply explored, and the discussion on these limitations is insufficient.
(1) This study proposed a unique perspective for understanding the biases in vision-language models.
(2) Multiple experiments were conducted to analyze the biases in geographic distributions associated with the image-caption pairs.
(1) The contributions of the study were not well articulated. As the authors discussed in the related work section, there have been many efforts in geographically profiling datasets. Some of them focus on visual datasets, and some of them focus on textual datasets. However, the authors claimed that the uniqueness of this study lies in the focus on vision-language datasets. It is unclear why the proposed study is different from the existing methods for profiling datasets such as MS-COCO.
(2) Th
- The research is methodologically sound and exhibits a good level of rigor. The authors systematically develop GeoProfiler, starting with an analysis of simple baselines like string-matching and Named Entity Recognition (NER), and demonstrating their limitations in this context. By employing the Mixtral-8x7B Instruct model, they achieve a high precision of 0.86 in mapping captions to countries. The application of GeoProfiler to the LAION2B-en dataset is thorough, involving the analysis of 1 mil
- Unfortunately, the authors seem to disregard a large body of pre-LLM work on geotagging multimodal data [1,2,3,4]. Their baseline approaches are quite simplistic (even though generally quite effective), while there are numerous text-based approaches that perform very accurately even at the level of city prediction. As a result, I have strong doubts on the accuracy and efficiency of GeoProfiler compared to pre-LLM state-of-the-art methods [2, 4].
- While the authors have tried to extend their
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Subtitles and Audiovisual Media
