Where Do Images Come From? Analyzing Captions to Geographically Profile Datasets

Abhipsa Basu; Yugam Bahl; Kirti Bhagat; Preethi Seshadri; R. Venkatesh Babu; Danish Pruthi

arXiv:2602.09775·cs.CV·February 11, 2026

Open Access 3 Reviews

TL;DR

This paper profiles the geographic origins of the image-caption datasets used to train text-to-image models. It finds that representation skews heavily toward Western countries and correlates with a country's GDP, which limits the diversity and representativeness of the data.

Contribution

It introduces a method to geographically profile datasets using LLMs and provides a comprehensive analysis of dataset biases across multiple languages and regions.
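
To make the caption-to-country mapping concrete, the sketch below shows a plain gazetteer lookup. It is not the paper's LLM-based method; it is closer to the simple string-matching style of baseline, and it is included only to illustrate the input and output of the profiling step. The `GAZETTEER` table, `caption_to_country`, and `profile` are hypothetical names introduced here, and the place list is a tiny illustrative sample.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny gazetteer (place name -> country).
# A real lookup table would need thousands of entries.
GAZETTEER = {
    "india": "India",
    "mumbai": "India",
    "united states": "United States",
    "usa": "United States",
    "new york": "United States",
    "brazil": "Brazil",
    "rio de janeiro": "Brazil",
    "kenya": "Kenya",
    "nairobi": "Kenya",
}

# Longer names come first in the alternation so multi-word places
# are preferred when two entries could match at the same position.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(k) for k in sorted(GAZETTEER, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def caption_to_country(caption):
    """Return the country of the leftmost gazetteer match, or None."""
    match = _PATTERN.search(caption)
    return GAZETTEER[match.group(1).lower()] if match else None


def profile(captions):
    """Count how many captions map to each country; unmapped ones are skipped."""
    mapped = (caption_to_country(c) for c in captions)
    return Counter(c for c in mapped if c is not None)
```

A lookup like this cannot tell a place name from a homonym or resolve a landmark that is absent from the table, which is the kind of gap a context-aware extractor is meant to close.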

Findings

01. Western countries dominate dataset representation.

02. Representation correlates strongly with GDP.

03. Generated images lack real-world diversity.

Abstract

Recent studies show that text-to-image models often fail to generate geographically representative images, raising concerns about the representativeness of their training data and motivating the question: which parts of the world do these training examples come from? We geographically profile large-scale multimodal datasets by mapping image-caption pairs to countries based on location information extracted from captions using LLMs. Studying English captions from three widely used datasets (Re-LAION, DataComp1B, and Conceptual Captions) across $20$ common entities (e.g., house, flag), we find that the United States, the United Kingdom, and Canada account for $48.0\%$ of samples, while South American and African countries are severely under-represented with only $1.8\%$ and $3.8\%$ of images, respectively. We observe a strong correlation between a country's GDP and its representation in…

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- The proposed method named GeoProfiler leverages the latest large language models to extract geographical information, enabling context-aware profiling. As a tool proposed for analytical purposes, it holds considerable value.
- The study presented in this paper is significant in that it quantifies the geographical biases present in datasets, aiding in the creation of more fair and balanced datasets.
- The analysis is backed by statistical validation, which lends a certain degree of reliabilit

Weaknesses

- The investigation is limited to a dataset with English captions, which may result in a biased assessment of geographical representation.
- While GeoProfiler is an interesting tool, it remains primarily a methodological approach that combines existing technologies.
- The potential biases inherent in the LLMs and image classification models used are not deeply explored, and the discussion on these limitations is insufficient.

Reviewer 02 · Rating 6 · Confidence 3

Strengths

(1) This study proposed a unique perspective for understanding the biases in vision-language models.
(2) Multiple experiments were conducted to analyze the biases in geographic distributions associated with the image-caption pairs.

Weaknesses

(1) The contributions of the study were not well articulated. As the authors discussed in the related work section, there have been many efforts in geographically profiling datasets. Some of them focus on visual datasets, and some of them focus on textual datasets. However, the authors claimed that the uniqueness of this study lies in the focus on vision-language datasets. It is unclear why the proposed study is different from the existing methods for profiling datasets such as MS-COCO.
(2) Th

Reviewer 03 · Rating 5 · Confidence 4

Strengths

- The research is methodologically sound and exhibits a good level of rigor. The authors systematically develop GeoProfiler, starting with an analysis of simple baselines like string-matching and Named Entity Recognition (NER), and demonstrating their limitations in this context. By employing the Mixtral-8x7B Instruct model, they achieve a high precision of 0.86 in mapping captions to countries. The application of GeoProfiler to the LAION2B-en dataset is thorough, involving the analysis of 1 mil

Weaknesses

- Unfortunately, the authors seem to disregard a large body of pre-LLM work on geotagging multimodal data [1,2,3,4]. Their baseline approaches are quite simplistic (even though generally quite effective), while there are numerous text-based approaches that perform very accurately even at the level of city prediction. As a result, I have strong doubts on the accuracy and efficiency of GeoProfiler compared to pre-LLM state-of-the-art methods [2, 4].
- While the authors have tried to extend their

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Subtitles and Audiovisual Media