Quantifying Geospatial in the Common Crawl Corpus

Ilya Ilyankou; Meihui Wang; Stefano Cavazzi; James Haworth

arXiv:2406.04952·cs.CL·May 7, 2026·1 cites

TL;DR

This paper estimates that approximately 18.7% of documents in the Common Crawl corpus contain geospatial information, giving a baseline for understanding LLMs' spatial reasoning capabilities.

Contribution

It provides a quantitative estimate of geospatial content in recent Common Crawl releases, a corpus whose geospatial content the authors describe as largely unexplored, and lays the groundwork for studying geospatial biases in LLMs.

Findings

01

18.7% of CC documents contain geospatial info

02

Little difference in prevalence between English- and non-English-language documents

03

Provides baseline for future geospatial bias studies in LLMs

Abstract

Large language models (LLMs) exhibit emerging geospatial capabilities, stemming from their pre-training on vast unlabelled text datasets that are often derived from the Common Crawl (CC) corpus. However, the geospatial content within CC remains largely unexplored, impacting our understanding of LLMs' spatial reasoning. This paper investigates the prevalence of geospatial data in recent Common Crawl releases using Gemini 1.5, a powerful language model. By analyzing a sample of documents and manually revising the results, we estimate that 18.7% of web documents in CC contain geospatial information such as coordinates and addresses. We find little difference in prevalence between English- and non-English-language documents. Our findings provide quantitative insights into the nature and extent of geospatial data in CC, and lay the groundwork for future studies of geospatial biases of LLMs.
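
The abstract describes estimating prevalence from a labelled sample of documents. As a rough illustration of how such an estimate and its uncertainty might be computed, the sketch below uses a Wilson score interval. The sample size, the labels, and the function name `wilson_interval` are assumptions made up for this example. They are not the paper's data or method, and the abstract does not say which interval, if any, the authors used.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Hypothetical labels, NOT the paper's sample:
# True means a document was judged to contain geospatial information.
labels = [True] * 187 + [False] * 813

positives = sum(labels)
p_hat = positives / len(labels)
low, high = wilson_interval(positives, len(labels))

print(f"estimated prevalence: {p_hat:.3f}")
print(f"approximate 95% interval: ({low:.3f}, {high:.3f})")
```

The Wilson interval is chosen here only because it behaves better than the plain normal approximation for small samples or proportions near 0 or 1. The same comparison could be run separately on English and non-English subsets to see whether their intervals overlap.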
