Quantifying Geospatial in the Common Crawl Corpus
Ilya Ilyankou, Meihui Wang, Stefano Cavazzi, James Haworth

TL;DR
This paper quantifies the presence of geospatial data in the Common Crawl corpus, revealing that approximately 18.7% of documents contain geospatial information, which informs understanding of LLMs' spatial reasoning capabilities.
Contribution
It provides the first systematic estimate of geospatial content in Common Crawl, highlighting its prevalence and potential biases relevant to LLM spatial reasoning.
Findings
An estimated 18.7% of CC documents contain geospatial information
Little difference in prevalence between English- and non-English-language documents
Provides baseline for future geospatial bias studies in LLMs
Abstract
Large language models (LLMs) exhibit emerging geospatial capabilities, stemming from their pre-training on vast unlabelled text datasets that are often derived from the Common Crawl (CC) corpus. However, the geospatial content within CC remains largely unexplored, impacting our understanding of LLMs' spatial reasoning. This paper investigates the prevalence of geospatial data in recent Common Crawl releases using Gemini 1.5, a powerful language model. By analyzing a sample of documents and manually revising the results, we estimate that 18.7% of web documents in CC contain geospatial information such as coordinates and addresses. We find little difference in prevalence between English- and non-English-language documents. Our findings provide quantitative insights into the nature and extent of geospatial data in CC, and lay the groundwork for future studies of geospatial biases of LLMs.
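Because the 18.7% figure is an estimate from a sample rather than a full count, it carries sampling uncertainty. The sketch below shows one common way to attach an interval to such a proportion, the Wilson score interval. The sample size and hit count are hypothetical placeholders chosen only to reproduce a proportion of 18.7%. The abstract does not state the actual sample size, and this is not presented as the authors' method.

```python
import math


def wilson_interval(hits: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Return the Wilson score interval for a binomial proportion.

    hits: number of sampled documents labelled as containing geospatial data
    n:    total number of sampled documents
    z:    normal quantile (1.96 corresponds to roughly 95% coverage)
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= hits <= n:
        raise ValueError("hits must be between 0 and n")
    p = hits / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Hypothetical sample: 187 geospatial documents out of 1000 sampled.
# These counts are illustrative assumptions, not figures from the paper.
SAMPLE_SIZE = 1000
GEO_HITS = 187

low, high = wilson_interval(GEO_HITS, SAMPLE_SIZE)
print(f"point estimate: {GEO_HITS / SAMPLE_SIZE:.3f}")
print(f"approx. 95% interval: [{low:.3f}, {high:.3f}]")
```

The same function could be applied separately to English and non-English subsamples to check whether their intervals overlap, which is one informal way to read the "little difference" finding.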
