World Properties without World Models: Recovering Spatial and Temporal Structure from Co-occurrence Statistics in Static Word Embeddings
Elan Barenholtz

TL;DR
Static word embeddings such as GloVe and Word2Vec carry substantial geographic information and weaker but reliable temporal information, suggesting that word co-occurrence statistics alone preserve much of this world structure without a learned world model.
Contribution
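This study applies ridge regression probes to simple static embeddings and recovers substantial geographic signal and weaker temporal signal, challenging the inference that linear recoverability of such variables from LLM hidden states is evidence for world-like internal representations.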
Findings
High recoverability of city coordinates from embeddings (held-out R^2 of 0.71-0.87)
Moderate recoverability of historical birth years (held-out R^2 of 0.48-0.52)
Signals depend strongly on interpretable lexical gradients, especially country names and climate-related vocabulary
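One common way to test whether a probe depends on a particular lexical subspace is to project that subspace out of every embedding and then re-run the probe. The sketch below illustrates that idea only. The summary does not describe the paper's exact ablation procedure, so the function name `ablate_subspace` and the choice of orthogonal projection are assumptions made for illustration.

```python
import numpy as np

def ablate_subspace(X, directions):
    """Remove from each row of X its component in span(directions).

    X:          (n, d) array of embeddings to ablate.
    directions: (k, d) array of vectors spanning the subspace to remove,
                e.g. the embeddings of a hypothetical list of country names.
                Assumed to be linearly independent with k < d.
    """
    # Reduced QR gives an orthonormal basis Q of shape (d, k) for the subspace.
    Q, _ = np.linalg.qr(directions.T)
    # Subtract the orthogonal projection of X onto that subspace.
    return X - (X @ Q) @ Q.T
```

If probe accuracy drops sharply after ablating, say, country-name directions but not after ablating an equal number of random directions, that would suggest the probe relies on that vocabulary.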
Abstract
Recent work interprets the linear recoverability of geographic and temporal variables from large language model (LLM) hidden states as evidence for world-like internal representations. We test a simpler possibility: that much of the relevant structure is already latent in text itself. Applying the same class of ridge regression probes to static co-occurrence-based embeddings (GloVe and Word2Vec), we find substantial recoverable geographic signal and weaker but reliable temporal signal, with held-out R^2 values of 0.71-0.87 for city coordinates and 0.48-0.52 for historical birth years. Semantic-neighbor analyses and targeted subspace ablations show that these signals depend strongly on interpretable lexical gradients, especially country names and climate-related vocabulary. These findings suggest that ordinary word co-occurrence preserves richer spatial, temporal, and environmental…
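The probing setup described in the abstract can be sketched as a ridge regression from embedding vectors to target variables, scored by R^2 on held-out items. The code below is a minimal illustration under stated assumptions, not the paper's pipeline: it uses synthetic vectors in place of GloVe or Word2Vec embeddings, and the regularization strength `alpha`, the 80/20 split, and all function names are hypothetical choices.

```python
import numpy as np

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean(axis=0)
    Xc, yc = X - x_mean, y - y_mean
    d = X.shape[1]
    # Solve (Xc^T Xc + alpha * I) W = Xc^T yc for the weight matrix W.
    W = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(d), Xc.T @ yc)
    b = y_mean - x_mean @ W
    return W, b

def r2_score(y_true, y_pred):
    """Coefficient of determination, computed per target column."""
    ss_res = ((y_true - y_pred) ** 2).sum(axis=0)
    ss_tot = ((y_true - y_true.mean(axis=0)) ** 2).sum(axis=0)
    return 1.0 - ss_res / ss_tot

def make_synthetic(n=500, d=50, noise=0.5, seed=0):
    """Synthetic stand-in: random 'embeddings' and two linearly encoded targets
    (playing the role of latitude and longitude)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    W_true = rng.normal(size=(d, 2))
    y = X @ W_true + noise * rng.normal(size=(n, 2))
    return X, y

def heldout_r2(X, y, alpha=1.0, test_frac=0.2, seed=0):
    """Fit the probe on a random training split and score it on the rest."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(len(X) * test_frac)
    test, train = idx[:n_test], idx[n_test:]
    W, b = fit_ridge(X[train], y[train], alpha)
    return r2_score(y[test], X[test] @ W + b)
```

With real data, `X` would hold one embedding per city or person name and `y` the corresponding coordinates or birth year. Scoring on held-out items matters because a ridge probe can fit the training set well even when the embeddings carry no generalizable signal.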
Taxonomy
Topics: Language and cultural evolution · Computational and Text Analysis Methods · Complex Systems and Time Series Analysis
