GeoViSTA: Geospatial Vision-Tabular Transformer for Multimodal Environment Representation
Yuhao Liu, Sadeer Al-Kindi, Ashok Veeraraghavan, Guha Balakrishnan

TL;DR
GeoViSTA is a vision-tabular transformer that uses self-supervised learning to embed geospatial imagery and tabular socioeconomic data jointly, improving predictions of environmental and social outcomes.
Contribution
GeoViSTA proposes a bilateral cross-attention architecture with geography-aware alignment that jointly embeds geospatial imagery and tabular data.
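To make the idea of bilateral cross-attention with geography-aware alignment concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names (`cross_attend`, `bilateral_cross_attention`), the distance-based additive bias, and the `length_scale` parameter are all hypothetical, and the learned query/key/value projections that a real transformer layer would use are omitted for brevity. Image patches attend to census-tract tokens and tract tokens attend back to patches, with each attention score penalized by the distance between the patch center and the tract centroid.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(queries, keys_values, bias):
    # Scaled dot-product attention with an additive bias on the scores.
    # Assumption: no learned projections; tokens serve as Q, K, and V directly.
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d) + bias
    weights = softmax(scores, axis=-1)
    return weights @ keys_values, weights


def bilateral_cross_attention(patch_tokens, tract_tokens,
                              patch_xy, tract_xy, length_scale=1.0):
    # Hypothetical geography-aware bias: negative Euclidean distance between
    # each patch center and each tract centroid, shape (num_patches, num_tracts).
    diff = patch_xy[:, None, :] - tract_xy[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    bias = -dist / length_scale

    # Direction 1: image patches query the tabular (tract) tokens.
    patch_update, w_patch_to_tract = cross_attend(patch_tokens, tract_tokens, bias)
    # Direction 2: tract tokens query the image patches (transposed bias).
    tract_update, w_tract_to_patch = cross_attend(tract_tokens, patch_tokens, bias.T)

    # Residual connections keep each modality's own information.
    return (patch_tokens + patch_update,
            tract_tokens + tract_update,
            w_patch_to_tract,
            w_tract_to_patch)


# Toy usage: 4 image patches and 3 tracts, each with an 8-dimensional token.
rng = np.random.default_rng(0)
patches = rng.normal(size=(4, 8))
tracts = rng.normal(size=(3, 8))
patch_xy = rng.uniform(size=(4, 2))
tract_xy = rng.uniform(size=(3, 2))
new_patches, new_tracts, w_pt, w_tp = bilateral_cross_attention(
    patches, tracts, patch_xy, tract_xy
)
```

In this sketch the bias only encourages nearby patch-tract pairs to attend to each other; a smaller `length_scale` makes the attention more local. The actual alignment mechanism in GeoViSTA may differ.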
Findings
Improves linear probing performance on downstream geospatial tasks.
Outperforms baselines in predicting disease mortality and fire hazard.
Demonstrates the benefit of joint physical and socioeconomic environment modeling.
Abstract
Large-scale pretraining on Earth observation imagery has yielded powerful representations of the natural and built environment. However, most existing geospatial foundation models do not directly model the structured socioeconomic covariates typically stored in tabular form. This modality gap limits their ability to capture the complete total environment, which is critical for reasoning about complex environmental, social, and health-related outcomes. In this work, we propose GeoViSTA (Geospatial Vision-Tabular Transformer), a vision-tabular architecture that learns unified geospatial embeddings from co-registered gridded imagery and tabular data. GeoViSTA utilizes bilateral cross-attention to exchange spatial and semantic information across modalities, guided by a geography-aware attention mechanism that aligns continuous image patches with irregular census-tract tokens. We train…
