TL;DR
UNIGEOCLIP is a multimodal contrastive learning framework that aligns five geospatial modalities (aerial imagery, street-level views, elevation models, text, and geographic coordinates) in a single embedding space, improving performance on downstream geospatial tasks.
Contribution
It introduces all-to-all contrastive alignment across the geospatial modalities, in place of fusing them or routing them through a central pivot representation, and a scaled latitude-longitude encoder that captures multi-scale geographic structure.
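As a rough illustration of what all-to-all alignment can look like, the sketch below averages a symmetric InfoNCE loss over every unordered pair of modalities, so no single modality acts as a pivot. It is not the paper's implementation: the function names, the temperature of 0.07, the equal weighting of pairs, and the modality keys are all assumptions made for the example, and it uses plain Python lists only so that it runs without dependencies.

```python
import math
from itertools import combinations


def normalize(v):
    """Scale a (non-zero) vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches; row i of `a` is paired with row i of `b`."""
    a = [normalize(v) for v in a]
    b = [normalize(v) for v in b]
    n = len(a)
    # Cosine-similarity logits between every item in `a` and every item in `b`.
    logits = [
        [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy(rows):
        # Mean cross-entropy where the correct class for row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_sum_exp = m + math.log(sum(math.exp(z - m) for z in row))
            total += log_sum_exp - row[i]
        return total / n

    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy(logits) + cross_entropy(transposed))


def all_to_all_loss(embeddings, temperature=0.07):
    """Average symmetric InfoNCE over all unordered modality pairs (hypothetical weighting)."""
    pairs = list(combinations(sorted(embeddings), 2))
    return sum(
        info_nce(embeddings[p], embeddings[q], temperature) for p, q in pairs
    ) / len(pairs)
```

With five modalities this sums over ten pairs, whereas a pivot-based scheme would only align the four non-pivot modalities to the pivot.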
Findings
Outperforms single-modality contrastive models on geospatial tasks.
Enables seamless comparison and retrieval across multiple geospatial data types.
Demonstrates the effectiveness of holistic multimodal geospatial alignment.
Abstract
The growing availability of co-located geospatial data spanning aerial imagery, street-level views, elevation models, text, and geographic coordinates offers a unique opportunity for multimodal representation learning. We introduce UNIGEOCLIP, a massively multimodal contrastive framework to jointly align five complementary geospatial modalities in a single unified embedding space. Unlike prior approaches that fuse modalities or rely on a central pivot representation, our method performs all-to-all contrastive alignment, enabling seamless comparison, retrieval, and reasoning across arbitrary combinations of modalities. We further propose a scaled latitude-longitude encoder that improves spatial representation by capturing multi-scale geographic structure. Extensive experiments across downstream geospatial tasks demonstrate that UNIGEOCLIP consistently outperforms single-modality…
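The abstract does not spell out how the scaled latitude-longitude encoder works. One common way to expose multi-scale geographic structure to a model is to feed it sinusoidal features of the coordinates at several frequencies, and the sketch below shows that generic idea only. The function name, the choice of four scales, and the power-of-two frequencies are assumptions for illustration, not details from the paper.

```python
import math


def multiscale_latlon_features(lat_deg, lon_deg, num_scales=4):
    """Sinusoidal features of a coordinate at frequencies 1, 2, 4, ... (hypothetical design).

    Low frequencies vary slowly across the globe; high frequencies
    separate nearby locations more strongly.
    """
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    features = []
    for k in range(num_scales):
        freq = 2 ** k
        features.extend([
            math.sin(freq * lat), math.cos(freq * lat),
            math.sin(freq * lon), math.cos(freq * lon),
        ])
    return features


# Example: a 16-dimensional feature vector for a point in Paris.
paris = multiscale_latlon_features(48.8566, 2.3522)
```

Because the frequencies are integers, longitudes of -180 and 180 degrees map to the same features (up to floating-point error), which avoids an artificial seam at the antimeridian. In a learned encoder, features like these would typically be passed through a small network before entering the shared embedding space.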
