Self-Supervised Multi-Modal World Model with 4D Space-Time Embedding
Lance Legel, Qin Huang, Brandon Voelker, Daniel Neamati, Patrick Alan Johnson, Favyen Bastani, Jeff Rose, James Ryan Hennessy, Robert Guralnick, Douglas Soltis, Pamela Soltis, Shaowen Wang

TL;DR
DeepEarth introduces a novel 4D space-time encoder for planetary-scale modeling, integrating multi-modal data with self-supervised training to achieve state-of-the-art ecological forecasting performance.
Contribution
The paper presents Earth4D, a scalable 4D space-time positional encoder that extends 3D multi-resolution hash encoding to include time, enabling efficient planetary-scale modeling.
Findings
Earth4D achieves state-of-the-art performance on an ecological forecasting benchmark.
Earth4D with learnable hash probing surpasses a multi-modal foundation model pre-trained on substantially more data.
The encoder scales efficiently across the planet over centuries with sub-meter, sub-second precision.
Abstract
We present DeepEarth, a self-supervised multi-modal world model with Earth4D, a novel planetary-scale 4D space-time positional encoder. Earth4D extends 3D multi-resolution hash encoding to include time, efficiently scaling across the planet over centuries with sub-meter, sub-second precision. Multi-modal encoders (e.g. vision-language models) are fused with Earth4D embeddings and trained via masked reconstruction. We demonstrate Earth4D's expressive power by achieving state-of-the-art performance on an ecological forecasting benchmark. Earth4D with learnable hash probing surpasses a multi-modal foundation model pre-trained on substantially more data. Access open source code and download models at: https://github.com/legel/deepearth
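To make the idea of extending multi-resolution hash encoding with a time axis concrete, here is a minimal NumPy sketch. It is a generic illustration in the style of standard multi-resolution hash encoding, not the Earth4D implementation: the table sizes, resolutions, hash primes, coordinate normalization, and the names `normalize`, `hash_vertices`, and `encode` are all assumptions made for this example, and learnable hash probing is not modeled. Each query point (x, y, z, t) is scaled to several grid resolutions, the 16 corners of its enclosing 4D cell are hashed into a fixed-size feature table, and corner features are blended by quadrilinear interpolation.

```python
import numpy as np

# Hypothetical settings chosen for illustration; not taken from the paper.
NUM_LEVELS = 4            # number of grid resolutions
TABLE_SIZE = 2 ** 14      # entries per hash table
FEATURES_PER_LEVEL = 2    # feature width per level
BASE_RESOLUTION = 4       # coarsest grid resolution per axis
GROWTH = 2.0              # resolution multiplier between levels
PRIMES = np.array([1, 2654435761, 805459861, 3674653429], dtype=np.uint64)

rng = np.random.default_rng(0)
# One table of (in a real model, trainable) features per level.
tables = rng.uniform(-1e-4, 1e-4,
                     size=(NUM_LEVELS, TABLE_SIZE, FEATURES_PER_LEVEL))

# The 16 corner offsets of a 4D grid cell, one bit per axis.
CORNERS = np.array([[(c >> d) & 1 for d in range(4)] for c in range(16)])


def normalize(lat, lon, elev_m, t_s, elev_range=(-500.0, 9000.0),
              time_range=(0.0, 200 * 365.25 * 86400.0)):
    """Map raw coordinates to [0, 1]^4 (assumed ranges, for illustration)."""
    x = (np.asarray(lat, dtype=np.float64) + 90.0) / 180.0
    y = (np.asarray(lon, dtype=np.float64) + 180.0) / 360.0
    z = (np.asarray(elev_m, dtype=np.float64) - elev_range[0]) / (
        elev_range[1] - elev_range[0])
    t = (np.asarray(t_s, dtype=np.float64) - time_range[0]) / (
        time_range[1] - time_range[0])
    return np.stack([x, y, z, t], axis=-1)


def hash_vertices(vertices):
    """XOR-of-primes hash from integer 4D vertices to table indices."""
    v = vertices.astype(np.uint64)
    h = v[..., 0] * PRIMES[0]
    for d in range(1, 4):
        h = h ^ (v[..., d] * PRIMES[d])  # uint64 arithmetic wraps on overflow
    return (h % np.uint64(TABLE_SIZE)).astype(np.int64)


def encode(points):
    """Encode points in [0, 1]^4 into concatenated multi-resolution features."""
    points = np.atleast_2d(np.asarray(points, dtype=np.float64))
    features = []
    for level in range(NUM_LEVELS):
        res = int(BASE_RESOLUTION * GROWTH ** level)
        scaled = points * res
        base = np.floor(scaled).astype(np.int64)   # lower cell corner
        frac = scaled - base                       # position inside the cell
        level_out = np.zeros((points.shape[0], FEATURES_PER_LEVEL))
        for corner in CORNERS:
            # Quadrilinear weight: product of per-axis linear weights.
            w = np.prod(np.where(corner == 1, frac, 1.0 - frac), axis=1)
            idx = hash_vertices(base + corner)
            level_out += w[:, None] * tables[level, idx]
        features.append(level_out)
    return np.concatenate(features, axis=1)


# Usage: two observations given as latitude, longitude, elevation (m), time (s).
coords = normalize([29.65, -33.9], [-82.32, 18.4], [54.0, 1000.0],
                   [1.0e9, 2.5e9])
embedding = encode(coords)  # shape: (2, NUM_LEVELS * FEATURES_PER_LEVEL)
```

Because the table size is fixed per level, memory does not grow with the extent of the space-time domain; hash collisions at fine resolutions are the price, and in a trained model they are typically resolved by gradient updates favoring the points that matter most.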
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Image and Video Retrieval Techniques · Face recognition and analysis
