Self-Supervised Multi-Modal World Model with 4D Space-Time Embedding

Lance Legel; Qin Huang; Brandon Voelker; Daniel Neamati; Patrick Alan Johnson; Favyen Bastani; Jeff Rose; James Ryan Hennessy; Robert Guralnick; Douglas Soltis; Pamela Soltis; Shaowen Wang

arXiv:2603.07039·cs.AI·March 10, 2026

TL;DR

DeepEarth introduces Earth4D, a 4D space-time positional encoder for planetary-scale modeling, and fuses it with multi-modal encoders trained by self-supervised masked reconstruction, reaching state-of-the-art performance on an ecological forecasting benchmark.

Contribution

The paper presents Earth4D, a 4D space-time positional encoder that extends 3D multi-resolution hash encoding with a time dimension, enabling efficient planetary-scale modeling.
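To make the idea of a multi-resolution hash encoding over space and time concrete, here is a minimal pure-Python sketch. It is not the Earth4D implementation: it assumes the simplest possible extension, a single 4D grid per level with 16-corner multilinear interpolation, and the paper's actual design may differ. The names `HashEncoder4D` and `hash_vertex`, all default parameters, and the convention that inputs are normalized to [0, 1] are hypothetical choices made for illustration.

```python
import itertools
import math
import random

# Per-axis multipliers for the XOR spatial hash. The first three follow the
# widely used Instant-NGP convention; the fourth (time axis) is an assumption.
PRIMES = (1, 2654435761, 805459861, 3674653429)


def hash_vertex(vertex, table_size):
    """Map an integer 4D grid vertex to a slot in a fixed-size table."""
    h = 0
    for coord, prime in zip(vertex, PRIMES):
        h ^= coord * prime
    return h % table_size


class HashEncoder4D:
    """Toy multi-resolution hash encoder over (x, y, z, t) in [0, 1]^4."""

    def __init__(self, levels=4, table_size=2**12, features=2,
                 base_res=4, growth=2.0, seed=0):
        rng = random.Random(seed)
        self.table_size = table_size
        self.features = features
        # Grid resolution grows geometrically from coarse to fine.
        self.resolutions = [int(base_res * growth**l) for l in range(levels)]
        # One table of small random feature vectors per level; in a real
        # model these entries would be trainable parameters.
        self.tables = [
            [[rng.uniform(-1e-4, 1e-4) for _ in range(features)]
             for _ in range(table_size)]
            for _ in range(levels)
        ]

    def encode(self, point):
        """Return the concatenated per-level features for one 4D point."""
        out = []
        for res, table in zip(self.resolutions, self.tables):
            scaled = [c * res for c in point]
            # Lower corner of the enclosing cell, clamped so c == 1.0 works.
            base = [min(int(math.floor(s)), res - 1) for s in scaled]
            frac = [s - b for s, b in zip(scaled, base)]
            feat = [0.0] * self.features
            # Blend the 2^4 = 16 cell corners with multilinear weights.
            for offsets in itertools.product((0, 1), repeat=4):
                weight = 1.0
                for o, f in zip(offsets, frac):
                    weight *= f if o else 1.0 - f
                vertex = [b + o for b, o in zip(base, offsets)]
                entry = table[hash_vertex(vertex, self.table_size)]
                for i in range(self.features):
                    feat[i] += weight * entry[i]
            out.extend(feat)
        return out


encoder = HashEncoder4D()
embedding = encoder.encode((0.25, 0.5, 0.1, 0.9))  # normalized x, y, z, t
print(len(embedding))  # levels * features = 8
```

The fixed table size is what keeps memory bounded as resolution grows: fine levels have far more grid vertices than table slots, so distinct vertices share entries, and training is left to sort out the resulting collisions.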

Findings

01. Earth4D achieves state-of-the-art performance on an ecological forecasting benchmark.

02. Earth4D with learnable hash probing surpasses a multi-modal foundation model pre-trained on substantially more data.

03. The encoder scales efficiently across the planet over centuries with sub-meter, sub-second precision.

Abstract

We present DeepEarth, a self-supervised multi-modal world model with Earth4D, a novel planetary-scale 4D space-time positional encoder. Earth4D extends 3D multi-resolution hash encoding to include time, efficiently scaling across the planet over centuries with sub-meter, sub-second precision. Multi-modal encoders (e.g. vision-language models) are fused with Earth4D embeddings and trained via masked reconstruction. We demonstrate Earth4D's expressive power by achieving state-of-the-art performance on an ecological forecasting benchmark. Earth4D with learnable hash probing surpasses a multi-modal foundation model pre-trained on substantially more data. Access open source code and download models at: https://github.com/legel/deepearth
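The masked reconstruction objective mentioned in the abstract can be illustrated with a small sketch. This is a generic form of the technique, not the DeepEarth training code: the helper names `mask_tokens` and `masked_mse`, the mask ratio, and the use of a fixed mask vector are all assumptions. Each token stands in for a fused vector (for example, a modality embedding combined with a space-time embedding); some tokens are hidden, and the loss is computed only at the hidden positions.

```python
import random


def mask_tokens(tokens, mask_ratio, mask_value, rng):
    """Replace a random subset of tokens with a mask vector.

    Returns the corrupted sequence and the sorted masked indices.
    """
    n = len(tokens)
    k = max(1, round(n * mask_ratio))
    masked_idx = sorted(rng.sample(range(n), k))
    hidden = set(masked_idx)
    corrupted = [
        list(mask_value) if i in hidden else list(tok)
        for i, tok in enumerate(tokens)
    ]
    return corrupted, masked_idx


def masked_mse(predictions, targets, masked_idx):
    """Mean squared error restricted to the masked positions."""
    total, count = 0.0, 0
    for i in masked_idx:
        for p, t in zip(predictions[i], targets[i]):
            total += (p - t) ** 2
            count += 1
    return total / count


# Four hypothetical fused tokens of dimension 2.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
corrupted, masked_idx = mask_tokens(tokens, 0.5, [0.0, 0.0], random.Random(0))

# A model would map `corrupted` to predictions; a perfect model recovers
# the original tokens and therefore scores zero loss.
print(masked_mse(tokens, tokens, masked_idx))  # 0.0
```

Restricting the loss to masked positions means the model gets no credit for copying visible inputs, so it has to infer hidden content from the remaining modalities and the space-time context.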

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Advanced Image and Video Retrieval Techniques · Face recognition and analysis