Towards Knowledge Guided Pretraining Approaches for Multimodal Foundation Models: Applications in Remote Sensing

Praveen Ravirathinam; Ajitesh Parthasarathy; Ankush Khandelwal; Rahul Ghosh; Vipin Kumar

arXiv:2407.19660 · cs.CV · March 30, 2026

TL;DR

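This paper introduces Knowledge Guided Variable-Step Forecasting (KG-VSF), a knowledge-guided pretraining method for multimodal foundation models in remote sensing. By conditioning forecasts of response variables (e.g., satellite imagery) on driver variables (e.g., weather), it captures causal relationships between them and improves downstream task performance.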

Contribution

The paper proposes a novel pretraining task that frames forecasting as conditional generation, in which driver variables inform the prediction of response variables, thereby integrating causal knowledge into foundation models for remote sensing.
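The pretraining task can be pictured as building (input, target) pairs: the model sees the response history up to a time t together with the driver values for the next k steps, and must generate the response at t + k. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation. Scalars stand in for satellite image frames and weather grids, the step k is assumed to be drawn uniformly from 1 to a maximum horizon, and the names `ForecastSample`, `make_variable_step_sample`, `context_len`, and `max_step` are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ForecastSample:
    """One hypothetical pretraining example (scalars stand in for images)."""
    context: List[float]  # response observations up to and including anchor t
    drivers: List[float]  # driver values for steps t+1 .. t+k (e.g. weather)
    step: int             # forecast step k, sampled per example (assumption)
    target: float         # response at t+k that the model must generate


def make_variable_step_sample(response: Sequence[float],
                              drivers: Sequence[float],
                              context_len: int,
                              max_step: int,
                              rng: random.Random) -> ForecastSample:
    """Draw one driver-conditioned, variable-step forecasting example.

    Hypothetical helper: the paper's actual sampling scheme, context length,
    and step range are not specified here.
    """
    if len(response) != len(drivers):
        raise ValueError("response and drivers must be time-aligned")
    if context_len < 1 or max_step < 1:
        raise ValueError("context_len and max_step must be positive")
    if len(response) < context_len + max_step:
        raise ValueError("series too short for requested context and horizon")

    # Anchor t leaves room for context_len steps before it (inclusive)
    # and up to max_step steps after it.
    t = rng.randint(context_len - 1, len(response) - 1 - max_step)
    # Variable forecast step: a different horizon for each sample.
    k = rng.randint(1, max_step)
    return ForecastSample(
        context=list(response[t - context_len + 1 : t + 1]),
        drivers=list(drivers[t + 1 : t + k + 1]),
        step=k,
        target=response[t + k],
    )


# Usage on a toy, time-aligned pair of series.
rng = random.Random(0)
response_series = [float(i) for i in range(20)]
driver_series = [10.0 * i for i in range(20)]
sample = make_variable_step_sample(response_series, driver_series,
                                   context_len=4, max_step=5, rng=rng)
print(sample)
```

Because k changes from sample to sample, the number of driver steps the model must integrate also changes; this is one plausible reading of "variable-step" in the method's name, not a confirmed detail of the paper.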

Findings

1. Pretraining with KG-VSF yields embeddings that improve performance on downstream tasks.
2. It improves performance on crop mapping, soil moisture estimation, and image forecasting.
3. It outperforms standard pretraining approaches on causality-sensitive tasks.

Abstract

Self-supervised learning has emerged as a powerful paradigm for pretraining foundation models using large-scale data. Existing pretraining approaches predominantly rely on masked reconstruction or next-token prediction strategies and demonstrate strong performance across various downstream tasks, including geoscience applications. However, these approaches do not fully capture knowledge of the causal interplay between different geospatial and environmental variables. To address this limitation, we propose Knowledge Guided Variable-Step Forecasting (KG-VSF), a novel pretraining task that models forecasting as a conditional generation task, where driver variables (e.g., weather) inform the prediction of response variables (e.g., satellite imagery). We demonstrate that pretraining in this fashion yields strong embeddings that improve performance when fine-tuned on downstream tasks…
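A toy numerical example can show why conditioning on drivers matters. In the hypothetical system below, a response y (standing in for a per-pixel satellite signal) evolves as y[t+1] = 0.8 * y[t] + 0.5 * d[t+1] under a driver d (standing in for weather). A forecaster fitted only on past responses cannot account for the driver's contribution, whereas one conditioned on the driver recovers the dynamics. This is an illustrative least-squares sketch, not KG-VSF itself; the functions `simulate`, `fit_unconditional`, and `fit_conditional` and all coefficients are assumptions.

```python
import math
from typing import List, Tuple


def simulate(n: int, a: float = 0.8, b: float = 0.5) -> Tuple[List[float], List[float]]:
    """Hypothetical driver/response system: y[t+1] = a*y[t] + b*d[t+1]."""
    # Deterministic "weather-like" driver built from two sinusoids.
    d = [math.sin(0.3 * t) + 0.5 * math.cos(1.1 * t) for t in range(n)]
    y = [0.0] * n
    for t in range(n - 1):
        y[t + 1] = a * y[t] + b * d[t + 1]
    return y, d


def fit_unconditional(y: List[float]) -> Tuple[float, float]:
    """Least-squares fit of y[t+1] ~ a*y[t], ignoring the driver."""
    idx = range(len(y) - 1)
    a = sum(y[t] * y[t + 1] for t in idx) / sum(y[t] ** 2 for t in idx)
    mse = sum((y[t + 1] - a * y[t]) ** 2 for t in idx) / len(idx)
    return a, mse


def fit_conditional(y: List[float], d: List[float]) -> Tuple[float, float, float]:
    """Least-squares fit of y[t+1] ~ a*y[t] + b*d[t+1] (driver-conditioned)."""
    idx = range(len(y) - 1)
    # Entries of the 2x2 normal equations [[syy, syd], [syd, sdd]] @ [a, b] = rhs.
    syy = sum(y[t] ** 2 for t in idx)
    sdd = sum(d[t + 1] ** 2 for t in idx)
    syd = sum(y[t] * d[t + 1] for t in idx)
    rhs_y = sum(y[t + 1] * y[t] for t in idx)
    rhs_d = sum(y[t + 1] * d[t + 1] for t in idx)
    # Solve by Cramer's rule.
    det = syy * sdd - syd ** 2
    a = (rhs_y * sdd - rhs_d * syd) / det
    b = (rhs_d * syy - rhs_y * syd) / det
    mse = sum((y[t + 1] - a * y[t] - b * d[t + 1]) ** 2 for t in idx) / len(idx)
    return a, b, mse


y, d = simulate(200)
a_u, mse_u = fit_unconditional(y)
a_c, b_c, mse_c = fit_conditional(y, d)
print(f"driver-free fit:    a={a_u:.3f}, MSE={mse_u:.2e}")
print(f"driver-conditioned: a={a_c:.3f}, b={b_c:.3f}, MSE={mse_c:.2e}")
```

In this noise-free setting the conditioned fit should recover a = 0.8 and b = 0.5 up to floating-point error, while the driver-free fit leaves residual error. Real remote sensing data is noisy and nonlinear, so this only illustrates the intuition behind driver-conditioned pretraining.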
