EarthMind: Leveraging Cross-Sensor Data for Advanced Earth Observation Interpretation with a Unified Multimodal LLM
Yan Shu, Bin Ren, Zhitong Xiong, Danda Pani Paudel, Luc Van Gool, Begüm Demir, Nicu Sebe, Paolo Rota

TL;DR
EarthMind introduces a unified multimodal large language model that effectively integrates heterogeneous sensor data, such as optical and SAR images, for advanced Earth observation interpretation, surpassing existing models in accuracy and versatility.
Contribution
The paper presents a novel hierarchical cross-modal attention mechanism and a curated dataset, enabling cross-sensor learning and improving multimodal Earth observation analysis.
Findings
EarthMind achieves state-of-the-art performance on EO benchmarks.
The hierarchical attention effectively fuses multi-sensor data.
The curated dataset supports diverse perception and reasoning tasks.
Abstract
Earth Observation (EO) data analysis is vital for monitoring environmental and human dynamics. Recent Multimodal Large Language Models (MLLMs) show potential in EO understanding but remain restricted to single-sensor inputs, overlooking the complementarity across heterogeneous modalities. We propose EarthMind, a unified vision-language framework that handles both single- and cross-sensor inputs via an innovative hierarchical cross-modal attention (i.e., HCA) design. Specifically, HCA hierarchically captures visual relationships across sensors and aligns them with language queries, enabling adaptive fusion of optical and Synthetic Aperture Radar (SAR) features. To support cross-sensor learning, we curate FusionEO, a 30K-pair dataset with diverse annotations, and establish EarthMind-Bench, a 2,841-pair benchmark with expert annotations for perception and reasoning tasks. Extensive…
Peer Reviews
Decision·Submitted to ICLR 2026
The paper addresses a relevant problem in multimodal Earth observation by exploring adaptive fusion of optical and SAR data. The proposed HCA (hierarchical cross-modal attention) mechanism and FusionEO benchmark are reasonably designed and provide a useful reference for multimodal learning in remote sensing. The work is generally well-motivated, and the experimental setup is clear and systematic. The introduction of the MAS (Modality Attention Score) offers a straightforward way to interpret attention alloc
(1) The authors emphasize the contribution of multi-sensor data, even specifically noting in the Introduction (line 44) that Sentinel-2 can provide high-resolution multispectral imagery. In the Related Work section (lines 135–136), they also highlight that *EarthDial* (CVPR 2025) utilizes multiple modalities, including multispectral, hyperspectral, and synthetic aperture radar (SAR) data. These details suggest that the authors are well aware of the importance of spectral sensor data. However, th
1. Use a single LLM to unify single/multi-sensor inputs and multi-granularity tasks, and integrate segmentation seamlessly into the inference pipeline via the [SEG] token. 2. Define the MAS metric to quantify modality attention shares; empirically show that naive concatenation is biased toward the optical modality, and use HCA for targeted debiasing to achieve more balanced and efficient multimodal fusion. 3. With only 4B parameters, it is strongly competitive on public benchmarks such as AID, UC
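The idea of a modality attention share can be illustrated with a toy calculation. The sketch below is not the paper's MAS definition, which this page does not give; it assumes that a share is simply the fraction of total attention mass landing on each modality's tokens. The function name `modality_attention_share` and the toy weights are hypothetical.

```python
from typing import Dict, Sequence


def modality_attention_share(
    attn_weights: Sequence[float],
    modality_labels: Sequence[str],
) -> Dict[str, float]:
    """Fraction of total attention mass assigned to each modality.

    ASSUMPTION: this is an illustrative definition, not the paper's MAS.
    attn_weights[i] is the attention a query places on visual token i,
    and modality_labels[i] names the sensor that token came from.
    """
    if len(attn_weights) != len(modality_labels):
        raise ValueError("weights and labels must have the same length")

    # Accumulate attention mass per modality.
    totals: Dict[str, float] = {}
    for weight, modality in zip(attn_weights, modality_labels):
        totals[modality] = totals.get(modality, 0.0) + weight

    total_mass = sum(totals.values())
    if total_mass == 0:
        raise ValueError("total attention mass is zero")

    # Normalize so the shares sum to 1.
    return {modality: mass / total_mass for modality, mass in totals.items()}


# Toy example (made-up numbers): three optical tokens, two SAR tokens.
weights = [0.30, 0.25, 0.20, 0.15, 0.10]
labels = ["optical", "optical", "optical", "sar", "sar"]
shares = modality_attention_share(weights, labels)
print(shares)  # optical is about 0.75, sar about 0.25
```

Under this assumed definition, a share far from an even split (as in the toy numbers) would indicate the kind of optical bias the review attributes to naive concatenation.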
1. Evaluating open-ended tasks relies on GPT-4 as a judge, which inevitably introduces scoring bias and prompt sensitivity in reproduction. 2. The pixel-level part of EarthMind-Bench is mainly referring segmentation with only 438 samples, so coverage of fine-grained pixel tasks is limited. 3. Training heavily depends on general natural-image corpora with EO-domain adaptation afterward; this “general-first, adapt-later” pipeline may leave residual cross-domain gaps. 4. Segmentation is triggere
- Introduces a bias for modalities to focus more attention on the SAR data. This approach shows strong generalization for both modalities. - The model achieves good results compared to other MLLMs on a wide range of tasks. - The authors provide a new benchmark dataset that evaluates multi-sensor tasks in EO scenarios, which is very interesting.
- The architecture is not well described, particularly the mask decoder and the handling of data input. - Data and pre-training procedure lack detailed explanation and illustrative examples. - The evaluation does not represent a true zero-shot setting in all experiments: Appendix G mentions that BigEarthNet (BEN), So2Sat-LCZ42, and other EO datasets are used in pre-training, while these datasets are also used for evaluation and in EarthMind-Bench. Even with sampling from different splits can in
Taxonomy
Topics: Advanced Computational Techniques and Applications · Geological Modeling and Analysis
