RemoteShield: Enable Robust Multimodal Large Language Models for Earth Observation

Rui Min; Liang Yao; Shiyu Miao; Shengxiang Xu; Yuxuan Liu; Chuanyi Zhang; Shimin Di; Fan Liu

arXiv:2604.17243 · cs.CV · April 21, 2026

TL;DR

RemoteShield is a new multimodal large language model for Earth Observation that maintains consistent reasoning under realistic visual and textual input variations, improving robustness over existing models.

Contribution

The paper introduces a training method that applies preference learning over clean and perturbed inputs to improve model robustness and consistency on Earth Observation tasks.
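To make the idea of preference learning concrete, here is a minimal sketch of a pairwise preference loss. It assumes a DPO-style objective, which this summary does not confirm as the paper's actual loss. The function name `preference_loss`, its parameters, and the `beta` value are all hypothetical and chosen only for illustration.

```python
import math


def preference_loss(logp_chosen, logp_rejected,
                    ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO-style loss for one preference pair.

    ASSUMPTION: a DPO-style objective is used here purely as an example of
    preference learning; the paper's actual objective may differ.
    Inputs are sequence log-probabilities under the trained policy and
    under a frozen reference model.
    """
    # How much more the policy prefers the chosen response than the
    # reference model does, scaled by beta.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) == softplus(-margin), computed stably.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
```

In a setup like the one described, the preferred response in a pair might be the answer that stays correct on both the clean and the perturbed input, and the rejected response an answer that changes under perturbation. That pairing is an assumption, not a detail stated here.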

Findings

01

RemoteShield outperforms baselines in robustness under multimodal perturbations.

02

The model maintains consistent outputs across visual degradations like clouds and fog.

03

It demonstrates improved performance on multiple Earth Observation tasks.

Abstract

A robust Multimodal Large Language Model (MLLM) for Earth Observation should maintain consistent interpretation and reasoning under realistic input variations. However, current Remote Sensing MLLMs fail to meet this requirement. Trained on carefully curated clean datasets, they learn brittle mappings that do not generalize to noisy conditions in operational Earth Observation. Consequently, their performance degrades when confronted with imperfect inputs in deployment. To quantify this vulnerability, we construct a realistic set of multimodal perturbations, including visual degradations such as cloud and fog cover, together with diverse human-centric textual variations ranging from colloquialisms to vague or omitted instructions. Empirical evaluations show that these perturbations significantly impair the visual-semantic reasoning capabilities of leading RS foundation models. To address…
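As an illustration of what a visual degradation such as fog might look like in code, the sketch below blends a grayscale image toward a bright "airlight" value using the standard uniform haze model, out = t * pixel + (1 - t) * airlight. This is an assumed, simplified stand-in, not the paper's perturbation pipeline. The name `add_uniform_haze` and its defaults are hypothetical.

```python
def add_uniform_haze(image, transmission=0.6, airlight=255):
    """Apply spatially uniform haze to a grayscale image.

    ASSUMPTION: a simple uniform haze model used for illustration only;
    the perturbations built in the paper may be generated differently.
    `image` is a list of rows of pixel intensities in [0, 255].
    `transmission` in [0, 1]: 1.0 leaves the image unchanged, 0.0 is full fog.
    """
    if not 0.0 <= transmission <= 1.0:
        raise ValueError("transmission must lie in [0, 1]")
    return [
        [round(transmission * p + (1.0 - transmission) * airlight) for p in row]
        for row in image
    ]
```

Lower `transmission` values simulate denser fog. A robustness evaluation in this spirit would compare a model's answers on the original and the hazed image.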
