RemoteShield: Enable Robust Multimodal Large Language Models for Earth Observation
Rui Min, Liang Yao, Shiyu Miao, Shengxiang Xu, Yuxuan Liu, Chuanyi Zhang, Shimin Di, Fan Liu

TL;DR
RemoteShield is a new multimodal large language model for Earth Observation that maintains consistent reasoning under realistic visual and textual input variations, improving robustness over existing models.
Contribution
It introduces a training method using preference learning over clean and perturbed data to enhance model robustness and consistency in Earth Observation tasks.
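As a rough illustration of what preference learning over clean and perturbed data can look like, the sketch below computes a DPO-style pairwise loss for one preference pair. The summary does not state which objective RemoteShield uses, so the DPO form, the `beta` value, and the pairing rule (the preferred response is the one given for the clean input, the rejected response is the degraded answer produced under perturbation) are all assumptions made for illustration.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def preference_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed temperature, not taken from the paper
) -> float:
    """DPO-style loss for one (chosen, rejected) response pair.

    Assumption: both responses are scored on the same perturbed input;
    'chosen' matches the answer given for the clean input and 'rejected'
    is the inconsistent answer produced under perturbation.
    """
    # How much more the policy prefers 'chosen' than the reference model does.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -log_sigmoid(beta * margin)


if __name__ == "__main__":
    # Hypothetical sequence log-probabilities, for illustration only.
    print(preference_loss(-4.0, -9.0, -6.0, -6.0))
```

The loss falls as the policy assigns relatively more probability to the consistent response than the reference model does, which is the general mechanism by which a pairwise objective can push a model toward the same answer on clean and perturbed inputs.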
Findings
RemoteShield outperforms baselines in robustness under multimodal perturbations.
The model maintains consistent outputs across visual degradations like clouds and fog.
It demonstrates improved performance on multiple Earth Observation tasks.
Abstract
A robust Multimodal Large Language Model (MLLM) for Earth Observation should maintain consistent interpretation and reasoning under realistic input variations. However, current Remote Sensing MLLMs fail to meet this requirement. Trained on carefully curated clean datasets, they learn brittle mappings that do not generalize to noisy conditions in operational Earth Observation. Consequently, their performance degrades when confronted with imperfect inputs in deployment. To quantify this vulnerability, we construct a realistic set of multimodal perturbations, including visual degradations such as cloud and fog cover, together with diverse human-centric textual variations ranging from colloquialisms to vague or omitted instructions. Empirical evaluations show that these perturbations significantly impair the visual-semantic reasoning capabilities of leading RS foundation models. To address…
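The kinds of perturbation named in the abstract can be sketched in a few lines. This is a toy stand-in, not the authors' perturbation pipeline: the uniform fog blend, the function names `add_fog` and `perturb_instruction`, and the specific rewritten prompts are all hypothetical choices for illustration.

```python
from typing import List

Image = List[List[int]]  # grayscale image, pixel values in 0..255


def add_fog(image: Image, density: float, fog_value: int = 255) -> Image:
    """Blend every pixel toward a uniform fog value.

    out = (1 - density) * pixel + density * fog_value
    Assumption: real cloud or fog cover is spatially varying; a uniform
    blend is the simplest possible approximation.
    """
    if not 0.0 <= density <= 1.0:
        raise ValueError("density must be in [0, 1]")
    return [
        [round((1.0 - density) * p + density * fog_value) for p in row]
        for row in image
    ]


def perturb_instruction(text: str, mode: str) -> str:
    """Return a human-style variation of an instruction (hypothetical rules)."""
    if mode == "colloquial":
        return "hey, quick one: " + text.lower()
    if mode == "vague":
        return "What about this image?"
    if mode == "omitted":
        return ""  # the instruction is dropped entirely
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    tile = [[0, 100], [200, 255]]
    print(add_fog(tile, 0.3))
    print(perturb_instruction("Count the ships in the harbor.", "colloquial"))
```

Pairing each clean sample with variants like these gives matched clean and perturbed inputs on which a model's answers can be compared for consistency.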
