Bridging Perception and Action: A Lightweight Multimodal Meta-Planner Framework for Robust Earth Observation Agents

Jinghui Xu; Boyi Shangguan; Mengke Zhu; Hao Liu; Junhuan Jiang; Guangjun He; Pengming Feng; Shichao Jin; Bin Liang; Yongzhe Chang; Junbo Tan; Tiantian Zhang; Xueqian Wang

arXiv:2605.04777 · cs.MA · May 7, 2026

TL;DR

The paper introduces LMMP, a lightweight multimodal meta-planner for Earth Observation agents that improves planning accuracy and robustness by integrating expert knowledge and a dual-awareness mechanism.

Contribution

It presents a novel framework combining multimodal grounding, expert knowledge injection, and a two-stage training pipeline for more reliable EO planning.

Findings

01 Significantly improves tool-calling accuracy and task success rates.

02 Enhances performance across diverse EO missions and executor backbones.

03 Demonstrates robustness in dynamic, real-world EO scenarios.

Abstract

Autonomous Earth Observation (EO) agents are transitioning from passive perception to complex, multi-step task execution. However, current architectures that integrate planning and execution within a single model often struggle with combinatorial complexity and reasoning errors in dynamic EO scenarios. To resolve these challenges, we propose the Lightweight Multimodal Meta-Planner (LMMP) framework. LMMP incorporates a dual-awareness mechanism that grounds strategic plans in both multimodal image features and high-level task semantics. Crucially, we introduce a Meta Task Library to inject remote sensing expert knowledge directly into the workflow, which standardizes domain logic and ensures plans are physically feasible. We further implement a two-stage training pipeline, initializing the Meta-Planner via expert-distilled Supervised Fine-Tuning and refining it through Direct Preference…
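
As a rough illustration of the idea behind a Meta Task Library combined with dual awareness, the sketch below selects an expert-authored tool sequence only when a template matches both the task text and the image modality. It is a toy under stated assumptions, not the paper's implementation: the `MetaTask` structure, the template names, the tool names, and the keyword matching (standing in for the learned Meta-Planner) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetaTask:
    """Hypothetical expert template: an ordered tool sequence for one task type."""
    name: str
    keywords: tuple[str, ...]    # cues from the task description (task semantics)
    modalities: frozenset[str]   # image types for which the template is valid
    steps: tuple[str, ...]       # ordered tool names handed to an executor


# Hypothetical library entries; real templates would come from domain experts.
LIBRARY = (
    MetaTask(
        name="flood_mapping",
        keywords=("flood", "inundation"),
        modalities=frozenset({"sar", "optical"}),
        steps=("load_image", "water_segmentation", "area_statistics"),
    ),
    MetaTask(
        name="vegetation_health",
        keywords=("vegetation", "crop"),
        modalities=frozenset({"optical"}),
        steps=("load_image", "cloud_mask", "compute_ndvi", "area_statistics"),
    ),
)


def plan(task_text: str, image_modality: str) -> tuple[str, ...]:
    """Return the steps of the first template that fits both the text and the image."""
    text = task_text.lower()
    for task in LIBRARY:
        semantic_match = any(keyword in text for keyword in task.keywords)
        image_match = image_modality in task.modalities
        if semantic_match and image_match:
            return task.steps
    # No template fits: refuse rather than emit an infeasible plan.
    raise LookupError(f"no feasible template for {task_text!r} on {image_modality!r}")


if __name__ == "__main__":
    print(plan("Map flood extent after the storm", "sar"))
    print(plan("Assess crop vegetation health", "optical"))
```

In this toy, a vegetation request on a SAR image raises `LookupError` because the only matching template requires optical bands. That is the kind of feasibility constraint an expert template can encode, here reduced to a simple lookup.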
