Asking like Socrates: Socrates helps VLMs understand remote sensing images
Run Shao, Ziyu Li, Zhaoyang Zhang, Linrui Xu, Xinran He, Hongyuan Yuan, Bolei He, Yongxing Dai, Yiming Yan, Yijun Chen, Wang Guo, Haifeng Li

TL;DR
This paper introduces RS-EoT, an iterative, evidence-seeking reasoning paradigm for remote sensing vision-language tasks, addressing pseudo reasoning caused by the Glance Effect and achieving state-of-the-art results.
Contribution
It proposes SocraticAgent, a self-play multi-agent system that synthesizes reasoning traces, together with a two-stage reinforcement learning strategy, to instill genuine evidence-based reasoning in remote sensing models.
Findings
RS-EoT achieves state-of-the-art performance on multiple benchmarks.
The approach mitigates the Glance Effect, enabling more accurate reasoning.
Analysis confirms that the model reasons in iterative cycles of reasoning and visual inspection.
Abstract
Recent multimodal reasoning models, inspired by DeepSeek-R1, have significantly advanced vision-language systems. However, in remote sensing (RS) tasks, we observe widespread pseudo reasoning: models narrate the process of reasoning rather than genuinely reason toward the correct answer based on visual evidence. We attribute this to the Glance Effect, where a single, coarse perception of large-scale RS imagery results in incomplete understanding and reasoning based on linguistic self-consistency instead of visual evidence. To address this, we propose RS-EoT (Remote Sensing Evidence-of-Thought), a language-driven, iterative visual evidence-seeking paradigm. To instill this paradigm, we propose SocraticAgent, a self-play multi-agent system that synthesizes reasoning traces via alternating cycles of reasoning and visual inspection. To enhance and generalize these patterns, we propose a…
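The alternating cycle of reasoning and visual inspection described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Trace`, `evidence_seeking_loop`, `toy_ask`, `toy_inspect`, and `SCENE` are hypothetical, and a small dictionary stands in for the image and for the vision-language model that would answer questions about it. The sketch only shows the control flow, in which a reasoner either requests more evidence or commits to an answer, and each request is resolved by a fresh look at the scene.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Trace:
    """Record of one evidence-seeking episode (hypothetical structure)."""
    steps: List[Tuple[str, str]] = field(default_factory=list)  # (question, evidence)
    answer: Optional[str] = None


def evidence_seeking_loop(
    ask: Callable[[List[Tuple[str, str]]], Tuple[str, str]],
    inspect: Callable[[str], str],
    max_rounds: int = 4,
) -> Trace:
    """Alternate between reasoning (ask) and visual inspection (inspect).

    `ask` sees the evidence gathered so far and returns either
    ("question", text) to request more evidence or ("answer", text) to stop.
    """
    trace = Trace()
    for _ in range(max_rounds):
        kind, text = ask(trace.steps)
        if kind == "answer":
            trace.answer = text
            break
        # Resolve the question by looking at the scene again.
        trace.steps.append((text, inspect(text)))
    return trace


# Toy stand-in for an image: questions mapped to what inspection would reveal.
SCENE = {
    "Is there a runway?": "yes, one long paved strip",
    "Are aircraft visible?": "yes, three near a terminal",
}


def toy_inspect(question: str) -> str:
    # Assumption: a real system would query a perception model here.
    return SCENE.get(question, "not visible")


def toy_ask(steps: List[Tuple[str, str]]) -> Tuple[str, str]:
    # Ask every question not yet answered, then commit to an answer.
    asked = {question for question, _ in steps}
    pending = [q for q in SCENE if q not in asked]
    if pending:
        return ("question", pending[0])
    return ("answer", "airport")


trace = evidence_seeking_loop(toy_ask, toy_inspect)
for question, evidence in trace.steps:
    print(f"Q: {question}\nE: {evidence}")
print("Answer:", trace.answer)
```

In this toy setup the answer is produced only after both questions have been checked against the scene. If `max_rounds` is too small for that, `trace.answer` stays `None`, so the loop never answers from a single glance.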
