TL;DR
This paper introduces RS-HyRe-R1, a hybrid reward framework for remote sensing image understanding that mitigates perceptual inertia, enhances reasoning depth, and achieves state-of-the-art results on multiple vision-language tasks.
Contribution
It proposes a novel hybrid reward mechanism to address perceptual inertia in remote sensing vision-language models, improving reasoning and generalization.
Findings
Outperforms models with up to 7B parameters on REC, OVD, and VQA tasks.
Achieves state-of-the-art performance with only 3B parameters.
Demonstrates strong zero-shot generalization, surpassing competitors.
Abstract
Reinforcement learning (RL) post-training substantially improves remote sensing vision-language models (RS-VLMs). However, when handling complex remote sensing imagery (RSI) requiring exhaustive visual scanning, models tend to rely on localized salient cues for rapid inference. We term this RL-induced bias "perceptual inertia". Driven by reward maximization, models favor quick outcome fitting, leading to two limitations: cognitively, overreliance on specific features impedes complete evidence construction; operationally, models struggle to flexibly shift visual focus across tasks. To address this bias and encourage comprehensive visual evidence mining, we propose RS-HyRe-R1, a hybrid reward framework for RSI understanding. It introduces: (1) a spatial reasoning activation reward that enforces structured visual reasoning; (2) a perception correctness reward that provides adaptive quality…
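As a rough illustration of what a hybrid reward can look like in RL post-training, the sketch below combines a structure term with a correctness term. It is not the paper's formulation: the abstract is truncated before the rewards are fully specified, so the `<think>`/`<answer>` tag format, the use of box IoU as the correctness signal, the fixed weights, and all function names here are assumptions made for the example.

```python
import re


def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def structure_reward(response):
    """Return 1.0 if the response has a non-empty <think> block followed by
    an <answer> block, else 0.0. The tag format is an assumption."""
    pattern = r"\s*<think>\s*\S.*?</think>\s*<answer>.*?</answer>\s*"
    return 1.0 if re.fullmatch(pattern, response, flags=re.DOTALL) else 0.0


def hybrid_reward(response, pred_box, gt_box, w_structure=0.3, w_correct=0.7):
    """Weighted sum of the structure term and an IoU-based correctness term.
    The weights are illustrative placeholders, not values from the paper."""
    return (w_structure * structure_reward(response)
            + w_correct * iou(pred_box, gt_box))
```

For a grounding task such as REC, `hybrid_reward` would be called once per sampled response with the predicted and ground-truth boxes; a response that skips the reasoning block gets at most the correctness share of the reward.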
