TL;DR
This paper introduces UHR-Micro, a benchmark and diagnostic platform for evaluating and improving vision-language models on ultra-high-resolution Earth observation imagery. It targets the "resolution illusion": higher input resolution does not guarantee better perception of micro-scale targets.
Contribution
The paper presents UHR-Micro, a comprehensive benchmark for micro-level reasoning in Earth observation VLMs, and proposes MAP, an active perception agent that enhances micro-evidence grounding.
Findings
High-resolution VLMs often fail in spatial grounding despite detailed inputs.
Increasing model capacity does not fully resolve micro-evidence perception issues.
MAP improves micro-level perception by actively seeking and grounding evidence.
Abstract
Vision-Language Models (VLMs) increasingly operate on ultra-high-resolution (UHR) Earth observation imagery, yet they remain vulnerable to a severe scale mismatch between large-scale scene context and micro-scale targets. We refer to this empirical gap as a "resolution illusion": higher input resolution provides the appearance of richer visual detail, but does not necessarily yield reliable perception of spatially small, task-relevant evidence. To benchmark this challenge, we introduce UHR-Micro, a benchmark comprising 11,253 instructions grounded in 1,212 UHR images, designed to evaluate VLMs at the spatial limits of native Earth observation imagery. UHR-Micro spans diverse micro-target scales, context requirements, task families, and visual conditions, and provides diagnostic annotations that support controlled evaluation and fine-grained error attribution. Experiments with…
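To make the scale mismatch described in the abstract concrete, the following is a minimal arithmetic sketch, not anything taken from the paper. It assumes a hypothetical 10,000 x 10,000 pixel image, a hypothetical 20 x 20 pixel target, and a hypothetical model that downscales the longest side to 1,000 pixels. The function name `target_footprint` and all numbers are illustrative assumptions.

```python
def target_footprint(image_size, box, model_input_side):
    """Return (area_fraction, resized_box_size) for one bounding box.

    image_size: (width, height) of the full image in pixels.
    box: (x_min, y_min, x_max, y_max) of the target in pixels.
    model_input_side: longest side after downscaling (hypothetical setting).
    """
    width, height = image_size
    x_min, y_min, x_max, y_max = box
    box_w = x_max - x_min
    box_h = y_max - y_min

    # Share of the full image area covered by the target.
    area_fraction = (box_w * box_h) / (width * height)

    # Uniform downscale factor; images already small enough are left as-is.
    scale = min(1.0, model_input_side / max(width, height))

    return area_fraction, (box_w * scale, box_h * scale)


# Hypothetical numbers chosen only for illustration.
fraction, resized = target_footprint(
    image_size=(10_000, 10_000),
    box=(5_000, 5_000, 5_020, 5_020),
    model_input_side=1_000,
)
print(f"area fraction: {fraction:.2e}, size after resize: {resized}")
```

Under these assumed numbers the target covers a few millionths of the image and shrinks to roughly 2 pixels per side after downscaling, which illustrates why a large input alone need not make small, task-relevant evidence perceivable.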
