TL;DR
GeoVista introduces a planning-driven active perception framework for ultra-high-resolution remote sensing images. It plans exploration globally, inspects multiple candidate regions, and aggregates the evidence found across them.
Contribution
It proposes a novel global planning and local verification approach, along with a new trajectory corpus, for more effective remote sensing image interpretation.
Findings
Achieves state-of-the-art performance on multiple benchmarks.
Effectively verifies multiple candidate regions through branch-wise inspection.
Maintains explicit evidence state for cross-region aggregation and de-duplication.
Abstract
Interpreting ultra-high-resolution (UHR) remote sensing images requires models to search for sparse and tiny visual evidence across large-scale scenes. Existing remote sensing vision-language models can inspect local regions with zooming and cropping tools, but most exploration strategies follow either a one-shot focus or a single sequential trajectory. Such single-path exploration can lose global context, leave scattered regions unvisited, and revisit or count the same evidence multiple times. To address these limitations, we propose GeoVista, a planning-driven active perception framework for UHR remote sensing interpretation. Instead of committing to one zooming path, GeoVista first builds a global exploration plan, then verifies multiple candidate regions through branch-wise local inspection, while maintaining an explicit evidence state for cross-region aggregation and de-duplication. To enable this…
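The plan, inspect, and aggregate loop described in the abstract can be illustrated with a toy sketch. This is not GeoVista's implementation: the overlapping-tile planner, the box-containment "inspector", the IoU-based de-duplication rule, and every name below (`EvidenceState`, `plan_regions`, `inspect`, `explore`) are assumptions made purely for illustration. The sketch only shows why a shared evidence state matters when several regions are inspected independently.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in full-image pixels


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


@dataclass
class EvidenceState:
    """Hypothetical evidence store shared by every inspection branch."""
    boxes: List[Box] = field(default_factory=list)
    iou_threshold: float = 0.5  # assumed de-duplication rule

    def add(self, box: Box) -> bool:
        """Store a detection unless it overlaps one that is already stored."""
        if any(iou(box, seen) >= self.iou_threshold for seen in self.boxes):
            return False
        self.boxes.append(box)
        return True


def plan_regions(width: int, height: int, tile: int, overlap: int) -> List[Box]:
    """Stand-in for a global plan: overlapping tiles that cover the image."""
    step = tile - overlap
    regions = []
    for y0 in range(0, max(height - overlap, 1), step):
        for x0 in range(0, max(width - overlap, 1), step):
            regions.append((x0, y0, min(x0 + tile, width), min(y0 + tile, height)))
    return regions


def inspect(region: Box, objects: List[Box]) -> List[Box]:
    """Toy local inspector: report the objects that lie fully inside the region."""
    x0, y0, x1, y1 = region
    return [o for o in objects
            if o[0] >= x0 and o[1] >= y0 and o[2] <= x1 and o[3] <= y1]


def explore(width: int, height: int, objects: List[Box],
            tile: int = 120, overlap: int = 40) -> Tuple[int, int]:
    """Inspect every planned region; return (raw detections, unique detections)."""
    state = EvidenceState()
    raw = 0
    for region in plan_regions(width, height, tile, overlap):
        for box in inspect(region, objects):
            raw += 1
            state.add(box)  # duplicates seen from another branch are dropped
    return raw, len(state.boxes)
```

With a 200×100 image and the objects `[(10, 10, 20, 20), (90, 40, 100, 50), (150, 60, 160, 70)]`, `explore(200, 100, objects)` returns `(4, 3)`: the middle object falls in the overlap between the two planned tiles, so it is detected twice but counted once.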
