GeoVista: Visually Grounded Active Perception for Ultra-High-Resolution Remote Sensing Understanding

Jiashun Zhu; Ronghao Fu; Jiasen Hu; Nachuan Xing; Xu Na; Xiao Yang; Zhiwen Lin; Weipeng Zhang; Lang Sun; Zhiheng Xue; Haoran Liu; Weijie Zhang; Bo Yang

arXiv:2605.14475 · cs.CV · May 15, 2026

TL;DR

GeoVista introduces a planning-driven active perception framework for ultra-high-resolution remote sensing images, enabling global exploration, multi-region inspection, and evidence aggregation to improve understanding.

Contribution

The paper proposes an approach that combines global exploration planning with branch-wise local verification, together with a new trajectory corpus, for interpreting ultra-high-resolution remote sensing images.

Findings

01

Achieves state-of-the-art performance on multiple benchmarks.

02

Effectively verifies multiple candidate regions through branch-wise inspection.

03

Maintains explicit evidence state for cross-region aggregation and de-duplication.

Abstract

Interpreting ultra-high-resolution (UHR) remote sensing images requires models to search for sparse and tiny visual evidence across large-scale scenes. Existing remote sensing vision-language models can inspect local regions with zooming and cropping tools, but most exploration strategies follow either a one-shot focus or a single sequential trajectory. Such single-path exploration can lose global context, leave scattered regions unvisited, and revisit or count the same evidence multiple times. To this end, we propose GeoVista, a planning-driven active perception framework for UHR remote sensing interpretation. Instead of committing to one zooming path, GeoVista first builds a global exploration plan, then verifies multiple candidate regions through branch-wise local inspection, while maintaining an explicit evidence state for cross-region aggregation and de-duplication. To enable this…
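The plan-then-verify loop with a shared evidence state can be illustrated with a minimal Python sketch. It is not the authors' implementation: `EvidenceState`, `to_global`, `explore`, the IoU-based duplicate test, and the 0.5 threshold are all hypothetical choices made for illustration, and crops are assumed to be taken at native resolution so that only an offset is needed to map local boxes back to full-image coordinates.

```python
from dataclasses import dataclass, field

# Box format assumed here: (x1, y1, x2, y2) in pixels.
Box = tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


@dataclass
class EvidenceState:
    """Hypothetical evidence store shared by every inspection branch."""
    iou_threshold: float = 0.5  # assumed value, not taken from the paper
    items: list[tuple[str, Box]] = field(default_factory=list)

    def add(self, label: str, box: Box) -> bool:
        """Record a finding unless a matching one is already stored."""
        for seen_label, seen_box in self.items:
            if seen_label == label and iou(seen_box, box) >= self.iou_threshold:
                return False  # same object seen from another region
        self.items.append((label, box))
        return True

    def count(self, label: str) -> int:
        return sum(1 for seen_label, _ in self.items if seen_label == label)


def to_global(region: Box, local_box: Box) -> Box:
    """Shift a crop-local box into full-image coordinates (no rescaling)."""
    ox, oy = region[0], region[1]
    return (local_box[0] + ox, local_box[1] + oy, local_box[2] + ox, local_box[3] + oy)


def explore(plan, inspect) -> EvidenceState:
    """Visit every planned region and aggregate findings in one state."""
    state = EvidenceState()
    for region in plan:
        for label, local_box in inspect(region):
            state.add(label, to_global(region, local_box))
    return state


# Two overlapping regions; the first ship falls inside both of them.
fake_findings = {
    (0.0, 0.0, 1000.0, 1000.0): [("ship", (850.0, 100.0, 950.0, 200.0))],
    (800.0, 0.0, 1800.0, 1000.0): [
        ("ship", (50.0, 100.0, 150.0, 200.0)),   # same ship, crop-local coords
        ("ship", (700.0, 500.0, 800.0, 600.0)),  # a second, distinct ship
    ],
}
state = explore(list(fake_findings), lambda region: fake_findings[region])
print(state.count("ship"))  # 2
```

Keeping every finding in full-image coordinates is what lets the overlap between the two regions be detected; a per-region count would report three ships here.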

Code & Models

Repositories

ryan6073/GeoVista (GitHub)
