GeoGround: A Unified Large Vision-Language Model for Remote Sensing Visual Grounding

Yue Zhou; Mengcheng Lan; Xiang Li; Litong Feng; Yiping Ke; Xue Jiang; Qingyun Li; Xue Yang; Wayne Zhang

arXiv:2411.11904 · cs.CV · May 13, 2025 · 2 cites

TL;DR

GeoGround is a unified large vision-language model for remote sensing visual grounding. A single model produces horizontal bounding boxes, oriented bounding boxes, and segmentation masks through flexible output formats, and is trained with prompt-assisted and geometry-guided learning.

Contribution

The paper introduces a framework that unifies multiple RS visual grounding tasks in a single model without task-specific architectural changes, and supports dense prediction outputs via the Text-Mask technique.

Findings

01 Strong performance across four RS visual grounding tasks

02 Matches specialized methods on multiple benchmarks

03 Supports flexible output types including masks and bounding boxes

Abstract

Remote sensing (RS) visual grounding aims to use natural language expressions to locate specific objects (in the form of a bounding box or segmentation mask) in RS images, enhancing human interaction with intelligent RS interpretation systems. Early research in this area was primarily based on horizontal bounding boxes (HBBs), but as more diverse RS datasets have become available, tasks involving oriented bounding boxes (OBBs) and segmentation masks have emerged. In practical applications, different targets require different grounding types: an HBB localizes an object's position, an OBB provides its orientation, and a mask depicts its shape. However, existing specialized methods are typically tailored to a single type of RS visual grounding task and are hard to generalize across tasks. In contrast, large vision-language models (VLMs) exhibit powerful multi-task learning capabilities but…
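The three grounding types named in the abstract are geometrically nested: a mask or an OBB can always be reduced to an enclosing HBB, but not the other way round. The sketch below illustrates that relationship in plain Python. It is not taken from the paper or its repository; the function names, the (cx, cy, w, h, angle) OBB parameterization, and the list-of-rows mask format are assumptions chosen for illustration.

```python
import math


def obb_corners(cx, cy, w, h, angle_deg):
    """Corners of a w-by-h box centred at (cx, cy), rotated by angle_deg.

    The (cx, cy, w, h, angle) layout is an assumed convention, not
    necessarily the one used by any particular dataset.
    """
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    half = ((-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2))
    # Rotate each corner offset about the centre, then translate.
    return [(cx + dx * cos_a - dy * sin_a, cy + dx * sin_a + dy * cos_a)
            for dx, dy in half]


def enclosing_hbb(points):
    """Smallest axis-aligned box (x_min, y_min, x_max, y_max) around points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


def mask_to_hbb(mask):
    """HBB of the foreground pixels of a binary mask given as rows of 0/1.

    Returns pixel-index bounds, or None when the mask has no foreground.
    """
    coords = [(x, y) for y, row in enumerate(mask)
              for x, value in enumerate(row) if value]
    return enclosing_hbb(coords) if coords else None
```

For example, `enclosing_hbb(obb_corners(0, 0, 4, 2, 90))` yields a box that is taller than it is wide, which shows how the HBB keeps the object's position and extent while discarding the orientation that the OBB carried.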

Code & Models

Repositories

zytx121/geoground (official)

Models

linhuixiao/Awesome-Visual-Grounding (model, ♡ 1)

Datasets

erenzhou/refGeo (dataset, 112 downloads)

Taxonomy

Topics: Geographic Information Systems Studies · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques