GeoMag: A Vision-Language Model for Pixel-level Fine-Grained Remote Sensing Image Parsing
Xianzhi Ma, Jianhui Li, Changhua Pei, Hao Liu

TL;DR
GeoMag is a versatile vision-language model designed for pixel-level remote sensing image parsing, improving fine-grained understanding and efficiency in high-resolution imagery through adaptive attention and resolution techniques.
Contribution
The paper introduces GeoMag, a novel framework with task-driven multi-granularity adjustment and semantic-aware cropping for enhanced pixel-level remote sensing image analysis.
Findings
Outperforms existing models on 10 benchmarks.
Reduces the computational cost of processing high-resolution RS images.
Excels in small-object recognition scenarios.
Abstract
The application of Vision-Language Models (VLMs) in remote sensing (RS) image understanding has achieved notable progress, demonstrating the basic ability to recognize and describe geographical entities. However, existing RS-VLMs are mostly limited to image-level and region-level tasks, lacking the capability to handle pixel-level tasks and performing poorly in small-object recognition scenarios. Moreover, RS-VLMs consume significant computational resources when processing high-resolution RS images, further restricting their practical applicability. In this context, we propose GeoMag (Geographical Magnifier), an end-to-end general-purpose large model framework for RS. GeoMag dynamically focuses the attention scope based on prompt semantics to effectively perform remote sensing image parsing across multiple levels of granularity. This method introduces Task-driven Multi-granularity…
Taxonomy
Topics: Remote-Sensing Image Classification · Multimodal Machine Learning Applications · Advanced Neural Network Applications
