GeoPix: Multi-Modal Large Language Model for Pixel-level Image Understanding in Remote Sensing

Ruizhe Ou; Yuan Hu; Fan Zhang; Jiaxin Chen; Yu Liu

arXiv:2501.06828 · cs.CV · March 14, 2025

TL;DR

GeoPix is a multi-modal large language model for remote sensing that responds to user instructions with pixel-level segmentation masks. It is supported by a new dataset and a training strategy aimed at more detailed image understanding.

Contribution

The paper introduces GeoPix, presented as the first remote sensing (RS) multi-modal large language model (MLLM) capable of pixel-level segmentation, together with a new dataset and a two-stage training method for multi-task optimization.

Findings

01 GeoPix outperforms existing models in pixel-level segmentation tasks.

02 It maintains competitive performance in image- and region-level benchmarks.

03 The dataset and training strategy effectively support pixel-level remote sensing tasks.

Abstract

Multi-modal large language models (MLLMs) have achieved remarkable success in image- and region-level remote sensing (RS) image understanding tasks, such as image captioning, visual question answering, and visual grounding. However, existing RS MLLMs lack pixel-level dialogue capability, which involves responding to user instructions with segmentation masks for specific instances. In this paper, we propose GeoPix, an RS MLLM that extends image understanding capabilities to the pixel level. This is achieved by equipping the MLLM with a mask predictor, which transforms visual features from the vision encoder into masks conditioned on the LLM's segmentation token embeddings. To facilitate the segmentation of multi-scale objects in RS imagery, a class-wise learnable memory module is integrated into the mask predictor to capture and store class-wise geo-context at the instance level…
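The mask-prediction step described in the abstract can be illustrated with a toy sketch. This is not the GeoPix implementation: it assumes the simplest possible conditioning, in which a segmentation token embedding is linearly projected into the visual feature space and compared with each pixel's feature vector by a dot product. The function name `predict_mask`, the `projection` matrix, and all shapes and values are hypothetical, and the class-wise memory module is omitted entirely.

```python
import math


def predict_mask(pixel_features, seg_token_embedding, projection, threshold=0.5):
    """Toy mask predictor (illustrative assumption, not the paper's method).

    pixel_features:      H x W x C nested lists (stand-in for vision encoder output)
    seg_token_embedding: length-D list (stand-in for the LLM's segmentation token)
    projection:          D x C matrix mapping the token into the visual feature space
    """
    num_channels = len(projection[0])
    # Project the token embedding into the visual feature space to get a query vector.
    query = [
        sum(seg_token_embedding[d] * projection[d][c]
            for d in range(len(seg_token_embedding)))
        for c in range(num_channels)
    ]
    mask = []
    for row in pixel_features:
        mask_row = []
        for feature in row:
            # Similarity between this pixel's feature and the query.
            logit = sum(f * q for f, q in zip(feature, query))
            probability = 1.0 / (1.0 + math.exp(-logit))
            mask_row.append(1 if probability > threshold else 0)
        mask.append(mask_row)
    return mask


# Hypothetical toy inputs: a 2 x 2 feature map with 2 channels, a 3-dim token.
toy_features = [
    [[2.0, 0.0], [0.0, 2.0]],
    [[1.0, 1.0], [3.0, 1.0]],
]
toy_token = [1.0, -1.0, 0.5]
toy_projection = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.0, 0.0],
]

toy_mask = predict_mask(toy_features, toy_token, toy_projection)
print(toy_mask)
```

In this sketch the token acts as a query that selects which pixels belong to the requested instance, so a different instruction (and hence a different token embedding) yields a different mask over the same visual features.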

Code & Models

Models

🤗 Norman-ou/GeoPix-ft-sior_rsicap
model · 53 dl · ♡ 1