TL;DR
GeoPix is a multi-modal large language model (MLLM) for remote sensing that responds to user instructions with pixel-level segmentation masks. It relies on a new dataset and a dedicated training strategy to support this finer-grained image understanding.
Contribution
The paper introduces GeoPix, described as the first remote sensing (RS) MLLM capable of pixel-level segmentation, together with a new dataset and a two-stage training method for multi-task optimization.
Findings
GeoPix outperforms existing models in pixel-level segmentation tasks.
It maintains competitive performance in image- and region-level benchmarks.
The dataset and training strategy effectively support pixel-level remote sensing tasks.
Abstract
Multi-modal large language models (MLLMs) have achieved remarkable success in image- and region-level remote sensing (RS) image understanding tasks, such as image captioning, visual question answering, and visual grounding. However, existing RS MLLMs lack pixel-level dialogue capability, which involves responding to user instructions with segmentation masks for specific instances. In this paper, we propose GeoPix, an RS MLLM that extends image understanding capabilities to the pixel level. This is achieved by equipping the MLLM with a mask predictor, which transforms visual features from the vision encoder into masks conditioned on the LLM's segmentation token embeddings. To facilitate the segmentation of multi-scale objects in RS imagery, a class-wise learnable memory module is integrated into the mask predictor to capture and store class-wise geo-context at the instance level…
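To make the idea of a mask "conditioned on the LLM's segmentation token embeddings" more concrete, here is a minimal sketch in pure Python. It is not GeoPix's mask predictor: it assumes the simplest possible conditioning, a dot product between each pixel's feature vector and a single embedding vector, followed by a sigmoid and a threshold. The names `predict_mask`, `features`, and `seg_embedding` are hypothetical, and the class-wise memory module mentioned in the abstract is left out entirely.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def predict_mask(features, seg_embedding, threshold=0.5):
    """Toy token-conditioned mask decoding (illustrative assumption only).

    features:      H x W grid of C-dimensional feature vectors (nested lists),
                   standing in for vision-encoder output.
    seg_embedding: C-dimensional vector, standing in for the embedding of the
                   LLM's segmentation token.
    Returns an H x W binary mask as nested lists of 0/1.
    """
    mask = []
    for row in features:
        mask_row = []
        for pixel in row:
            if len(pixel) != len(seg_embedding):
                raise ValueError("feature and embedding dimensions differ")
            # Similarity between this pixel's features and the token embedding.
            logit = sum(f * e for f, e in zip(pixel, seg_embedding))
            mask_row.append(1 if sigmoid(logit) > threshold else 0)
        mask.append(mask_row)
    return mask


# Toy 2 x 2 feature map with 2 channels per pixel.
features = [
    [[2.0, 0.0], [0.0, 2.0]],
    [[1.5, 0.5], [-1.0, 0.0]],
]
seg_embedding = [1.0, -1.0]  # hypothetical segmentation-token embedding

mask = predict_mask(features, seg_embedding)
print(mask)  # [[1, 0], [1, 0]]
```

Changing `seg_embedding` changes which pixels are selected while the visual features stay fixed, which is the sense in which the mask is conditioned on the token. A real predictor would use learned projections and a decoder network rather than a raw dot product.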
