Text2Seg: Remote Sensing Image Semantic Segmentation via Text-Guided Visual Foundation Models

Jielu Zhang; Zhongliang Zhou; Gengchen Mai; Mengxuan Hu; Zihan Guan; Sheng Li; Lan Mu

arXiv:2304.10597 · cs.CV · January 21, 2026 · 31 citations

TL;DR

Text2Seg is an approach to remote sensing semantic segmentation that combines visual foundation models with automatically generated prompts, reducing dependence on per-pixel annotations and improving zero-shot performance across diverse datasets.

Contribution

The paper presents Text2Seg, a method that addresses two limitations of remote sensing segmentation, scarce annotations and weak transferability, by pairing visual foundation models with automatically generated prompts.

Findings

01. Significant improvement in zero-shot segmentation performance, with relative gains from 31% to 225%.

02. Reduces reliance on fully annotated datasets through automatic prompt generation.

03. Enhances generalization ability across diverse remote sensing datasets.
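
The relative gains quoted in the first finding are percentages measured against a baseline score. A minimal sketch of that arithmetic follows, assuming the usual definition (improved minus baseline, divided by baseline). The function name and the input scores are hypothetical, chosen only to land on the two ends of the reported range; they are not values from the paper.

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline * 100.0


# Hypothetical scores (NOT from the paper), picked to illustrate the range.
low = relative_gain(baseline=0.40, improved=0.524)
high = relative_gain(baseline=0.20, improved=0.65)

print(f"{low:.0f}%")   # 31%
print(f"{high:.0f}%")  # 225%
```

A relative gain above 100% means the score more than doubled, which is why the same absolute improvement looks much larger when the baseline is low.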

Abstract

Remote sensing imagery has attracted significant attention in recent years due to its instrumental role in global environmental monitoring, land usage monitoring, and more. As image databases grow each year, performing automatic segmentation with deep learning models has gradually become the standard approach for processing the data. Despite the improved performance of current models, certain limitations remain unresolved. Firstly, training deep learning models for segmentation requires per-pixel annotations. Given the large size of datasets, only a small portion is fully annotated and ready for training. Additionally, the high intra-dataset variance in remote sensing data limits the transfer learning ability of such models. Although recently proposed generic segmentation models like SAM have shown promising results in zero-shot instance-level segmentation, adapting them to semantic…
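
The abstract contrasts instance-level masks with semantic segmentation. The sketch below shows one generic way to collapse labeled instance masks into a single per-pixel class map using NumPy. It is an illustration of the distinction only, not the procedure from the paper (the abstract is truncated above); the function name, the score-based overlap rule, and the toy inputs are all assumptions.

```python
import numpy as np


def instances_to_semantic(masks, class_ids, scores, shape, background=0):
    """Merge boolean instance masks into one semantic label map.

    Assumed rule (hypothetical): where masks overlap, the instance with
    the higher score determines the pixel's class.
    """
    semantic = np.full(shape, background, dtype=np.int32)
    # Paint in ascending score order so higher-scoring masks overwrite.
    for i in np.argsort(scores):
        semantic[masks[i]] = class_ids[i]
    return semantic


# Toy 4x4 example with two overlapping instances.
top_rows = np.zeros((4, 4), dtype=bool)
top_rows[:2, :] = True            # class 1, lower score
left_cols = np.zeros((4, 4), dtype=bool)
left_cols[:, :2] = True           # class 2, higher score

label_map = instances_to_semantic(
    masks=[top_rows, left_cols],
    class_ids=[1, 2],
    scores=[0.6, 0.9],
    shape=(4, 4),
)
print(label_map)
```

Pixels covered by no instance keep the background label, and the overlapping top-left block takes class 2 because that instance has the higher score.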

Code & Models

Repositories

douglas2code/text2seg
PyTorch · Official

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Vision Transformer · Linear Layer · Adam · Dense Connections · Label Smoothing · Dropout · Absolute Position Encodings · Position-Wise Feed-Forward Layer