Open-Vocabulary Semantic Segmentation Network Integrating Object-Level Label and Scene-Level Semantic Features for Multimodal Remote Sensing Images

Jinkun Dai; Yuanxin Ye; Peng Tang; Tengfeng Tang; Xianping Ma; Jing Xiao; Mi Wang

arXiv:2604.24125·cs.CV·April 28, 2026

TL;DR

TSMNet is a multi-modal remote sensing segmentation network that combines textual supervision with visual data for open-vocabulary semantic segmentation, with the aim of improving accuracy and generalization across regions and sensors.

Contribution

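TSMNet introduces a dual-branch text encoder, which extracts scene-level semantic and object-level label information from text, and a text-guided fusion module. The authors present this as an early integration of textual knowledge into multi-modal remote sensing segmentation.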

Findings

01 TSMNet is reported to outperform the state-of-the-art models it is compared against in accuracy.

02 The model demonstrates strong generalization across different geographic regions and sensor data.

03 The authors constructed new multi-modal datasets for comprehensive evaluation.

Abstract

Semantic segmentation of multi-modal remote sensing imagery plays a pivotal role in land use/land cover (LULC) mapping, environmental monitoring, and precision earth observation. Current multi-modal approaches mainly focus on integrating complementary visual modalities, yet neglect the incorporation of non-visual textual data, a rich source of knowledge that can bridge semantic gaps between visual patterns and real-world concepts. To address this limitation, we propose TSMNet, a text-supervised multi-modal network that integrates textual supervision with visual representations for open-vocabulary semantic segmentation. Unlike conventional multi-modal segmentation frameworks, TSMNet introduces a dual-branch text encoder to extract both scene-level semantic and object-level label information from various textual data, enabling dynamic…
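As a rough illustration of what "open-vocabulary" means here, the sketch below labels each pixel by comparing its feature vector with text embeddings of the candidate class names and picking the closest one. This is a generic similarity-matching scheme, not TSMNet's actual architecture: the functions `cosine` and `segment`, the toy three-dimensional embeddings, and the class names are all hypothetical, and a real system would obtain both sets of vectors from trained image and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def segment(pixel_features, label_embeddings):
    """Assign each pixel the label whose text embedding is most similar."""
    labels = list(label_embeddings)
    return [
        [
            max(labels, key=lambda name: cosine(feat, label_embeddings[name]))
            for feat in row
        ]
        for row in pixel_features
    ]


# Hypothetical toy text embeddings, one per class name (assumed values).
label_embeddings = {
    "water": [1.0, 0.0, 0.0],
    "building": [0.0, 1.0, 0.0],
    "forest": [0.0, 0.0, 1.0],
}

# Hypothetical 2x2 image: one feature vector per pixel (assumed values).
pixel_features = [
    [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1]],
    [[0.1, 0.0, 0.7], [0.6, 0.3, 0.1]],
]

mask = segment(pixel_features, label_embeddings)

# The vocabulary can be extended at inference time by adding a new
# text embedding; no change to the pixel features is needed.
extended = dict(label_embeddings, road=[0.7, 0.7, 0.0])
mask_extended = segment(pixel_features, extended)
```

With these toy values, the bottom-right pixel is labelled `water` under the original vocabulary and `road` once the extra class is added, which is the property that distinguishes open-vocabulary segmentation from a fixed-class classifier head.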


Code & Models

Repositories

yeyuanxin110/TSMNet (GitHub)
