TL;DR
TSMNet is a novel multi-modal remote sensing segmentation model that integrates textual supervision with visual data, enhancing accuracy and generalization across diverse scenarios.
Contribution
It introduces a dual-branch text encoder and a text-guided fusion module, pioneering the integration of textual knowledge into remote sensing segmentation.
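The summary does not say how the two text branches or the fusion module are implemented, so the following is only a toy sketch of one way text-guided fusion could work, not the authors' design. It assumes a scene-level text embedding that gates visual features channel by channel, and object-level label embeddings that classify each pixel by cosine similarity. The function name `text_guided_fusion`, the gating scheme, and all vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def text_guided_fusion(pixel_feats, scene_emb, label_embs):
    """Hypothetical fusion step (an assumption, not TSMNet's actual module).

    pixel_feats: one visual feature vector per pixel
    scene_emb:   scene-level text embedding, used as a channel-wise gate
    label_embs:  one object-level text embedding per class name
    Returns the index of the best-matching class for each pixel.
    """
    gate = [sigmoid(s) for s in scene_emb]
    labels = []
    for feat in pixel_feats:
        # Modulate the visual feature with the scene-level gate.
        fused = [f * g for f, g in zip(feat, gate)]
        # Score the fused feature against every class-name embedding.
        scores = [cosine(fused, emb) for emb in label_embs]
        labels.append(max(range(len(scores)), key=scores.__getitem__))
    return labels


# Toy inputs: two pixels, three feature channels, two class names.
pixel_feats = [[1.0, 0.1, 0.0], [0.0, 0.2, 1.0]]
scene_emb = [0.5, -0.5, 0.5]
label_embs = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
predicted = text_guided_fusion(pixel_feats, scene_emb, label_embs)
```

Because classes are represented by text embeddings rather than a fixed output layer, adding a class in this sketch only requires appending another vector to `label_embs`, which is the general idea behind open-vocabulary segmentation.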
Findings
TSMNet outperforms state-of-the-art models in accuracy.
The model demonstrates strong generalization across different geographical and sensor data.
The authors constructed new multi-modal datasets for comprehensive evaluation.
Abstract
Semantic segmentation of multi-modal remote sensing imagery plays a pivotal role in land use/land cover (LULC) mapping, environmental monitoring, and precision earth observation. Current multi-modal approaches mainly focus on integrating complementary visual modalities, yet neglect the incorporation of non-visual textual data, a rich source of knowledge that can bridge semantic gaps between visual patterns and real-world concepts. To address this limitation, we propose TSMNet, a text-supervised multi-modal network that integrates textual supervision with visual representations for open-vocabulary semantic segmentation. Unlike conventional multi-modal segmentation frameworks, TSMNet introduces a dual-branch text encoder to extract both scene-level semantic and object-level label information from various textual data, enabling dynamic…
