Masks Can Talk: Extracting Structured Text Information from Single-Modal Images for Remote Sensing Change Detection
Kai Zheng, Hang-Cheng Dong, Jiatong Pan, Zhenkai Wu, Fupeng Wei, Wei Zhang

TL;DR
This paper introduces S2M, a framework that extracts structured textual information from change masks in remote sensing images, improving change detection without extra annotation.
Contribution
S2M automatically transcribes change masks into structured semantic quadruples, enabling dense multimodal supervision without additional annotation cost.
Findings
S2M outperforms existing multimodal methods on the Gaza-Change-v2 dataset.
Structured textual features improve change detection accuracy.
Masks encode detailed change semantics that can be extracted for better analysis.
Abstract
Remote sensing change detection is pivotal for urban monitoring, disaster assessment, and environmental resource management. Yet, unimodal deep learning methods frequently confuse genuine semantic changes with visually similar but irrelevant variations. Recent multimodal approaches incorporate text as auxiliary supervision, but their descriptions are either semantically coarse and unstructured or model-generated and thus noisy. Critically, all of them overlook a simple fact: fine-grained change semantics are already implicitly encoded in the ground-truth mask labels that come standard with every change detection dataset. These masks know where the change happened, what the land-cover types were before and after, how the transition occurred, and how many objects were involved. In this paper, we propose S2M, a framework that obtains structured textual features directly from change labels…
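To make the idea of "reading" semantics off a mask concrete, here is a minimal sketch of how a pair of land-cover label maps could be turned into structured records. It is an illustration only, not S2M's actual transcription procedure: the quadruple layout (location, class before, class after, object count), the 3x3 location grid, the class ids and names, and the function names `coarse_location` and `masks_to_quadruples` are all assumptions introduced here.

```python
from collections import Counter, deque

# Hypothetical class ids and names; not taken from the paper.
CLASS_NAMES = {0: "bare land", 1: "building", 2: "vegetation"}


def coarse_location(row, col, height, width):
    """Map a pixel position to a coarse 3x3 grid label such as 'top-left'."""
    vertical = ("top", "middle", "bottom")[3 * row // height]
    horizontal = ("left", "center", "right")[3 * col // width]
    return f"{vertical}-{horizontal}"


def masks_to_quadruples(before, after, class_names=CLASS_NAMES):
    """Return sorted (location, before, after, count) tuples for changed regions.

    `before` and `after` are equally sized 2D lists of integer class ids.
    A region is a 4-connected set of pixels sharing the same class transition.
    """
    height, width = len(before), len(before[0])
    seen = [[False] * width for _ in range(height)]
    counts = Counter()
    for r in range(height):
        for c in range(width):
            if seen[r][c] or before[r][c] == after[r][c]:
                continue  # already visited, or no change at this pixel
            pair = (before[r][c], after[r][c])
            # Flood fill over neighbouring pixels with the same transition.
            queue, pixels = deque([(r, c)]), []
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and not seen[ny][nx]
                            and (before[ny][nx], after[ny][nx]) == pair):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Locate the region by the (integer) centroid of its pixels.
            cy = sum(y for y, _ in pixels) // len(pixels)
            cx = sum(x for _, x in pixels) // len(pixels)
            key = (coarse_location(cy, cx, height, width),
                   class_names[pair[0]], class_names[pair[1]])
            counts[key] += 1
    return sorted(key + (n,) for key, n in counts.items())


# Toy 6x6 example: two new buildings top-left, one cleared vegetation patch.
before = [[0] * 6 for _ in range(6)]
after = [[0] * 6 for _ in range(6)]
before[5][5] = 2
after[0][0] = 1
after[1][1] = 1
print(masks_to_quadruples(before, after))
# [('bottom-right', 'vegetation', 'bare land', 1), ('top-left', 'bare land', 'building', 2)]
```

Each tuple could then be rendered as a short sentence (for example, "2 regions changed from bare land to building in the top-left") and fed to a text encoder. How S2M actually structures and encodes its quadruples is described in the paper itself.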
