Masks Can Talk: Extracting Structured Text Information from Single-Modal Images for Remote Sensing Change Detection

Kai Zheng; Hang-Cheng Dong; Jiatong Pan; Zhenkai Wu; Fupeng Wei; Wei Zhang

arXiv:2605.07178 · cs.CV · May 11, 2026

TL;DR

This paper introduces S2M, a framework that derives structured textual information from the ground-truth change masks of remote sensing datasets, improving change detection without extra annotation.

Contribution

S2M automatically transcribes change masks into structured semantic quadruples, enabling dense multimodal supervision without additional annotation cost.

Findings

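01. S2M outperforms existing multimodal methods on the Gaza-Change-v2 dataset.

02. Structured textual features improve change detection accuracy.

03. Ground-truth masks already encode fine-grained change semantics (where the change happened, the land-cover types before and after, how the transition occurred, and how many objects were involved) that can be extracted as text.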

Abstract

Remote sensing change detection is pivotal for urban monitoring, disaster assessment, and environmental resource management. Yet, unimodal deep learning methods frequently confuse genuine semantic changes with visually similar but irrelevant variations. Recent multimodal approaches incorporate text as auxiliary supervision, but their descriptions are either semantically coarse and unstructured or model-generated and thus noisy. Critically, all of them overlook a simple fact: fine-grained change semantics are already implicitly encoded in the ground-truth mask labels that come standard with every change detection dataset. These masks know where the change happened, what the land-cover types were before and after, how the transition occurred, and how many objects were involved. In this paper, we propose S2M, a framework that obtains structured textual features directly from change labels…
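
To make the idea of reading text off a mask concrete, here is a minimal sketch, not the S2M implementation. It assumes the labels are available as two small land-cover label maps (before and after) and emits one four-field record per class transition: a coarse location, the class before, the class after, and the number of connected changed regions. The class names, the quadrant-based location, the field layout, and every function name are hypothetical choices made for illustration; this summary does not specify how the paper defines its quadruples.

```python
from collections import deque

# Hypothetical class names; real datasets define their own label sets.
CLASS_NAMES = {0: "background", 1: "building", 2: "rubble"}


def count_regions(cells):
    """Count 4-connected regions in a collection of (row, col) cells."""
    remaining = set(cells)
    regions = 0
    while remaining:
        regions += 1
        queue = deque([remaining.pop()])
        while queue:
            r, c = queue.popleft()
            for nb in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if nb in remaining:
                    remaining.remove(nb)
                    queue.append(nb)
    return regions


def coarse_location(cells, height, width):
    """Name the image quadrant that contains the centroid of the cells."""
    row = sum(r for r, _ in cells) / len(cells)
    col = sum(c for _, c in cells) / len(cells)
    vertical = "upper" if row < height / 2 else "lower"
    horizontal = "left" if col < width / 2 else "right"
    return f"{vertical}-{horizontal}"


def describe_changes(before, after):
    """Return one (where, class_before, class_after, count) tuple per transition."""
    height, width = len(before), len(before[0])
    groups = {}
    for r in range(height):
        for c in range(width):
            if before[r][c] != after[r][c]:
                key = (before[r][c], after[r][c])
                groups.setdefault(key, []).append((r, c))
    records = []
    for (src, dst), cells in sorted(groups.items()):
        records.append((
            coarse_location(cells, height, width),
            CLASS_NAMES[src],
            CLASS_NAMES[dst],
            count_regions(cells),
        ))
    return records


# Toy 4x4 label maps (assumed data, not taken from any dataset).
before = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
]
after = [
    [2, 2, 0, 0],
    [2, 2, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 1],
]

for record in describe_changes(before, after):
    print(record)
```

Each record could then be rendered into a sentence or passed to a text encoder as auxiliary supervision. The point of the sketch is only that everything in the record comes from labels that already exist, so no extra annotation is needed.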
