TL;DR
This paper introduces PTNet, a framework for joint change captioning and change detection in urban scenes, together with UCCD, a large UAV-based dataset for urban construction monitoring.
Contribution
The paper proposes an approach that explicitly models structured change semantics, plus a new benchmark dataset, advancing semantic understanding in high-resolution urban change detection.
Findings
PTNet outperforms existing methods on the UCCD and WHU-CDC datasets.
The UCCD dataset contains 9,000 image pairs and 45,000 sentences (five per pair on average) for urban monitoring.
The source code and dataset are publicly available.
Abstract
Remote Sensing Image Change Captioning (RSICC) aims to generate spatially grounded natural language descriptions of scene evolution from bi-temporal imagery, moving beyond binary change masks toward semantic-level understanding. However, existing methods rely on implicit feature differencing without explicitly modeling structured change semantics, and struggle to reconcile the conflicting representation demands of change detection and caption generation. In addition, current benchmarks provide limited coverage of high-resolution urban construction scenarios. To address these challenges, we propose PTNet, a prototype-guided task-adaptive framework for joint change captioning and detection. PTNet explicitly models structured change semantics through a learnable prototype bank that guides cross-temporal interaction, disentangles task-specific representations via multi-head gating, and…
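The two mechanisms named in the abstract, a learnable prototype bank and task-specific gating, can be illustrated with a rough sketch. This is not PTNet's implementation: the abstract gives no equations, so the softmax attention over prototypes, the per-channel sigmoid gates, and every name below (`prototype_guided_change`, `gated_heads`, `gate_logits`) are assumptions made purely for illustration, and the random values stand in for learned parameters.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def prototype_guided_change(feat_t1, feat_t2, prototypes):
    """Assumed form: attend from the bi-temporal difference to a prototype bank.

    Returns the attention weights and the weighted mixture of prototypes
    for a single spatial location.
    """
    diff = [b - a for a, b in zip(feat_t1, feat_t2)]
    weights = softmax([dot(diff, p) for p in prototypes])
    mixed = [sum(w * p[j] for w, p in zip(weights, prototypes))
             for j in range(len(diff))]
    return weights, mixed


def gated_heads(change_feat, gate_logits):
    """Assumed form: one per-channel sigmoid gate vector per task."""
    return {task: [sigmoid(g) * x for g, x in zip(logits, change_feat)]
            for task, logits in gate_logits.items()}


# Hypothetical toy setup: 4 channels, 3 prototypes, 2 tasks.
rng = random.Random(0)
dim, num_prototypes = 4, 3
prototypes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_prototypes)]
feat_t1 = [rng.gauss(0, 1) for _ in range(dim)]
feat_t2 = [rng.gauss(0, 1) for _ in range(dim)]
gate_logits = {task: [rng.gauss(0, 1) for _ in range(dim)]
               for task in ("caption", "detection")}

weights, change_feat = prototype_guided_change(feat_t1, feat_t2, prototypes)
task_feats = gated_heads(change_feat, gate_logits)
```

In this toy version each task receives its own soft selection of the shared change feature, which is one simple way the captioning and detection branches could end up with different representations.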
