Change Captioning in Remote Sensing: Evolution to SAT-Cap -- A Single-Stage Transformer Approach
Yuduo Wang, Weikang Yu, Pedram Ghamisi

TL;DR
This paper introduces SAT-Cap, a single-stage transformer model for remote sensing change captioning that reduces complexity and improves semantic detail extraction, outperforming existing methods on key datasets.
Contribution
The paper presents SAT-Cap, a novel single-stage transformer approach with spatial-channel attention and cosine similarity fusion for more efficient and detailed change captioning in remote sensing images.
Findings
Achieves a CIDEr score of 140.23% on the LEVIR-CC dataset.
Achieves a CIDEr score of 97.74% on the DUBAI-CC dataset.
Outperforms current state-of-the-art methods.
Abstract
Change captioning has become essential for accurately describing changes in multi-temporal remote sensing data, providing an intuitive way to monitor Earth's dynamics through natural language. However, existing change captioning methods face two key challenges: high computational demands due to multi-stage fusion strategies, and insufficient detail in object descriptions due to limited semantic extraction from individual images. To address these challenges, we propose SAT-Cap, a transformer-based model with single-stage feature fusion for remote sensing change captioning. In particular, SAT-Cap integrates a Spatial-Channel Attention Encoder, a Difference-Guided Fusion module, and a Caption Decoder. Compared to typical models that require multi-stage fusion in the transformer encoder and fusion module, SAT-Cap uses only a simple cosine similarity-based fusion module for information…
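The abstract names a cosine similarity-based fusion module but is cut off before describing it. As a rough illustration of the general idea only, the sketch below fuses two sequences of token features (before and after images) in a single pass. The function names, the list-of-vectors feature layout, and the (1 - similarity) weighting of the difference vector are all assumptions made for this example, not the formulation used in SAT-Cap.

```python
import math


def cosine_similarity(u, v, eps=1e-8):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def difference_guided_fusion(feat_before, feat_after):
    """Fuse two token sequences (lists of channel vectors) in one pass.

    ASSUMPTION (hypothetical, not taken from the paper): each token's
    difference vector is scaled by (1 - cosine similarity), so positions
    whose features point the same way contribute almost nothing.
    """
    fused = []
    for u, v in zip(feat_before, feat_after):
        sim = cosine_similarity(u, v)
        weight = 1.0 - sim  # near 0 when unchanged, up to 2 when opposite
        fused.append([weight * (b - a) for a, b in zip(u, v)])
    return fused


# Two tokens with two channels each: the first is unchanged, the second changes.
before = [[1.0, 0.0], [1.0, 0.0]]
after = [[1.0, 0.0], [0.0, 1.0]]
fused = difference_guided_fusion(before, after)
```

In this toy input the unchanged token yields a zero vector while the changed token keeps its full difference, which is the kind of change-focused signal a caption decoder could attend to.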
Taxonomy
Methods: Softmax · Attention Is All You Need
