RSCaMa: Remote Sensing Image Change Captioning with State Space Model
Chenyang Liu, Keyan Chen, Bowen Chen, Haotian Zhang, Zhengxia Zou, and Zhenwei Shi

TL;DR
This paper introduces RSCaMa, a novel model for remote sensing image change captioning that employs state space models for efficient joint spatial-temporal feature modeling, significantly improving performance over previous methods.
Contribution
The paper proposes RSCaMa, integrating Mamba-based spatial and temporal SSMs for enhanced bi-temporal feature refinement in RSICC, and systematically compares different language decoders.
Findings
RSCaMa achieves superior accuracy in RSICC tasks.
Mamba-based models outperform CNN and Transformer counterparts.
The study provides insights into language decoder effectiveness.
Abstract
Remote Sensing Image Change Captioning (RSICC) aims to describe surface changes between multi-temporal remote sensing images in language, including the changed object categories, locations, and dynamics of changing objects (e.g., added or disappeared). This poses challenges to spatial and temporal modeling of bi-temporal features. Although previous methods have made progress in spatial change perception, weaknesses remain in joint spatial-temporal modeling. To address this, we propose a novel RSCaMa model, which achieves efficient joint spatial-temporal modeling through multiple CaMa layers, enabling iterative refinement of bi-temporal features. To achieve efficient spatial modeling, we introduce the recently popular Mamba (a state space model), which has a global receptive field and linear complexity, into the RSICC task and propose the Spatial Difference-aware SSM…
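To make the "global receptive field and linear complexity" property concrete, the following is a minimal sketch of a plain linear state space recurrence over a token sequence. It is not the authors' Spatial Difference-aware SSM and not Mamba's selective scan: the function name `ssm_scan` and the fixed per-channel scalars `a`, `b`, `c` are assumptions made for illustration, whereas Mamba makes its parameters input-dependent. The sketch only shows why one pass over L tokens costs O(L) while each output still depends on every earlier token.

```python
from typing import List


def ssm_scan(
    x: List[List[float]],
    a: List[float],
    b: List[float],
    c: List[float],
) -> List[List[float]]:
    """Run a per-channel linear state space recurrence over a sequence.

    x is a sequence of L tokens, each with D channels.
    a, b, c are hypothetical fixed per-channel scalars (illustration only):
        h_t = a * h_{t-1} + b * x_t
        y_t = c * h_t
    """
    d = len(a)
    h = [0.0] * d  # hidden state, one scalar per channel
    y: List[List[float]] = []
    for token in x:  # single pass over the sequence: O(L) time
        # The state carries a decayed summary of all earlier tokens.
        h = [a[k] * h[k] + b[k] * token[k] for k in range(d)]
        y.append([c[k] * h[k] for k in range(d)])
    return y
```

With `a = b = c = 1` the recurrence reduces to a running sum, so the last output aggregates the whole sequence; choosing `|a| < 1` makes older tokens decay instead.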
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques
Methods: Attention Is All You Need · Dropout · Residual Connection · Softmax · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Absolute Position Encodings · Linear Layer · Dense Connections · Label Smoothing
