RSCaMa: Remote Sensing Image Change Captioning with State Space Model

Chenyang Liu, Keyan Chen, Bowen Chen, Haotian Zhang, Zhengxia Zou, and Zhenwei Shi

arXiv:2404.18895 · cs.CV · May 22, 2024

TL;DR

This paper introduces RSCaMa, a novel model for remote sensing image change captioning that employs state space models for efficient joint spatial-temporal feature modeling, significantly improving performance over previous methods.

Contribution

The paper proposes RSCaMa, which integrates Mamba-based spatial and temporal state space models (SSMs) to refine bi-temporal features for remote sensing image change captioning (RSICC), and systematically compares different language decoders.
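
To make "joint spatial-temporal refinement" concrete, here is a toy sketch under loudly stated assumptions; it is not the authors' implementation (see the official repository chen-yang-liu/rscama for that). Assumed here: each image is a single-channel list of tokens, a fixed-decay linear recurrence stands in for the Mamba SSM, "difference awareness" is modeled by adding |t1 - t2| to each image's tokens, and temporal modeling interleaves the two dates' tokens into one sequence. All function names are hypothetical.

```python
from typing import List, Tuple

Tokens = List[float]  # one scalar feature per spatial token (single channel for brevity)


def linear_scan(x: Tokens, decay: float = 0.5) -> Tokens:
    """Stand-in for an SSM (assumption): h_t = decay * h_{t-1} + x_t, emitted at every step."""
    h, out = 0.0, []
    for x_t in x:
        h = decay * h + x_t
        out.append(h)
    return out


def spatial_step(t1: Tokens, t2: Tokens) -> Tuple[Tokens, Tokens]:
    # Hypothetical difference-aware input: each image's tokens plus |t1 - t2|.
    diff = [abs(a - b) for a, b in zip(t1, t2)]
    s1 = linear_scan([a + d for a, d in zip(t1, diff)])
    s2 = linear_scan([b + d for b, d in zip(t2, diff)])
    return s1, s2


def temporal_step(t1: Tokens, t2: Tokens) -> Tuple[Tokens, Tokens]:
    # Hypothetical interleaving (t1[0], t2[0], t1[1], t2[1], ...) so a single
    # 1-D scan sees both dates of each location back to back.
    interleaved = [v for pair in zip(t1, t2) for v in pair]
    scanned = linear_scan(interleaved)
    # De-interleave back into one sequence per date.
    return scanned[0::2], scanned[1::2]


def refine(t1: Tokens, t2: Tokens, num_layers: int = 3) -> Tuple[Tokens, Tokens]:
    """Stack of hypothetical layers: spatial step, then temporal step, with residuals."""
    for _ in range(num_layers):
        s1, s2 = spatial_step(t1, t2)
        u1, u2 = temporal_step(s1, s2)
        t1 = [a + u for a, u in zip(t1, u1)]
        t2 = [b + u for b, u in zip(t2, u2)]
    return t1, t2
```

Stacking layers lets information cross between dates repeatedly: after one layer, each image's tokens already carry features of the other date through both the difference term and the interleaved scan.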

Findings

01. RSCaMa achieves higher accuracy than previous methods on the RSICC task.

02. Mamba-based models outperform their CNN and Transformer counterparts.

03. The study's systematic comparison of language decoders offers insight into their relative effectiveness.

Abstract

Remote Sensing Image Change Captioning (RSICC) aims to describe, in natural language, the surface changes between multi-temporal remote sensing images, including the categories and locations of changed objects and their dynamics (e.g., added or disappeared). This poses challenges for both spatial and temporal modeling of bi-temporal features. Although previous methods have made progress in spatial change perception, joint spatial-temporal modeling remains weak. To address this, we propose RSCaMa, a novel model that achieves efficient joint spatial-temporal modeling through multiple CaMa layers, enabling iterative refinement of bi-temporal features. For efficient spatial modeling, we introduce Mamba, a recently popular state space model with a global receptive field and linear complexity, into the RSICC task and propose the Spatial Difference-aware SSM…
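
The abstract credits Mamba with a global receptive field and linear complexity. A minimal sketch of where both properties come from, assuming a simplified single-channel variant of a Mamba-style selective scan (a generic illustration, not RSCaMa's code; the function and parameter names are hypothetical): each step updates a fixed-size hidden state, so one pass over L tokens costs O(L·N) for N state dimensions, while each output can depend on every earlier token through the recurrence.

```python
import math
from typing import List, Sequence


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def selective_scan(
    x: Sequence[float],
    A: Sequence[float],
    w_delta: float,
    b_delta: float,
    w_B: Sequence[float],
    w_C: Sequence[float],
    D: float = 0.0,
) -> List[float]:
    """Hypothetical single-channel selective SSM scan in one left-to-right pass.

    A holds the (negative) diagonal of the continuous state matrix, one entry
    per state dimension N; cost is O(L * N) for a sequence of length L.
    """
    h = [0.0] * len(A)
    ys: List[float] = []
    for x_t in x:
        # "Selective": step size and projections are functions of the current input.
        delta = softplus(w_delta * x_t + b_delta)
        B_t = [w * x_t for w in w_B]
        C_t = [w * x_t for w in w_C]
        # Discretize: zero-order hold for A, a common simplified Euler step for B.
        h = [math.exp(delta * a) * h_i + delta * b * x_t
             for a, h_i, b in zip(A, h, B_t)]
        # Read out the state and add the skip connection D * x_t.
        ys.append(sum(c * h_i for c, h_i in zip(C_t, h)) + D * x_t)
    return ys
```

Because the state is carried forward instead of being compared against every other token, cost grows linearly with sequence length, unlike the quadratic pairwise interactions of self-attention. A single scan like this is causal (each output sees only earlier tokens in scan order); covering a whole image generally calls for additional scan directions.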

Code & Models

Repositories

chen-yang-liu/rscama
PyTorch · Official

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques

Methods: Attention Is All You Need · Dropout · Residual Connection · Softmax · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Absolute Position Encodings · Linear Layer · Dense Connections · Label Smoothing