MV-CC: Mask Enhanced Video Model for Remote Sensing Change Caption
Ruixun Liu, Kaiyu Li, Jiayi Song, Dongwei Sun, Xiangyong Cao

TL;DR
This paper introduces MV-CC, a video-model-based approach to remote sensing change captioning. It removes the hand-designed fusion module from the architecture and uses change masks to focus the model on changed regions, which the authors report improves performance.
Contribution
The paper proposes a mask-enhanced video model for change captioning. It eliminates manual fusion-module design by relying on an off-the-shelf video encoder, and it uses change masks to direct the model toward changed regions.
Findings
Outperforms state-of-the-art RSICC methods
Uses off-the-shelf video encoder for spatial and temporal features
Employs change masks to improve focus on regions of interest
Abstract
Remote sensing image change caption (RSICC) aims to provide natural language descriptions for bi-temporal remote sensing images. Since the Change Caption (CC) task requires both spatial and temporal features, previous works follow an encoder-fusion-decoder architecture. They use an image encoder to extract spatial features and a fusion module to integrate the spatial features and extract temporal features, which leads to increasingly complex manual design of the fusion module. In this paper, we introduce a novel video model-based paradigm that requires no fusion module design and propose a Mask-enhanced Video model for Change Caption (MV-CC). Specifically, we use an off-the-shelf video encoder to simultaneously extract the temporal and spatial features of bi-temporal images. Furthermore, the types of changes in the CC are set based on specific task requirements, and to enable the model to…
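The core idea in the abstract, treating a bi-temporal image pair as a two-frame video and using a mask to emphasize changed regions, can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration: the function names `make_clip` and `mask_weighted_pool` are hypothetical, the mask is derived from a toy pixel difference, and the mask-weighted pooling is only one plausible way to use a mask. The truncated abstract does not say how MV-CC obtains or consumes its masks, so this should not be read as the authors' method.

```python
import numpy as np


def make_clip(img_before, img_after):
    """Stack two co-registered images (H, W, C) into a 2-frame clip (T=2, H, W, C).

    A video encoder could then consume this clip directly, instead of
    encoding each image separately and fusing the results.
    """
    if img_before.shape != img_after.shape:
        raise ValueError("bi-temporal images must share the same shape")
    return np.stack([img_before, img_after], axis=0)


def mask_weighted_pool(features, change_mask, eps=1e-8):
    """Average per-location features (H, W, D), weighted by a mask (H, W) in [0, 1].

    Hypothetical use of a change mask: locations marked as changed
    contribute more to the pooled vector. Falls back to a plain mean
    when the mask is empty.
    """
    weights = change_mask.astype(np.float64)[..., None]  # (H, W, 1)
    total = weights.sum()
    if total < eps:
        return features.mean(axis=(0, 1))
    return (features * weights).sum(axis=(0, 1)) / total


# Toy example: a 4x4 scene where a 2x2 block changes between the two dates.
before = np.zeros((4, 4, 3))
after = before.copy()
after[1:3, 1:3] = 1.0

clip = make_clip(before, after)

# Toy mask from pixel differences (an assumption; a real mask would
# typically come from a change detection model).
mask = (np.abs(after - before).sum(axis=-1) > 0).astype(float)

# Here the raw "after" pixels stand in for encoder features.
pooled = mask_weighted_pool(after, mask)
```

In this toy setup the clip has a leading time axis of length 2, and the pooled vector reflects only the changed block, whereas an unweighted mean would be diluted by the unchanged background.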
Taxonomy
Topics: Advanced Vision and Imaging · Advanced Image and Video Retrieval Techniques · Satellite Image Processing and Photogrammetry
Methods: Sparse Evolutionary Training · Focus
