A Lightweight Sparse Focus Transformer for Remote Sensing Image Change Captioning

Dongwei Sun; Yajie Bao; Junmin Liu; Xiangyong Cao

arXiv:2405.06598 · cs.CV · October 14, 2024 · 2 cites

TL;DR

This paper introduces a lightweight Sparse Focus Transformer for remote sensing image change captioning, significantly reducing parameters and complexity while maintaining competitive performance.

Contribution

It proposes a sparse focus attention mechanism within a transformer encoder to efficiently capture change regions in remote sensing images.

Findings

01. Reduces transformer encoder parameters by over 90%.

02. Maintains competitive captioning performance.

03. Demonstrates effectiveness across various datasets.

Abstract

Remote sensing image change captioning (RSICC) aims to automatically generate sentences that describe content differences between bitemporal remote sensing images. Attention-based transformers have recently become a prevalent approach for capturing global change features. However, existing transformer-based RSICC methods suffer from large parameter counts and high computational complexity, both caused by the self-attention operation in the transformer encoder. To alleviate these issues, this paper proposes a Sparse Focus Transformer (SFT) for the RSICC task. Specifically, the SFT network consists of three main components: a high-level feature extractor based on a convolutional neural network (CNN), a transformer encoder built on a sparse focus attention mechanism and designed to locate and capture changing regions in dual-temporal images, and a description decoder that…
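To make the complexity concern concrete, the following is a minimal sketch that counts query-key score computations for dense self-attention and for one simple sparse pattern. The sparse pattern here (each token attends only to a square neighborhood on the feature grid) is an assumption chosen for illustration; the abstract does not specify how sparse focus attention selects positions, so this is not the authors' mechanism. The function names, the 14 x 14 grid, and the radius of 2 are likewise hypothetical.

```python
def full_attention_pairs(n_tokens: int) -> int:
    """Query-key pairs scored by dense self-attention over n_tokens tokens."""
    return n_tokens * n_tokens


def windowed_attention_pairs(height: int, width: int, radius: int) -> int:
    """Query-key pairs when each token on a height x width grid attends only
    to tokens within `radius` rows and columns of itself (clipped at borders).

    ASSUMPTION: a local square window is a stand-in sparse pattern, not the
    paper's sparse focus attention.
    """
    total = 0
    for r in range(height):
        # Rows reachable from row r after clipping to the grid.
        rows = min(height - 1, r + radius) - max(0, r - radius) + 1
        for c in range(width):
            # Columns reachable from column c after clipping to the grid.
            cols = min(width - 1, c + radius) - max(0, c - radius) + 1
            total += rows * cols
    return total


if __name__ == "__main__":
    h, w, radius = 14, 14, 2  # hypothetical feature-map size and window radius
    dense = full_attention_pairs(h * w)
    sparse = windowed_attention_pairs(h, w, radius)
    print(dense)   # 38416
    print(sparse)  # 4096
    print(f"sparse / dense = {sparse / dense:.3f}")
```

Dense attention grows quadratically with the number of tokens, while the windowed count grows roughly linearly for a fixed radius. This only illustrates why restricting attention lowers cost; it says nothing about the parameter reduction reported for SFT.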

Code & Models

Repositories

sundongwei/sft_chag2cap (PyTorch, official)

Taxonomy

Topics: Advanced Vision and Imaging · Advanced Image and Video Retrieval Techniques · Image Enhancement Techniques

Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer · Dropout · Label Smoothing · Residual Connection · Softmax · Absolute Position Encodings · Byte Pair Encoding