SEMT: Static-Expansion-Mesh Transformer Network Architecture for Remote Sensing Image Captioning

Khang Truong; Lam Pham; Hieu Tang; Jasmin Lampert; Martin Boyer; Son Phan; Truong Nguyen

arXiv:2507.12845 · cs.CV · July 18, 2025

TL;DR

This paper introduces SEMT, a transformer-based architecture for remote sensing image captioning (RSIC) that integrates Static Expansion, Memory-Augmented Self-Attention, and Mesh Transformer techniques. The best model outperforms state-of-the-art systems on most evaluation metrics on the UCM-Caption and NWPU-Caption benchmarks.

Contribution

The paper proposes a transformer-based architecture for remote sensing image captioning that evaluates and integrates Static Expansion, Memory-Augmented Self-Attention, and Mesh Transformer, and reports improved results over existing methods on two benchmark datasets.

Findings

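01. Outperforms state-of-the-art systems on most evaluation metrics on the UCM-Caption and NWPU-Caption datasets

02. Effective integration of Static Expansion, Memory-Augmented Self-Attention, and Mesh Transformer

03. Potential for real-world remote sensing applications such as environmental monitoring, disaster assessment, and urban planning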

Abstract

Image captioning has emerged as a crucial task at the intersection of computer vision and natural language processing, enabling automated generation of descriptive text from visual content. In remote sensing, image captioning plays a significant role in interpreting vast and complex satellite imagery, aiding applications such as environmental monitoring, disaster assessment, and urban planning. This motivates us to present a transformer-based network architecture for remote sensing image captioning (RSIC) in which three techniques, Static Expansion, Memory-Augmented Self-Attention, and Mesh Transformer, are evaluated and integrated. We evaluate our proposed models on two benchmark remote sensing image datasets, UCM-Caption and NWPU-Caption. Our best model outperforms the state-of-the-art systems on most evaluation metrics, which demonstrates…

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Image Enhancement Techniques