A Novel Lightweight Transformer with Edge-Aware Fusion for Remote Sensing Image Captioning

Swadhin Das; Divyansh Mundra; Priyanshu Dayal; Raksha Sharma

arXiv:2506.09429·cs.CV·June 12, 2025

TL;DR

This paper introduces a lightweight transformer model with edge-aware fusion and knowledge distillation for remote sensing image captioning, achieving high-quality captions with reduced computational costs.

Contribution

It proposes a novel lightweight transformer architecture with edge-aware enhancement and knowledge distillation, improving captioning performance while reducing complexity.

Findings

01 Significant improvement in caption quality over state-of-the-art methods

02 Reduced computational costs due to lightweight design

03 Enhanced boundary and structural feature representation

Abstract

Transformer-based models have achieved strong performance in remote sensing image captioning by capturing long-range dependencies and contextual information. However, their practical deployment is hindered by high computational costs, especially in multi-modal frameworks that employ separate transformer-based encoders and decoders. In addition, existing remote sensing image captioning models primarily focus on high-level semantic extraction while often overlooking fine-grained structural features such as edges, contours, and object boundaries. To address these challenges, a lightweight transformer architecture is proposed by reducing the dimensionality of the encoder layers and employing a distilled version of GPT-2 as the decoder. A knowledge distillation strategy is used to transfer knowledge from a more complex teacher model to improve the performance of the lightweight network.…
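The knowledge distillation step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact loss the authors use, so the code below assumes the common temperature-scaled formulation: a KL divergence between softened teacher and student token distributions, blended with the usual cross-entropy on the ground-truth token. The function names and the `temperature` and `alpha` parameters are hypothetical and chosen only for illustration.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of a list of logits at a given temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_z for z in scaled]


def distillation_loss(student_logits, teacher_logits, target_index,
                      temperature=2.0, alpha=0.5):
    """Assumed distillation objective for one token position (not the paper's confirmed loss).

    soft term: KL(teacher || student) on temperature-softened distributions,
               scaled by temperature**2 so its gradient magnitude stays comparable
    hard term: cross-entropy of the student against the ground-truth token
    """
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lt) * (lt - ls)
             for lt, ls in zip(log_p_teacher, log_p_student))
    cross_entropy = -log_softmax(student_logits)[target_index]
    return alpha * temperature ** 2 * kl + (1.0 - alpha) * cross_entropy


if __name__ == "__main__":
    teacher = [1.0, 2.0, 3.0]   # hypothetical teacher logits over a 3-token vocabulary
    student = [0.5, 1.5, 3.5]   # hypothetical student logits for the same position
    print(distillation_loss(student, teacher, target_index=2))
```

With `alpha=1.0` only the soft term remains, and it vanishes when the student reproduces the teacher's logits exactly; with `alpha=0.0` the function reduces to ordinary cross-entropy training of the lightweight model.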

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Generative Adversarial Networks and Image Synthesis

Methods: Linear Layer · Softmax · Layer Normalization · Adam · Cosine Annealing · Byte Pair Encoding · Attention Is All You Need · Linear Warmup With Cosine Annealing · Focus