A Novel Lightweight Transformer with Edge-Aware Fusion for Remote Sensing Image Captioning
Swadhin Das, Divyansh Mundra, Priyanshu Dayal, Raksha Sharma

TL;DR
This paper introduces a lightweight transformer model with edge-aware fusion and knowledge distillation for remote sensing image captioning, achieving high-quality captions with reduced computational costs.
Contribution
It proposes a novel lightweight transformer architecture with edge-aware enhancement and knowledge distillation, improving captioning performance while reducing complexity.
Findings
Significant improvement in caption quality over state-of-the-art methods
Reduced computational costs due to lightweight design
Enhanced boundary and structural feature representation
Abstract
Transformer-based models have achieved strong performance in remote sensing image captioning by capturing long-range dependencies and contextual information. However, their practical deployment is hindered by high computational costs, especially in multi-modal frameworks that employ separate transformer-based encoders and decoders. In addition, existing remote sensing image captioning models primarily focus on high-level semantic extraction while often overlooking fine-grained structural features such as edges, contours, and object boundaries. To address these challenges, a lightweight transformer architecture is proposed by reducing the dimensionality of the encoder layers and employing a distilled version of GPT-2 as the decoder. A knowledge distillation strategy is used to transfer knowledge from a more complex teacher model to improve the performance of the lightweight network.…
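The knowledge distillation step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's exact loss, so this assumes the common soft-target formulation: a temperature-scaled KL divergence between teacher and student token distributions, blended with the usual cross-entropy on the ground-truth token. The function names, `temperature`, and `alpha` are hypothetical and chosen only for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def distillation_loss(student_logits, teacher_logits, target_index,
                      temperature=2.0, alpha=0.5):
    """Blend a soft-target term (teacher -> student) with hard-label cross-entropy.

    Assumption: this is the generic soft-target recipe, not necessarily
    the loss used in the paper.
    """
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    # The T^2 factor keeps the soft term's gradient scale comparable across temperatures.
    soft_term = (temperature ** 2) * kl_divergence(soft_teacher, soft_student)
    # Standard cross-entropy against the ground-truth token at temperature 1.
    hard_probs = softmax(student_logits)
    hard_term = -math.log(hard_probs[target_index] + 1e-12)
    return alpha * soft_term + (1 - alpha) * hard_term

# Toy vocabulary of three tokens; the ground-truth token is index 0.
teacher = [2.0, 1.0, 0.1]
student = [1.5, 1.2, 0.3]
loss = distillation_loss(student, teacher, target_index=0)
```

In a captioning setting this loss would be computed per generated token and averaged over the sequence; a larger `alpha` makes the student follow the teacher's distribution more closely, while a smaller one weights the reference caption more heavily.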
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Generative Adversarial Networks and Image Synthesis
Methods: Linear Layer · Softmax · Layer Normalization · Adam · Cosine Annealing · Byte Pair Encoding · Attention Is All You Need · Linear Warmup With Cosine Annealing · Focus
