MsEdF: A Multi-stream Encoder-decoder Framework for Remote Sensing Image Captioning
Swadhin Das, Raksha Sharma

TL;DR
This paper introduces MsEdF, a multi-stream encoder-decoder framework for remote sensing image captioning that enhances feature diversity and semantic modeling to improve descriptive accuracy.
Contribution
The multi-stream architecture fuses diverse spatial features and refines semantic context modeling, improving remote sensing image captioning (RSIC) performance over single-stream methods.
Findings
MsEdF outperforms baseline models on three benchmark datasets.
Fusing multiscale and structural cues enhances feature diversity.
Refined semantic modeling improves caption accuracy.
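As a rough illustration of what fusing two feature streams can look like, the sketch below normalizes each stream's per-region features and concatenates them. This summary does not say how MsEdF performs fusion, so concatenation, the helper names (`l2_normalize`, `fuse_streams`), and the toy feature values are all assumptions made for illustration, not the paper's method.

```python
from typing import List

Vector = List[float]


def l2_normalize(v: Vector) -> Vector:
    """Scale a vector to unit L2 norm (returned unchanged if it is all zeros)."""
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v] if norm > 0 else list(v)


def fuse_streams(multiscale: List[Vector], structural: List[Vector]) -> List[Vector]:
    """Fuse two per-region feature streams by normalizing and concatenating.

    Concatenation is an assumption for illustration; the summary does not
    specify the fusion operator used by MsEdF.
    """
    if len(multiscale) != len(structural):
        raise ValueError("both streams must describe the same number of regions")
    # Normalizing first keeps one stream from dominating purely by scale.
    return [l2_normalize(a) + l2_normalize(b) for a, b in zip(multiscale, structural)]


# Hypothetical toy input: two image regions, 2-d and 3-d features per stream.
multiscale_feats = [[3.0, 4.0], [1.0, 0.0]]
structural_feats = [[0.0, 2.0, 0.0], [0.0, 0.0, 5.0]]
fused = fuse_streams(multiscale_feats, structural_feats)
```

Each fused region vector has the combined dimensionality of both streams (here 2 + 3 = 5), which a decoder could then attend over; a learned projection or attention-based fusion would be equally plausible alternatives.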
Abstract
Remote sensing images contain complex spatial patterns and semantic structures, which make them difficult for captioning models to describe accurately. Encoder-decoder architectures have become a widely used approach to remote sensing image captioning (RSIC), translating visual content into descriptive text. However, many existing methods rely on a single-stream architecture, which limits the model's ability to describe the image accurately. Such single-stream architectures typically struggle to extract diverse spatial features or capture complex semantic relationships, limiting their effectiveness in scenes with high intraclass similarity or contextual ambiguity. In this work, we propose a novel Multi-stream Encoder-decoder Framework (MsEdF), which improves RSIC performance by optimizing both the spatial representation and the language generation of the encoder-decoder architecture. The encoder fuses information from two…
