TL;DR
This paper introduces ReconScore, a reference-free evaluation metric for remote sensing image captioning. Measured with ReconScore, powerful models without fine-tuning outperform fine-tuned ones in zero-shot tasks. The paper also proposes RemoteDescriber, a training-free captioning method.
Contribution
The paper presents ReconScore, a novel reference-free evaluation metric that reduces human annotation bias, and introduces RemoteDescriber, a training-free captioning approach that uses ReconScore for self-correction.
Findings
Unfine-tuned models outperform fine-tuned models in zero-shot Remote Sensing Image Captioning (RSIC) tasks.
ReconScore effectively evaluates caption quality without reference texts.
RemoteDescriber achieves state-of-the-art results on multiple datasets.
Abstract
The core objective of image captioning is to achieve lossless semantic compression from visual signals into textual modalities. However, the reliance on manually curated reference texts for evaluation essentially forces models to mimic specific human annotation styles, thereby masking the true descriptive capabilities of advanced foundation models. This systemic misalignment prompts a critical question: Is task-specific fine-tuning truly necessary for Remote Sensing Image Captioning, or is the perceived performance gap merely an artifact of flawed evaluation criteria? To investigate this discrepancy, we propose ReconScore, a novel reference-free evaluation metric. Rather than computing textual similarities, we assess caption quality by its capability to reconstruct the original visual elements solely from the generated text, effectively neutralizing human annotation biases. Applying…
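The reconstruction idea in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the paper's implementation: the abstract does not say which model reconstructs visual content from text or how the reconstruction is compared with the original, so the sketch takes a hypothetical `reconstruct` callable and uses cosine similarity over feature vectors as a stand-in comparison. The candidate-selection helper is likewise only a loose stand-in for the self-correction step, whose actual procedure is not described here.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recon_score_sketch(
    image_features: Vector,
    caption: str,
    reconstruct: Callable[[str], Vector],
) -> float:
    """Score a caption by how well the original image features can be
    recovered from the caption text alone. No reference caption is used.

    `reconstruct` is a hypothetical text-to-features function (assumption).
    """
    return cosine_similarity(image_features, reconstruct(caption))


def pick_best_caption(
    image_features: Vector,
    candidates: Sequence[str],
    reconstruct: Callable[[str], Vector],
) -> str:
    """Hypothetical selection step: keep the candidate with the highest score."""
    return max(
        candidates,
        key=lambda c: recon_score_sketch(image_features, c, reconstruct),
    )


# Toy stand-ins so the sketch runs without any model (all assumptions).
VOCAB = ("airport", "runway", "river")


def toy_reconstruct(caption: str) -> Vector:
    """Mark which vocabulary words the caption mentions."""
    words = caption.lower().split()
    return [1.0 if term in words else 0.0 for term in VOCAB]


# Pretend the image contains an airport and a runway, but no river.
image_features = [1.0, 1.0, 0.0]
candidates = ["a river", "an airport", "an airport with a runway"]
best = pick_best_caption(image_features, candidates, toy_reconstruct)
```

In this toy setup the caption that mentions both objects present in the image scores highest, while a caption describing absent content scores zero. In practice `reconstruct` would be a generative or retrieval model and the feature vectors would come from a visual encoder; both choices are left open here.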
