Sentinel2Cap: A Human-Annotated Benchmark Dataset for Multimodal Remote Sensing Image Captioning
Lucrezia Tosato, Gianluca Lombardi, Ronny Hansch

TL;DR
Sentinel2Cap is a new human-annotated multimodal remote sensing dataset for image captioning, including SAR and multi-spectral images, to advance cross-modal scene understanding.
Contribution
The paper introduces Sentinel2Cap, the first comprehensive human-annotated dataset for multimodal satellite image captioning across RGB, multi-spectral, and SAR modalities.
Findings
RGB images yield the best captioning performance.
SAR images are more challenging for vision-language models.
Modality-specific prompts improve captioning accuracy.
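As a rough illustration of the last finding, the sketch below shows one way a modality-specific prompt could be selected before querying a vision-language model. The prompt wording, the `MODALITY_PROMPTS` table, and the `build_prompt` helper are all hypothetical; the paper's actual prompts are not given in this summary.

```python
# Hypothetical prompt templates: the wording is illustrative only and is
# NOT taken from the paper.
MODALITY_PROMPTS = {
    "rgb": (
        "This is a Sentinel-2 true-colour (RGB) image patch. "
        "Describe the land cover visible in the scene."
    ),
    "multispectral": (
        "This is a Sentinel-2 false-colour composite built from "
        "multi-spectral bands, so colours do not match natural colours. "
        "Describe the land cover visible in the scene."
    ),
    "sar": (
        "This is a Sentinel-1 SAR pseudo-RGB composite. Brightness reflects "
        "radar backscatter, not visible colour. "
        "Describe the land cover visible in the scene."
    ),
}

# Baseline prompt that carries no information about the sensor.
GENERIC_PROMPT = "Describe this satellite image."


def build_prompt(modality: str, modality_specific: bool = True) -> str:
    """Return the captioning prompt for a given image modality.

    With modality_specific=False the generic baseline prompt is returned,
    which makes it easy to compare both prompting strategies.
    """
    if not modality_specific:
        return GENERIC_PROMPT
    key = modality.strip().lower()
    if key not in MODALITY_PROMPTS:
        raise ValueError(f"Unknown modality: {modality!r}")
    return MODALITY_PROMPTS[key]
```

Keeping the generic prompt alongside the modality-specific ones makes the comparison a one-flag change, so the same image can be captioned under both conditions.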
Abstract
Image captioning has become an important task in computer vision, enabling models to generate natural language descriptions of visual content. While several datasets exist for natural images and high-resolution optical remote sensing imagery, captioning datasets for multimodal satellite data remain scarce, particularly for SAR imagery and medium-resolution sensors. We introduce Sentinel2Cap, a human-annotated multimodal captioning dataset containing Sentinel-1 SAR and Sentinel-2 multi-spectral image patches at 10 m and 20 m spatial resolution with diverse land cover compositions. Captions are created manually and carefully validated to ensure both semantic accuracy and linguistic quality. To evaluate Sentinel2Cap, we perform zero-shot captioning using the Qwen3-VL-8B-Instruct model across three image modalities: RGB, multi-spectral, and SAR pseudo-RGB…
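The abstract does not say how the SAR pseudo-RGB images are built. A minimal sketch of one common convention, assuming dual-polarisation Sentinel-1 backscatter in dB, maps VV to red, VH to green, and the VV/VH ratio (a difference in dB) to blue. The channel assignment, the 2nd to 98th percentile stretch, and the function names are assumptions made for illustration, not the authors' pipeline.

```python
import numpy as np


def percentile_stretch(band, low=2.0, high=98.0):
    """Linearly rescale a band to [0, 1] between two percentiles, clipping outliers."""
    lo, hi = np.percentile(band, [low, high])
    if hi <= lo:
        # Constant band: no contrast to stretch, return all zeros.
        return np.zeros_like(band, dtype=np.float32)
    return np.clip((band - lo) / (hi - lo), 0.0, 1.0).astype(np.float32)


def sar_pseudo_rgb(vv_db, vh_db):
    """Build an H x W x 3 uint8 composite from VV and VH backscatter in dB.

    Assumed channel mapping (not confirmed by the paper):
    R = VV, G = VH, B = VV - VH (the VV/VH ratio expressed in dB).
    """
    vv_db = np.asarray(vv_db, dtype=np.float32)
    vh_db = np.asarray(vh_db, dtype=np.float32)
    if vv_db.shape != vh_db.shape:
        raise ValueError("VV and VH must have the same shape")
    channels = [
        percentile_stretch(vv_db),
        percentile_stretch(vh_db),
        percentile_stretch(vv_db - vh_db),
    ]
    # Stack along the last axis and convert to 8-bit for image models.
    return (np.stack(channels, axis=-1) * 255.0).round().astype(np.uint8)
```

The percentile stretch is there because SAR backscatter has a heavy-tailed distribution; a plain min-max scaling would let a few very bright scatterers compress the rest of the scene into a narrow range.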
