TL;DR
SARVLM is a vision-language foundation model designed for semantic understanding of SAR imagery. It is trained on SARVLM-1M, a dataset of over one million image-text pairs, using a two-stage domain transfer strategy that bridges natural images and SAR through optical remote sensing data.
Contribution
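The paper introduces SARVLM, the first SAR-specific vision-language model, together with SARVLM-1M, a large-scale dataset of over one million image-text pairs, and a two-stage domain transfer training strategy that moves from natural images to SAR via optical remote sensing data.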
Findings
Outperforms state-of-the-art vision-language models on 13 benchmarks.
Demonstrates strong capabilities in image-text retrieval, object detection, and zero-shot classification.
Validates effectiveness through extensive experiments across diverse tasks.
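As an illustration of what zero-shot classification and image-text retrieval usually involve in a vision-language model, here is a minimal sketch. It assumes a CLIP-style setup in which images and class prompts are embedded into a shared space and ranked by cosine similarity; the summary above does not state that SARVLM works this way. The function names and the three-dimensional toy embeddings are hypothetical and stand in for the outputs of real image and text encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def zero_shot_classify(image_embedding, class_text_embeddings):
    """Pick the class whose text embedding is closest to the image embedding.

    class_text_embeddings maps a label to the embedding of a prompt for that
    label (hypothetical; a real system would obtain it from a text encoder).
    """
    scores = {
        label: cosine_similarity(image_embedding, emb)
        for label, emb in class_text_embeddings.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Toy, made-up embeddings; not taken from SARVLM or any real encoder.
text_embeddings = {
    "ship": [0.9, 0.1, 0.0],
    "bridge": [0.1, 0.9, 0.1],
    "airplane": [0.0, 0.2, 0.9],
}
image_embedding = [0.8, 0.2, 0.1]

label, scores = zero_shot_classify(image_embedding, text_embeddings)
print(label)
```

Under the same assumption, image-text retrieval uses the same scoring: candidate captions (or images) are ranked by similarity to the query embedding rather than reduced to a single best label.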
Abstract
Synthetic Aperture Radar (SAR) is a critical imaging modality due to its all-weather operational capability. Although recent advances in self-supervised learning and masked image modeling (MIM) have enabled SAR foundation models, these approaches primarily focus on low-level visual features and often neglect multi-modal representation. Moreover, multimodal data for SAR is scarce, limiting the development of robust cross-modal models. To address this limitation, we construct SARVLM-1M, a large-scale vision-language dataset comprising over one million image-text pairs aggregated from existing datasets. Furthermore, to mitigate the substantial differences between SAR and natural imagery, we propose a two-stage domain transfer training strategy that leverages optical remote sensing data as an intermediate bridge, facilitating effective knowledge transfer from natural images to SAR domains.…
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Advanced SAR Imaging Techniques · Multimodal Machine Learning Applications
