FUSAR-GPT: A Spatiotemporal Feature-Embedded and Two-Stage Decoupled Visual Language Model for SAR Imagery
Xiaokun Zhang, Yi Yang, Ziqi Ye, Baiyun, Xiaorong Guo, Qingchen Fang, Ruyi Zhang, Xinpeng Zhou, Haipeng Wang

TL;DR
FUSAR-GPT is a specialized visual language model for SAR imagery that incorporates geospatial and spatiotemporal features, achieving state-of-the-art performance in remote sensing tasks.
Contribution
The paper introduces FUSAR-GPT, a SAR-specific VLM that combines spatiotemporal feature embedding with a two-stage decoupling strategy to address the poor performance of RGB-trained VLMs when applied directly to SAR imagery.
Findings
FUSAR-GPT outperforms baseline models by over 10% on remote sensing benchmarks.
Constructed the first SAR Image-Text-AlphaEarth feature triplet dataset.
Achieved state-of-the-art results in SAR visual-language understanding.
Abstract
Research on the intelligent interpretation of all-weather, all-time Synthetic Aperture Radar (SAR) imagery is crucial for advancing remote sensing applications. In recent years, although Visual Language Models (VLMs) have demonstrated strong open-world understanding capabilities on RGB images, their performance is severely limited when directly applied to the SAR field due to the complexity of the imaging mechanism, sensitivity to scattering features, and the scarcity of high-quality text corpora. To systematically address this issue, we constructed the inaugural SAR Image-Text-AlphaEarth feature triplet dataset and developed FUSAR-GPT, a VLM designed specifically for SAR. FUSAR-GPT innovatively introduces a geospatial baseline model as a 'world knowledge' prior and embeds multi-source remote-sensing temporal features into the model's visual backbone via 'spatiotemporal anchors', enabling dynamic…
