A Visual Question Answering Method for SAR Ship: Breaking the Requirement for Multimodal Dataset Construction and Model Fine-Tuning
Fei Wang, Chengcheng Chen, Hongyu Chen, Yugang Chang, and Weiming Zeng

TL;DR
This paper introduces a novel SAR ship VQA method that combines object detection with visual language models, eliminating the need for multimodal dataset construction and fine-tuning, thus enabling effective scene analysis and multi-turn dialogue in SAR imagery.
Contribution
The proposed approach integrates object detection networks with visual language models for SAR ship analysis, avoids multimodal dataset construction and model fine-tuning, and supports complex multi-turn question answering.
Findings
YOLOv8n achieved the highest detection accuracy among the YOLO networks evaluated on the SSDD and HRSID SAR ship datasets
The method enables SAR scene question-answering without additional datasets or fine-tuning
The system demonstrates robust semantic understanding and multi-turn dialogue capabilities
Abstract
Current visual question answering (VQA) tasks often require constructing multimodal datasets and fine-tuning visual language models, which demands significant time and resources. This has greatly hindered the application of VQA to downstream tasks, such as ship information analysis based on Synthetic Aperture Radar (SAR) imagery. To address this challenge, this letter proposes a novel VQA approach that integrates object detection networks with visual language models, specifically designed for analyzing ships in SAR images. This integration aims to enhance the capabilities of VQA systems, focusing on aspects such as ship location, density, and size analysis, as well as risk behavior detection. Initially, we conducted baseline experiments using YOLO networks on two representative SAR ship detection datasets, SSDD and HRSID, to assess each model's performance in terms of detection…
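The integration described in the abstract can be pictured as a two-stage pipeline: a detector produces ship bounding boxes, and those boxes are summarized as text that a visual language model can reason over without any fine-tuning. The sketch below illustrates only the second stage, under stated assumptions. The `Detection` type, the `region_of` and `build_prompt` helpers, the three-by-three region grid, and the ships-per-megapixel density measure are all hypothetical choices made for illustration; the page does not give the paper's actual prompt format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Detection:
    """One detected ship: pixel-coordinate box corners plus detector confidence."""
    x1: float
    y1: float
    x2: float
    y2: float
    conf: float


def region_of(det: Detection, width: int, height: int) -> str:
    """Name the image third (e.g. 'upper-left') that contains the box center."""
    cx = (det.x1 + det.x2) / 2
    cy = (det.y1 + det.y2) / 2
    col = ["left", "center", "right"][min(int(3 * cx / width), 2)]
    row = ["upper", "middle", "lower"][min(int(3 * cy / height), 2)]
    return f"{row}-{col}"


def build_prompt(dets: List[Detection], width: int, height: int, question: str) -> str:
    """Turn detections into a plain-text scene description followed by the question."""
    lines = [f"SAR image of {width}x{height} pixels with {len(dets)} detected ship(s)."]
    for i, d in enumerate(dets, start=1):
        w, h = d.x2 - d.x1, d.y2 - d.y1
        lines.append(
            f"Ship {i}: {region_of(d, width, height)} region, "
            f"box {w:.0f}x{h:.0f} px, confidence {d.conf:.2f}."
        )
    # Assumed density measure: ships per megapixel of image area.
    density = len(dets) / (width * height / 1e6)
    lines.append(f"Ship density: {density:.2f} per megapixel.")
    lines.append(f"Question: {question}")
    return "\n".join(lines)


# Hypothetical detections standing in for real detector output.
dets = [Detection(10, 20, 60, 50, 0.91), Detection(400, 380, 460, 420, 0.78)]
print(build_prompt(dets, 512, 512, "Are any ships close to each other?"))
```

In a multi-turn setting, the same scene description could be kept as context while only the question changes between turns, which is one plausible way to support dialogue without retraining the language model.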
Taxonomy
Topics: Multimodal Machine Learning Applications · Geographic Information Systems Studies · Speech and dialogue systems
