A Visual Question Answering Method for SAR Ship: Breaking the Requirement for Multimodal Dataset Construction and Model Fine-Tuning

Fei Wang, Chengcheng Chen, Hongyu Chen, Yugang Chang, and Weiming Zeng

arXiv:2411.01445·cs.CV·November 5, 2024

Open Access

TL;DR

This paper introduces a novel SAR ship VQA method that combines object detection with visual language models, eliminating the need for multimodal dataset construction and fine-tuning, thus enabling effective scene analysis and multi-turn dialogue in SAR imagery.

Contribution

The proposed approach integrates object detection and vision-language models for SAR ship analysis, avoiding dataset creation and fine-tuning, and supports complex multi-turn question answering.

Findings

01 YOLOv8n achieved the highest detection accuracy among the YOLO models in the baseline experiments on SAR ship detection datasets.

02 The method enables question answering on SAR scenes without constructing additional multimodal datasets or fine-tuning the model.

03 The system demonstrates robust semantic understanding and multi-turn dialogue capabilities.

Abstract

Current visual question answering (VQA) tasks often require constructing multimodal datasets and fine-tuning visual language models, which demands significant time and resources. This has greatly hindered the application of VQA to downstream tasks, such as ship information analysis based on Synthetic Aperture Radar (SAR) imagery. To address this challenge, this letter proposes a novel VQA approach that integrates object detection networks with visual language models, specifically designed for analyzing ships in SAR images. This integration aims to enhance the capabilities of VQA systems, focusing on aspects such as ship location, density, and size analysis, as well as risk behavior detection. Initially, we conducted baseline experiments using YOLO networks on two representative SAR ship detection datasets, SSDD and HRSID, to assess each model's performance in terms of detection…
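The integration described above, in which a detector's output is handed to a visual language model as text, can be illustrated with a minimal sketch. It is not the authors' implementation: the `Detection` structure, the 3x3 region labels, the prompt wording, and the function names are all assumptions made for illustration, and the call to an actual detector or language model is deliberately left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Detection:
    """One detected ship as a pixel-space box (hypothetical structure)."""
    x1: float
    y1: float
    x2: float
    y2: float
    score: float


def region_of(cx: float, cy: float, width: int, height: int) -> str:
    """Map a box centre to a coarse 3x3 grid label such as 'top-left'."""
    cols = ["left", "center", "right"]
    rows = ["top", "middle", "bottom"]
    col = cols[min(int(3 * cx / width), 2)]
    row = rows[min(int(3 * cy / height), 2)]
    return f"{row}-{col}"


def describe(dets: List[Detection], width: int, height: int) -> str:
    """Turn detections into plain text covering count, location, and size."""
    lines = [f"Image size: {width}x{height} pixels. Ships detected: {len(dets)}."]
    for i, d in enumerate(dets, start=1):
        cx, cy = (d.x1 + d.x2) / 2, (d.y1 + d.y2) / 2
        # Relative size: box area as a percentage of the image area.
        area_pct = 100 * (d.x2 - d.x1) * (d.y2 - d.y1) / (width * height)
        lines.append(
            f"Ship {i}: region {region_of(cx, cy, width, height)}, "
            f"box ({d.x1:.0f}, {d.y1:.0f}, {d.x2:.0f}, {d.y2:.0f}), "
            f"{area_pct:.2f}% of image, confidence {d.score:.2f}."
        )
    return "\n".join(lines)


def build_prompt(dets: List[Detection], width: int, height: int, question: str) -> str:
    """Compose the text sent to a language model (assumed prompt wording)."""
    return (
        "You are analysing a SAR image. A detector reported the following ships.\n"
        + describe(dets, width, height)
        + "\n\nQuestion: "
        + question
    )


if __name__ == "__main__":
    dets = [Detection(10, 10, 30, 20, 0.9), Detection(250, 260, 290, 290, 0.8)]
    print(build_prompt(dets, 300, 300, "How many ships are there?"))
```

Because the detections reach the language model as ordinary text, follow-up questions in a multi-turn dialogue can reuse the same description without retraining either component, which is the property the abstract emphasizes.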

Taxonomy

Topics: Multimodal Machine Learning Applications · Geographic Information Systems Studies · Speech and dialogue systems