Text-Guided Coarse-to-Fine Fusion Network for Robust Remote Sensing Visual Question Answering

Zhicheng Zhao; Changfu Zhou; Yu Zhang; Chenglong Li; Xiaoliang Ma; Jin Tang

arXiv:2411.15770·cs.CV·January 14, 2025

Open Access

TL;DR

This paper introduces TGFNet, a novel network that fuses optical and SAR remote sensing images guided by question semantics, significantly improving RSVQA performance under challenging conditions.

Contribution

The work presents a new text-guided coarse-to-fine attention mechanism and an adaptive multi-expert fusion module, along with the first large-scale optical-SAR RSVQA dataset.

Findings

1. TGFNet outperforms existing methods in challenging scenarios.
2. The proposed modules effectively focus on relevant image regions.
3. The dataset enables comprehensive evaluation of optical-SAR RSVQA models.

Abstract

Remote Sensing Visual Question Answering (RSVQA) has gained significant research interest. However, current RSVQA methods are limited by the imaging mechanisms of optical sensors, particularly under challenging conditions such as cloud-covered and low-light scenarios. Given the all-time and all-weather imaging capabilities of Synthetic Aperture Radar (SAR), it is crucial to investigate the integration of optical-SAR images to improve RSVQA performance. In this work, we propose a Text-guided Coarse-to-Fine Fusion Network (TGFNet), which leverages the semantic relationships between question text and multi-source images to guide the network toward complementary fusion at the feature level. Specifically, we develop a Text-guided Coarse-to-Fine Attention Refinement (CFAR) module to focus on key areas related to the question in complex remote sensing images. This module progressively directs…

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques

Methods: Softmax · Attention Is All You Need · Focus