Visual Question Answering on Multiple Remote Sensing Image Modalities

Hichem Boussaid; Lucrezia Tosato; Flora Weissgerber; Camille Kurtz; Laurent Wendling; Sylvain Lobry

arXiv:2505.15401 · cs.CV · May 22, 2025

TL;DR

This paper introduces a new multi-modal VQA dataset and model for remote sensing images, leveraging diverse spectral, spatial, and contextual data to improve scene understanding and question answering accuracy.

Contribution

It presents a novel multi-modal VQA dataset (TAMMI) and a transformer-based model (MM-RSVQA) for remote sensing, enabling effective fusion of multiple image modalities.

Findings

01. MM-RSVQA achieved 65.56% accuracy on the VQA task.

02. Demonstrated that fusing multiple image modalities is feasible for remote sensing VQA.

03. Provided a flexible dataset pipeline for future research.

Abstract

The extraction of visual features is an essential step in Visual Question Answering (VQA). Building a good visual representation of the analyzed scene is key for the system to understand that scene correctly and answer complex questions. In many fields, such as remote sensing, the visual feature extraction step could benefit significantly from different image modalities carrying complementary spectral, spatial and contextual information. In this work, we propose to add multiple image modalities to VQA in the particular context of remote sensing, leading to a novel task for the computer vision community. To this end, we introduce a new VQA dataset, named TAMMI (Text and Multi-Modal Imagery), with diverse questions on scenes described by three different modalities (very high resolution RGB, multi-spectral imaging data and synthetic…

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Multimodal Machine Learning Applications

Methods: VisualBERT