Visual Question Answering on Multiple Remote Sensing Image Modalities
Hichem Boussaid, Lucrezia Tosato, Flora Weissgerber, Camille Kurtz, Laurent Wendling, Sylvain Lobry

TL;DR
This paper introduces a new multi-modal VQA dataset and model for remote sensing images, leveraging diverse spectral, spatial, and contextual data to improve scene understanding and question answering accuracy.
Contribution
It presents a novel multi-modal VQA dataset (TAMMI) and a transformer-based model (MM-RSVQA) for remote sensing, enabling effective fusion of multiple image modalities.
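As a rough illustration of what fusing several image modalities in a single transformer input can look like, the sketch below builds one token sequence from a question and per-modality visual tokens, tagging each part with a segment id. This is a VisualBERT-style single-stream layout chosen for illustration only; it is not confirmed to be the MM-RSVQA architecture. The function name, the `[CLS]`/`[SEP]` markers, the placeholder string tokens, and the modality keys are all assumptions (the third modality's name is truncated in the abstract, so it is left as a generic placeholder).

```python
from typing import Dict, List, Tuple


def build_fused_sequence(
    question_tokens: List[str],
    modality_tokens: Dict[str, List[str]],
) -> Tuple[List[str], List[int]]:
    """Concatenate question tokens and per-modality visual tokens.

    Hypothetical helper: segment id 0 marks the question, and ids 1..N mark
    the image modalities in the order they appear in `modality_tokens`.
    """
    tokens = ["[CLS]"] + list(question_tokens) + ["[SEP]"]
    segments = [0] * len(tokens)
    for seg_id, feats in enumerate(modality_tokens.values(), start=1):
        tokens.extend(feats)
        segments.extend([seg_id] * len(feats))
    return tokens, segments


# Strings stand in for feature vectors that a visual encoder would produce.
tokens, segments = build_fused_sequence(
    ["is", "there", "a", "road", "?"],
    {
        "vhr_rgb": ["rgb_0", "rgb_1"],
        "multispectral": ["ms_0"],
        "third_modality": ["m3_0", "m3_1"],  # placeholder name, see lead-in
    },
)
```

In a real model the segment ids would index a learned embedding added to each token, so the transformer can tell which modality a token came from.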
Findings
The proposed MM-RSVQA model achieved 65.56% accuracy on the VQA task.
Demonstrated the feasibility of multi-modal fusion in remote sensing.
Provided a flexible dataset pipeline for future research.
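The accuracy figure above is a percentage of correctly answered questions. A minimal sketch of such a metric follows, assuming exact string match after lowercasing and trimming whitespace; the summary does not state the paper's exact scoring rule, so the normalization and the function name `vqa_accuracy` are assumptions.

```python
from typing import Sequence


def vqa_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Return the percentage of predictions that match their reference answer.

    Assumed rule: case-insensitive exact match after stripping whitespace.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one question is required")
    correct = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return 100.0 * correct / len(references)


# Toy answers, not data from the paper.
score = vqa_accuracy(["yes", "3", "urban"], ["yes", "2", "Urban"])
```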
Abstract
The extraction of visual features is an essential step in Visual Question Answering (VQA). Building a good visual representation of the analyzed scene is indeed one of the essential keys for the system to be able to correctly understand the latter in order to answer complex questions. In many fields such as remote sensing, the visual feature extraction step could benefit significantly from leveraging different image modalities carrying complementary spectral, spatial and contextual information. In this work, we propose to add multiple image modalities to VQA in the particular context of remote sensing, leading to a novel task for the computer vision community. To this end, we introduce a new VQA dataset, named TAMMI (Text and Multi-Modal Imagery) with diverse questions on scenes described by three different modalities (very high resolution RGB, multi-spectral imaging data and synthetic…
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Multimodal Machine Learning Applications
Methods: VisualBERT
