Visual Question Answering in Remote Sensing with Cross-Attention and Multimodal Information Bottleneck

Jayesh Songara, Shivam Pande, Shabnam Choudhury, Biplab Banerjee and Rajbabu Velmurugan

arXiv:2306.14264 · cs.CV · June 27, 2023

Open Access

TL;DR

This paper introduces a cross-attention and information-bottleneck approach to visual question answering (VQA) in remote sensing, designed to cope with the high dimensionality and redundancy of remotely sensed images and with the difficulty of aligning image and language features.

Contribution

It proposes a novel combination of cross-attention and information maximization to improve VQA performance on remote sensing images.
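The page does not spell out the exact attention formulation, so the following is only a minimal sketch of what cross-attention between the two modalities can look like, assuming Transformer-style scaled dot-product attention. Question tokens (standing in for LSTM hidden states) attend over image regions (standing in for CNN feature-map cells), and vice versa. The toy feature values, the shared width of 3, the mean pooling and all function names (`cross_attention`, `mean_pool`, etc.) are hypothetical, and learned query/key/value projections are omitted for brevity.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    # (n x k) @ (k x m) -> (n x m)
    return [[sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def softmax_rows(a: Matrix) -> Matrix:
    # Row-wise softmax with max-subtraction for numerical stability.
    out = []
    for row in a:
        m = max(row)
        exps = [math.exp(x - m) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def cross_attention(queries: Matrix, keys: Matrix, values: Matrix) -> Tuple[Matrix, Matrix]:
    """Scaled dot-product attention: every query row attends over the key/value rows."""
    d = len(keys[0])
    scores = matmul(queries, transpose(keys))                 # (n_q x n_k)
    scaled = [[s / math.sqrt(d) for s in row] for row in scores]
    weights = softmax_rows(scaled)                            # each row sums to 1
    return matmul(weights, values), weights                   # (n_q x d_v), (n_q x n_k)


def mean_pool(rows: Matrix) -> List[float]:
    # Average over the sequence dimension.
    return [sum(col) / len(rows) for col in zip(*rows)]


# Hypothetical toy inputs: 4 image regions (e.g. a flattened 2x2 CNN feature map)
# and 2 question tokens (e.g. LSTM hidden states), both already of width 3.
image_feats = [[0.2, 0.1, 0.0], [0.9, 0.3, 0.1], [0.1, 0.8, 0.5], [0.4, 0.4, 0.4]]
question_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.5]]

# Question tokens attend to image regions; image regions attend to question tokens.
q2i, q2i_weights = cross_attention(question_feats, image_feats, image_feats)
i2q, i2q_weights = cross_attention(image_feats, question_feats, question_feats)

# Pool each attended sequence and concatenate into one joint vector for an answer head.
joint = mean_pool(q2i) + mean_pool(i2q)
```

In this toy input the first question token is most aligned with the second image region, so that region receives the largest weight in `q2i_weights[0]`; attending in both directions lets each modality highlight what is relevant in the other before the features are fused.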

Findings

01

Achieved overall accuracies of 79.11% and 73.87% on the two test sets of the high-resolution dataset.

02

Achieved an overall accuracy of 85.98% on the low-resolution dataset.

03

Demonstrated that the method is effective on datasets of different spatial resolutions.
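The findings are reported as accuracy percentages. RSVQA-style benchmarks commonly report overall accuracy (the fraction of all questions answered correctly) alongside average accuracy (the mean of per-question-type accuracies). The short sketch below assumes those definitions; the function names, predictions, ground-truth answers and question-type labels are invented toy data, not results from the paper.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def overall_accuracy(preds: List[str], golds: List[str]) -> float:
    # Fraction of all questions whose predicted answer matches the ground truth.
    correct = sum(p == g for p, g in zip(preds, golds))
    return correct / len(golds)


def average_accuracy(preds: List[str], golds: List[str],
                     qtypes: List[str]) -> Tuple[float, Dict[str, float]]:
    # Accuracy per question type, then the unweighted mean over types.
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for p, g, t in zip(preds, golds, qtypes):
        totals[t] += 1
        hits[t] += int(p == g)
    per_type = {t: hits[t] / totals[t] for t in totals}
    return sum(per_type.values()) / len(per_type), per_type


# Hypothetical toy predictions and answers.
preds = ["yes", "no", "yes", "0", "urban"]
golds = ["yes", "yes", "yes", "0", "rural"]
qtypes = ["presence", "presence", "comparison", "count", "rural_urban"]

oa = overall_accuracy(preds, golds)
aa, per_type = average_accuracy(preds, golds, qtypes)
```

Here 3 of 5 answers are correct, so `oa` is 0.6, while the per-type accuracies (0.5, 1.0, 1.0, 0.0) average to 0.625; the two metrics diverge whenever question types are unevenly represented.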

Abstract

In this research, we address the problem of visual question answering (VQA) in remote sensing. Although remotely sensed images contain information that is valuable for identification and object detection, they are challenging to process because of their high dimensionality, volume and redundancy. Furthermore, processing image information jointly with language features imposes further constraints, such as aligning corresponding image and language features. To handle this problem, we propose a cross-attention-based approach combined with information maximization. The CNN-LSTM-based cross-attention highlights the relevant information in the image and language modalities and establishes a connection between the two, while information maximization learns a low-dimensional bottleneck layer that retains all the information relevant to the VQA task. We evaluate our…
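The abstract describes information maximization as learning a low-dimensional bottleneck layer that keeps what the VQA task needs. One common way to train such a layer, shown here as a swapped-in technique that may differ from the authors' formulation, is a variational information bottleneck: an encoder predicts a mean and log-variance for the bottleneck code z, the answer head is trained with cross-entropy (minimizing it maximizes a variational lower bound on I(Z;Y)), and a β-weighted KL term toward N(0, I) (an upper bound on I(Z;X)) limits how much input detail z retains. All names, the β value, the two-dimensional z and the three-answer linear head are hypothetical.

```python
import math
import random
from typing import List


def reparameterize(mu: List[float], log_var: List[float], rng: random.Random) -> List[float]:
    # z = mu + sigma * eps with eps ~ N(0, 1); in an autodiff framework this
    # keeps the sample differentiable with respect to mu and log_var.
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu: List[float], log_var: List[float]) -> float:
    # Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ).
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))


def cross_entropy(logits: List[float], target: int) -> float:
    # Negative log-softmax of the target class, computed stably.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]


def vib_loss(logits: List[float], target: int, mu: List[float],
             log_var: List[float], beta: float = 1e-3) -> float:
    # Task term plus beta-weighted compression term.
    return cross_entropy(logits, target) + beta * kl_to_standard_normal(mu, log_var)


# Hypothetical encoder outputs for one fused image-question pair.
mu = [0.5, -0.2]
log_var = [-1.0, -0.5]
rng = random.Random(0)
z = reparameterize(mu, log_var, rng)

# Hypothetical linear answer head over 3 candidate answers.
answer_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
logits = [sum(w * zi for w, zi in zip(row, z)) for row in answer_weights]
loss = vib_loss(logits, target=0, mu=mu, log_var=log_var, beta=1e-3)
```

Raising `beta` compresses z more aggressively at some cost in answer accuracy; in a real model the same loss would be written in an autodiff framework so that gradients flow through the reparameterized sample.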

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning