HRVQA: A Visual Question Answering Benchmark for High-Resolution Aerial Images
Kun Li, George Vosselman, Michael Ying Yang

TL;DR
This paper introduces HRVQA, a large high-resolution aerial image dataset with over a million QA pairs, and proposes GFTransformer, a new model that advances VQA performance in aerial imagery.
Contribution
The paper presents a new high-resolution aerial image dataset with extensive annotations and a novel GFTransformer model with gated attention for improved VQA accuracy.
Findings
The dataset is highly challenging, especially for attribute-related questions.
GFTransformer outperforms previous state-of-the-art models on HRVQA.
The dataset and code will be publicly available.
Abstract
Visual question answering (VQA) is an important and challenging multimodal task in computer vision. Recently, a few efforts have been made to bring the VQA task to aerial images, owing to its potential real-world applications in disaster monitoring, urban planning, and digital earth product generation. However, both the huge variation in the appearance, scale, and orientation of the concepts in aerial images and the scarcity of well-annotated datasets restrict the development of VQA in this domain. In this paper, we introduce a new dataset, HRVQA, which provides 53,512 collected aerial images of 1024 × 1024 pixels and 1,070,240 semi-automatically generated QA pairs. To benchmark the understanding capability of VQA models for aerial images, we evaluate the relevant methods on HRVQA. Moreover, we propose a novel model, GFTransformer, with gated attention modules and a mutual fusion…
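As a quick check on the dataset scale quoted in the abstract, the two counts imply an average of 20 QA pairs per image. The sketch below only restates that arithmetic; the variable names are illustrative, and the abstract does not say whether every image has exactly 20 pairs.

```python
# Figures taken from the abstract; variable names are illustrative only.
num_images = 53512
num_qa_pairs = 1070240
image_side_px = 1024

# Average QA pairs per image (the abstract does not state the per-image distribution).
pairs_per_image, remainder = divmod(num_qa_pairs, num_images)

# Pixel count of a single 1024 x 1024 image.
pixels_per_image = image_side_px * image_side_px

print(f"average QA pairs per image: {pairs_per_image} (remainder {remainder})")
print(f"pixels per image: {pixels_per_image}")
```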
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
