Visual Question Answering on 360° Images

Shih-Han Chou; Wei-Lun Chao; Wei-Sheng Lai; Min Sun; Ming-Hsuan Yang

arXiv:2001.03339·cs.CV·January 13, 2020·6 cites



TL;DR

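This paper introduces VQA 360, a new task and dataset (around 17,000 image-question-answer triplets) for visual question answering on 360° images, and shows that a cubemap-based model handles the spatial reasoning these images demand better than a conventional model that takes the distorted equirectangular image as input.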

Contribution

The paper presents the first VQA 360 dataset, analyzes models tailored to 360° images, and establishes a benchmark for future research on this task.

Findings

01. The cubemap-based model outperforms the equirectangular-input model.

02. A significant gap remains between human and machine performance.

03. The proposed cubemap-based model aggregates information from multiple spatial resolutions.

Abstract

In this work, we introduce VQA 360, a novel task of visual question answering on 360 images. Unlike a normal field-of-view image, a 360 image captures the entire visual content around the optical center of a camera, demanding more sophisticated spatial understanding and reasoning. To address this problem, we collect the first VQA 360 dataset, containing around 17,000 real-world image-question-answer triplets for a variety of question types. We then study two different VQA models on VQA 360, including one conventional model that takes an equirectangular image (with intrinsic distortion) as input and one dedicated model that first projects a 360 image onto cubemaps and subsequently aggregates the information from multiple spatial resolutions. We demonstrate that the cubemap-based model with multi-level fusion and attention diffusion performs favorably against other variants and the…
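The cubemap projection mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`cube_face_directions`, `equirect_to_cubemap`), the face names, the axis convention (x right, y up, z forward), and the nearest-neighbor sampling are all assumptions chosen to keep the example short. It only shows how each pixel of six cube faces can be looked up in an equirectangular image.

```python
import numpy as np

# Hypothetical face names; the paper's own ordering is not given in this page.
FACES = ("front", "back", "right", "left", "up", "down")


def cube_face_directions(face, size):
    """Return unit view directions, shape (size, size, 3), for one cube face."""
    # Pixel centers mapped to [-1, 1] on the face plane.
    t = (np.arange(size) + 0.5) / size * 2.0 - 1.0
    u, v = np.meshgrid(t, -t)  # u grows to the right, v grows upward
    one = np.ones_like(u)
    # Assumed convention: x right, y up, z forward.
    axes = {
        "front": (u, v, one),
        "back": (-u, v, -one),
        "right": (one, v, -u),
        "left": (-one, v, u),
        "up": (u, one, -v),
        "down": (u, -one, v),
    }
    d = np.stack(axes[face], axis=-1)
    return d / np.linalg.norm(d, axis=-1, keepdims=True)


def equirect_to_cubemap(equi, size):
    """Sample six cube faces from an equirectangular image (nearest neighbor)."""
    h, w = equi.shape[:2]
    faces = {}
    for name in FACES:
        d = cube_face_directions(name, size)
        # Longitude in [-pi, pi], latitude in [-pi/2, pi/2].
        lon = np.arctan2(d[..., 0], d[..., 2])
        lat = np.arcsin(np.clip(d[..., 1], -1.0, 1.0))
        # Longitude 0 maps to the image center column; latitude +pi/2 to the top row.
        col = np.floor((lon / (2 * np.pi) + 0.5) * w).astype(int) % w
        row = np.clip(np.floor((0.5 - lat / np.pi) * h).astype(int), 0, h - 1)
        faces[name] = equi[row, col]
    return faces
```

Calling `equirect_to_cubemap(image, 256)` on an `(H, W, 3)` array would return a dictionary of six `(256, 256, 3)` faces. A real pipeline would more likely use bilinear interpolation instead of nearest-neighbor lookup, since the latter aliases near the poles.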


Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning