Pano-AVQA: Grounded Audio-Visual Question Answering on 360$^\circ$ Videos

Heeseung Yun; Youngjae Yu; Wonsuk Yang; Kangil Lee; Gunhee Kim

arXiv:2110.05122·cs.CV·October 12, 2021

TL;DR

Pano-AVQA is a large-scale dataset for grounded audio-visual question answering on 360° videos. It pairs spherical spatial relation questions and audio-visual relation questions with bounding-box grounding, and it is used to train and evaluate transformer-based models on panoramic scenes.

Contribution

The paper presents a new benchmark dataset, Pano-AVQA, containing spherical spatial relation and audio-visual relation questions, and reports results suggesting that transformer models benefit from the proposed spherical spatial embeddings and multimodal training objectives.

Findings

01

Transformer models with spherical embeddings improve understanding of panoramic scenes.

02

The dataset is built from 5.4K 360° video clips, annotated with two types of question-answer pairs with bounding-box grounding.

03

Multimodal training objectives improve semantic comprehension of 360° environments.

Abstract

360$^{\circ}$ videos convey holistic views of the surroundings of a scene. They provide audio-visual cues beyond a pre-determined normal field of view and display distinctive spatial relations on a sphere. However, previous benchmark tasks for panoramic videos are still limited in evaluating the semantic understanding of audio-visual relationships or spherical spatial properties of the surroundings. We propose a novel benchmark named Pano-AVQA, a large-scale grounded audio-visual question answering dataset on panoramic videos. Using 5.4K 360$^{\circ}$ video clips harvested online, we collect two types of novel question-answer pairs with bounding-box grounding: spherical spatial relation QAs and audio-visual relation QAs. We train several transformer-based models on Pano-AVQA, where the results suggest that our proposed spherical spatial embeddings and multimodal training objectives fairly…
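To illustrate why spatial relations on a sphere differ from those on a flat frame, here is a minimal sketch, assuming the frames are stored in equirectangular projection (an assumption; the abstract does not state the storage format). The function names, box format `(x_min, y_min, x_max, y_max)`, and frame size are hypothetical and are not taken from the paper or its repository. The sketch maps bounding-box centers to longitude and latitude and measures their separation as a great-circle angle, so two boxes on opposite edges of the frame come out as neighbors.

```python
import math


def box_center(box):
    """Center (x, y) of a pixel box given as (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    return (x_min + x_max) / 2.0, (y_min + y_max) / 2.0


def pixel_to_sphere(x, y, width, height):
    """Map an equirectangular pixel to (longitude, latitude) in radians.

    Longitude spans [-pi, pi) left to right; latitude spans
    [pi/2, -pi/2] top to bottom. This convention is an assumption.
    """
    lon = (x / width) * 2.0 * math.pi - math.pi
    lat = math.pi / 2.0 - (y / height) * math.pi
    return lon, lat


def great_circle_angle(p, q):
    """Angle in radians between two (lon, lat) points on the unit sphere."""
    lon1, lat1 = p
    lon2, lat2 = q
    cos_angle = (math.sin(lat1) * math.sin(lat2)
                 + math.cos(lat1) * math.cos(lat2) * math.cos(lon1 - lon2))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, cos_angle)))


def angular_separation(box_a, box_b, width, height):
    """Great-circle angle between the centers of two pixel boxes."""
    a = pixel_to_sphere(*box_center(box_a), width, height)
    b = pixel_to_sphere(*box_center(box_b), width, height)
    return great_circle_angle(a, b)


if __name__ == "__main__":
    W, H = 3840, 1920  # hypothetical frame size
    left_edge = (0, 950, 20, 970)       # centered at x = 10
    right_edge = (3820, 950, 3840, 970)  # centered at x = 3830
    angle = angular_separation(left_edge, right_edge, W, H)
    print(f"separation: {math.degrees(angle):.2f} degrees")
```

In pixel space the two example boxes are almost a full frame width apart, yet on the sphere they sit just across the seam from each other. This is the kind of relation that planar coordinates misrepresent.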

Code & Models

Repositories

hs-yn/panoavqa
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and Audio Processing · Advanced Image and Video Retrieval Techniques