Multilevel Hierarchical Network with Multiscale Sampling for Video Question Answering

Min Peng; Chongyang Wang; Yuan Gao; Yu Shi; Xiang-Dong Zhou

arXiv:2205.04061·cs.CV·May 10, 2022·5 cites

TL;DR

This paper introduces the Multilevel Hierarchical Network (MHN) for VideoQA, which uses multiscale sampling to integrate visual appearance and motion information at several temporal scales and thereby improve question answering accuracy.

Contribution

It proposes a novel MHN architecture whose Recurrent Multimodal Interaction (RMI) and Parallel Visual Reasoning (PVR) modules build on multiscale sampling to enhance visual understanding in VideoQA tasks.

Findings

01. Achieved superior performance on three VideoQA datasets.

02. Validated the effectiveness of multiscale sampling and hierarchical processing.

03. Outperformed previous state-of-the-art methods.

Abstract

Video question answering (VideoQA) is challenging given its multimodal combination of visual understanding and natural language processing. While most existing approaches ignore the visual appearance-motion information at different temporal scales, it is unknown how to incorporate the multilevel processing capacity of a deep learning model with such multiscale information. Targeting these issues, this paper proposes a novel Multilevel Hierarchical Network (MHN) with multiscale sampling for VideoQA. MHN comprises two modules, namely Recurrent Multimodal Interaction (RMI) and Parallel Visual Reasoning (PVR). With multiscale sampling, RMI iterates the interaction of appearance-motion information at each scale and the question embeddings to build the multilevel question-guided visual representations. Thereon, with a shared transformer encoder, PVR infers the visual cues at each level in…
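
The multiscale sampling idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the sampling rule, so everything concrete here is an assumption made for illustration: the function names `sample_clips` and `multiscale_sample`, the choice of `2**s` clips at scale `s`, the fixed `clip_len`, and the even spacing of clip centers. It should not be read as the authors' implementation.

```python
def sample_clips(num_frames, num_clips, clip_len):
    """Return `num_clips` lists of frame indices spread evenly over a video.

    Hypothetical helper: clip centers are evenly spaced and indices are
    clamped to the valid range [0, num_frames - 1].
    """
    if num_frames < 1 or num_clips < 1 or clip_len < 1:
        raise ValueError("num_frames, num_clips and clip_len must be positive")
    clips = []
    for k in range(num_clips):
        # Center of the k-th clip when the video is cut into equal segments.
        center = (k + 0.5) * num_frames / num_clips
        start = int(round(center - clip_len / 2))
        # Clamp every index so clips near the borders stay inside the video.
        clips.append(
            [min(max(start + j, 0), num_frames - 1) for j in range(clip_len)]
        )
    return clips


def multiscale_sample(num_frames, num_scales, clip_len=4):
    """Sample clips at several temporal scales.

    Assumption (not stated in the abstract): scale s uses 2**s clips, so
    coarse scales see the video through few clips and fine scales through many.
    """
    return [sample_clips(num_frames, 2 ** s, clip_len) for s in range(num_scales)]


if __name__ == "__main__":
    for s, clips in enumerate(multiscale_sample(num_frames=64, num_scales=3)):
        print(f"scale {s}: {len(clips)} clip(s), start frames {[c[0] for c in clips]}")
```

In a pipeline of the kind the abstract describes, the frames of each clip would feed the appearance and motion feature extractors, giving one set of features per scale for the question-guided interaction to consume.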

Code & Models

Repositories

Mvrjustid/MHN-IJCAI22 (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition