Multilevel Hierarchical Network with Multiscale Sampling for Video Question Answering
Min Peng, Chongyang Wang, Yuan Gao, Yu Shi, Xiang-Dong Zhou

TL;DR
This paper introduces a Multilevel Hierarchical Network (MHN) with multiscale sampling for VideoQA, which integrates visual appearance and motion information across temporal scales to improve question answering accuracy.
Contribution
The paper proposes the MHN architecture, whose Recurrent Multimodal Interaction (RMI) and Parallel Visual Reasoning (PVR) modules use multiscale sampling to build and reason over multilevel question-guided visual representations for VideoQA.
Findings
Achieved superior performance on three VideoQA datasets.
Validated the effectiveness of multiscale sampling and hierarchical processing.
Outperformed previous state-of-the-art methods.
Abstract
Video question answering (VideoQA) is challenging given its multimodal combination of visual understanding and natural language processing. While most existing approaches ignore the visual appearance-motion information at different temporal scales, it is unknown how to incorporate the multilevel processing capacity of a deep learning model with such multiscale information. Targeting these issues, this paper proposes a novel Multilevel Hierarchical Network (MHN) with multiscale sampling for VideoQA. MHN comprises two modules, namely Recurrent Multimodal Interaction (RMI) and Parallel Visual Reasoning (PVR). With a multiscale sampling, RMI iterates the interaction of appearance-motion information at each scale and the question embeddings to build the multilevel question-guided visual representations. Thereon, with a shared transformer encoder, PVR infers the visual cues at each level in…
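The abstract does not spell out how the multiscale sampling is performed, so the following is only a minimal sketch under an assumed scheme: at scale s the video is split into 2**s equal segments, and a fixed number of evenly spaced frames is taken from each segment. The function name `multiscale_clip_indices` and the parameters `num_scales` and `clip_len` are hypothetical and not taken from the paper.

```python
from typing import List


def multiscale_clip_indices(
    num_frames: int, num_scales: int, clip_len: int
) -> List[List[List[int]]]:
    """Return frame indices grouped as [scale][clip][frame].

    Assumed scheme (not confirmed by the source): scale s splits the
    video into 2**s equal segments and samples clip_len evenly spaced
    frames from the middle of each sub-interval of a segment.
    """
    scales = []
    for s in range(num_scales):
        num_clips = 2 ** s                 # coarse scale: 1 clip; finer: 2, 4, ...
        seg = num_frames / num_clips       # segment length in frames (float)
        step = seg / clip_len              # spacing between sampled frames
        clips = []
        for c in range(num_clips):
            start = c * seg
            clip = [
                min(num_frames - 1, int(start + (i + 0.5) * step))
                for i in range(clip_len)
            ]
            clips.append(clip)
        scales.append(clips)
    return scales


scales = multiscale_clip_indices(num_frames=64, num_scales=3, clip_len=4)
for level, clips in enumerate(scales):
    print(level, clips)
```

In a pipeline of the kind the abstract describes, each scale's clips would feed appearance and motion feature extractors, and the per-scale features would then interact with the question embeddings one level at a time.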
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition
