Glance and Focus: Memory Prompting for Multi-Event Video Question Answering
Ziyi Bai, Ruiping Wang, Xilin Chen

TL;DR
The paper introduces the Glance-Focus model for Multi-Event Video Question Answering, using dynamic memory generation and multi-level attention to improve reasoning over complex videos, achieving state-of-the-art results.
Contribution
It proposes a novel memory prompting approach with dynamic event memories and multi-level attention for better reasoning in VideoQA tasks.
Findings
Achieves state-of-the-art results on four VideoQA benchmarks.
Outperforms large vision-language models on complex multi-event reasoning.
Demonstrates the effectiveness of dynamically generated event memories.
Abstract
Video Question Answering (VideoQA) has emerged as a vital tool to evaluate agents' ability to understand human daily behaviors. Despite the recent success of large vision language models in many multi-modal tasks, complex situation reasoning over videos involving multiple human-object interaction events remains challenging. In contrast, humans can easily tackle it by using a series of episode memories as anchors to quickly locate question-related key moments for reasoning. To mimic this effective reasoning strategy, we propose the Glance-Focus model. One simple way is to apply an action detection model to predict a set of actions as key memories. However, these actions, drawn from a closed-set vocabulary, are hard to generalize to various video domains. Instead, we train an Encoder-Decoder to generate a set of dynamic event memories at the glancing stage. Apart from using…
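The glance-then-focus idea in the abstract can be illustrated with a toy sketch. This is not the authors' architecture: the paper trains an Encoder-Decoder, whereas the code below uses a single untrained scaled dot-product attention step with random vectors, purely to show the data flow. The function names `glance` and `focus`, the feature sizes, and the two-step attention in `focus` are all assumptions made for illustration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(queries, keys, values):
    """Scaled dot-product attention; returns outputs and attention weights."""
    scale = math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        w = softmax([dot(q, k) / scale for k in keys])
        out = [sum(wi * v[d] for wi, v in zip(w, values))
               for d in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


def glance(frame_feats, memory_queries):
    """Hypothetical glance step: a small set of queries summarizes the
    video into event memories (learned in a real model, random here)."""
    memories, _ = attend(memory_queries, frame_feats, frame_feats)
    return memories


def focus(question_feat, memories, frame_feats):
    """Hypothetical focus step: the question first attends to the memories
    (the anchors), then the resulting anchor vector attends to the frames."""
    anchor, mem_w = attend([question_feat], memories, memories)
    _, frame_w = attend(anchor, frame_feats, frame_feats)
    return mem_w[0], frame_w[0]


random.seed(0)
dim, n_frames, n_mem = 8, 16, 4  # illustrative sizes, not from the paper
frames = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_frames)]
queries = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_mem)]
question = [random.gauss(0, 1) for _ in range(dim)]

memories = glance(frames, queries)
mem_w, frame_w = focus(question, memories, frames)
print("memory weights:", [round(w, 3) for w in mem_w])
print("most attended frame:", frame_w.index(max(frame_w)))
```

In this sketch the memories are not tied to any fixed action vocabulary; they are just weighted mixtures of frame features, which loosely mirrors the abstract's contrast between closed-set action detection and generated event memories.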
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
Methods: Sparse Evolutionary Training
