Glance and Focus: Memory Prompting for Multi-Event Video Question Answering

Ziyi Bai; Ruiping Wang; Xilin Chen

arXiv:2401.01529 · cs.CV · January 4, 2024 · 2 cites

TL;DR

The paper introduces the Glance-Focus model for Multi-Event Video Question Answering, using dynamic memory generation and multi-level attention to improve reasoning over complex videos, achieving state-of-the-art results.

Contribution

It proposes a novel memory prompting approach with dynamic event memories and multi-level attention for better reasoning in VideoQA tasks.

Findings

01. Achieves state-of-the-art results on four VideoQA benchmarks.

02. Outperforms large models in complex multi-event reasoning.

03. Demonstrates effectiveness of dynamic memory generation methods.

Abstract

Video Question Answering (VideoQA) has emerged as a vital tool to evaluate agents' ability to understand human daily behaviors. Despite the recent success of large vision language models in many multi-modal tasks, complex situation reasoning over videos involving multiple human-object interaction events remains challenging. In contrast, humans can easily tackle it by using a series of episode memories as anchors to quickly locate question-related key moments for reasoning. To mimic this effective reasoning strategy, we propose the Glance-Focus model. One simple way to obtain such memories is to apply an action detection model to predict a set of actions as key memories. However, these actions within a closed set vocabulary are hard to generalize to various video domains. Instead, we train an Encoder-Decoder to generate a set of dynamic event memories at the glancing stage. Apart from using…
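The glance-then-focus idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the paper trains a Transformer Encoder-Decoder, whereas the code below uses plain, untrained scaled dot-product attention on random vectors. The function names (`attend`, `glance`, `focus`), the feature sizes, and the way the question is combined with the recalled memory are all assumptions made for illustration only.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(queries, keys, values):
    """Scaled dot-product attention: each query gets a weighted mix of values."""
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return outputs


def glance(frame_feats, memory_queries):
    """Glance (hypothetical): a small set of queries summarizes the whole
    video into a few event memories."""
    return attend(memory_queries, frame_feats, frame_feats)


def focus(question_feat, memories, frame_feats):
    """Focus (hypothetical): the question first recalls a relevant memory,
    then the memory-prompted question attends to individual frames."""
    recalled = attend([question_feat], memories, memories)[0]
    prompted = [q + r for q, r in zip(question_feat, recalled)]
    return attend([prompted], frame_feats, frame_feats)[0]


# Toy data: 12 frames, 4 memory queries, 8-dimensional features (all assumed).
rng = random.Random(0)
DIM, NUM_FRAMES, NUM_MEMORIES = 8, 12, 4
frames = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_FRAMES)]
mem_queries = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_MEMORIES)]
question = [rng.gauss(0, 1) for _ in range(DIM)]

memories = glance(frames, mem_queries)
answer_feat = focus(question, memories, frames)
```

In this sketch the memories act as a compressed index of the video: the question is matched against a handful of memory vectors before it is matched against every frame. In a trained model the memory queries and the attention projections would be learned, and the final feature would be passed to an answer classifier.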

Code & Models

Repositories

byz0e/glance-focus (PyTorch, official)

Videos

Glance and Focus: Memory Prompting for Multi-Event Video Question Answering · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning

Methods: Sparse Evolutionary Training