Multi-object event graph representation learning for Video Question Answering

Yanan Wang, Shuichiro Haruta, Donghuo Zeng, Julio Vizcarra, and Mori Kurokawa

arXiv:2409.07747 · cs.CV · September 13, 2024

TL;DR

This paper introduces CLanG, a contrastive learning method that uses a multi-layer GNN-cluster module to capture complex multi-object events for video question answering, reporting up to 2.2% higher accuracy on the NExT-QA and TGIF-QA-R datasets.

Contribution

The paper proposes a contrastive learning approach built on a multi-layer GNN-cluster module to improve multi-object event understanding in VideoQA, and reports gains over baseline methods on NExT-QA and TGIF-QA-R.

Findings

01 Achieves up to 2.2% higher accuracy on NExT-QA and TGIF-QA-R datasets.

02 Outperforms baselines in causal and temporal reasoning by 2.8%.

03 Effectively models complex multi-object events in videos.

Abstract

Video question answering (VideoQA) is the task of predicting the correct answer to a question posed about a given video. The system must comprehend spatial and temporal relationships among objects extracted from videos to perform causal and temporal reasoning. While prior works have focused on modeling individual object movements using transformer-based methods, they falter when capturing complex scenarios involving multiple objects (e.g., "a boy is throwing a ball in a hoop"). We propose a contrastive language event graph representation learning method called CLanG to address this limitation. Aiming to capture event representations associated with multiple objects, our method employs a multi-layer GNN-cluster module for adversarial graph representation learning, enabling contrastive learning between the question text and its relevant multi-object event graph. Our method outperforms a strong…
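The abstract describes contrastive learning between a question's text and its relevant event graph but, as shown here, gives no loss formula. The sketch below illustrates the general idea with a generic symmetric InfoNCE-style loss over paired text and graph embeddings. It is not taken from the paper: the function names, the temperature value of 0.07, and the use of plain cosine similarity are all assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_prob_of_match(sim, i):
    """Log-softmax probability that row i selects its paired column i."""
    row = sim[i]
    m = max(row)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in row))
    return row[i] - log_z


def contrastive_loss(text_emb, graph_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss (hypothetical stand-in, not the paper's loss).

    text_emb[i] and graph_emb[i] are assumed to be a matching
    question / event-graph pair; all other pairings act as negatives.
    """
    t = [l2_normalize(v) for v in text_emb]
    g = [l2_normalize(v) for v in graph_emb]
    n = len(t)
    # Temperature-scaled cosine similarity between every text and every graph.
    sim = [[sum(a * b for a, b in zip(t[i], g[j])) / temperature
            for j in range(n)] for i in range(n)]
    sim_t = [[sim[i][j] for i in range(n)] for j in range(n)]  # transpose
    text_to_graph = -sum(log_prob_of_match(sim, i) for i in range(n)) / n
    graph_to_text = -sum(log_prob_of_match(sim_t, i) for i in range(n)) / n
    return 0.5 * (text_to_graph + graph_to_text)


# Toy check: correctly paired embeddings should give a much lower loss
# than embeddings whose pairing has been swapped.
text = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = contrastive_loss(text, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = contrastive_loss(text, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setup the loss is close to zero when each question embedding points in the same direction as its own graph embedding, and large when the pairing is swapped. That is the behavior a contrastive objective of this kind is meant to encourage.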

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Graph Neural Networks · Topic Modeling

Methods: Contrastive Learning