Multi-object event graph representation learning for Video Question Answering
Yanan Wang, Shuichiro Haruta, Donghuo Zeng, Julio Vizcarra, and Mori Kurokawa

TL;DR
This paper introduces CLanG, a contrastive learning method that uses a multi-layer GNN-cluster module to capture complex multi-object events for video question answering, yielding up to 2.2% higher accuracy on the NExT-QA and TGIF-QA-R datasets.
Contribution
The paper proposes a novel contrastive learning approach with GNN-cluster modules to enhance multi-object event understanding in VideoQA, outperforming existing methods.
Findings
Achieves up to 2.2% higher accuracy than baselines on the NExT-QA and TGIF-QA-R datasets.
Outperforms baselines in causal and temporal reasoning by 2.8%.
Effectively models complex multi-object events in videos.
Abstract
Video question answering (VideoQA) is a task to predict the correct answer to questions posed about a given video. The system must comprehend spatial and temporal relationships among objects extracted from videos to perform causal and temporal reasoning. While prior works have focused on modeling individual object movements using transformer-based methods, they falter when capturing complex scenarios involving multiple objects (e.g., "a boy is throwing a ball in a hoop"). We propose a contrastive language event graph representation learning method called CLanG to address this limitation. Aiming to capture event representations associated with multiple objects, our method employs a multi-layer GNN-cluster module for adversarial graph representation learning, enabling contrastive learning between the question text and its relevant multi-object event graph. Our method outperforms a strong…
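The contrastive step mentioned in the abstract, aligning a question with its relevant event graph, can be illustrated with a generic symmetric InfoNCE-style loss. This is only a sketch under stated assumptions, not the paper's actual objective: the function name `contrastive_loss`, the temperature of 0.07, and the plain-list embeddings (stand-ins for the outputs of a text encoder and the GNN-cluster module) are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def log_softmax(row):
    """Numerically stable log-softmax over one row of similarity scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(text_embs, graph_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch of (question, event-graph) pairs.

    Assumption: text_embs[i] and graph_embs[i] describe the same video
    question, and every other pairing in the batch acts as a negative.
    """
    assert text_embs and len(text_embs) == len(graph_embs)
    t = [l2_normalize(v) for v in text_embs]
    g = [l2_normalize(v) for v in graph_embs]
    n = len(t)

    # Cosine similarity between every question and every graph, scaled
    # by the (assumed) temperature.
    sim = [
        [sum(a * b for a, b in zip(t[i], g[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    # Question -> graph direction: the matching graph sits on the diagonal.
    text_to_graph = -sum(log_softmax(sim[i])[i] for i in range(n)) / n

    # Graph -> question direction uses the transposed similarity matrix.
    sim_t = [list(col) for col in zip(*sim)]
    graph_to_text = -sum(log_softmax(sim_t[j])[j] for j in range(n)) / n

    return 0.5 * (text_to_graph + graph_to_text)


if __name__ == "__main__":
    questions = [[1.0, 0.0], [0.0, 1.0]]
    matched_graphs = [[0.9, 0.1], [0.2, 0.8]]
    shuffled_graphs = [[0.2, 0.8], [0.9, 0.1]]
    print("matched :", contrastive_loss(questions, matched_graphs))
    print("shuffled:", contrastive_loss(questions, shuffled_graphs))
```

In this toy setup the loss is lower when each question is paired with the graph embedding it is closest to, which is the behavior a contrastive objective is meant to reward. How CLanG builds its event graphs and selects negatives is not covered by this sketch.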
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Graph Neural Networks · Topic Modeling
Methods: Contrastive Learning
