GHR-VQA: Graph-guided Hierarchical Relational Reasoning for Video Question Answering
Dionysia Danai Brilli, Dimitrios Mallis, Vassilis Pitsikalis, Petros Maragos

TL;DR
GHR-VQA introduces a graph-based hierarchical reasoning framework for video question answering that models human-object interactions across frames, improving interpretability and reasoning accuracy.
Contribution
The paper presents a novel scene graph-based hierarchical reasoning approach for Video QA, emphasizing human-centric interactions and cross-frame reasoning.
Findings
Achieves a 7.3% improvement in object-relation reasoning on the AGQA dataset.
Effectively models human-object interactions with scene graphs across video frames.
Enhances interpretability through explicit human-rooted structure.
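As a rough illustration of the human-rooted structure described in these findings, the sketch below builds a video-level graph from per-frame scene graphs and links every human node to a single global root. It is a minimal sketch under stated assumptions, not the authors' implementation: the triple format `(subject, relation, object)`, the `"person"` label, the `"root"` edge type, and the names `VideoGraph` and `build_video_graph` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class VideoGraph:
    # Each node is (frame_index, label); the global root uses frame_index -1.
    nodes: list = field(default_factory=list)
    # Each edge is (source_id, target_id, relation).
    edges: list = field(default_factory=list)

    def add_node(self, frame_idx, label):
        self.nodes.append((frame_idx, label))
        return len(self.nodes) - 1


def build_video_graph(frames, human_label="person"):
    """Merge per-frame scene graphs and attach each human node to one root.

    `frames` is a list of frames; each frame is a list of
    (subject, relation, object) triples (an assumed input format).
    """
    graph = VideoGraph()
    root = graph.add_node(-1, "ROOT")
    for t, triples in enumerate(frames):
        ids = {}  # label -> node id, scoped to this frame only
        for subj, rel, obj in triples:
            for label in (subj, obj):
                if label not in ids:
                    ids[label] = graph.add_node(t, label)
                    if label == human_label:
                        # Cross-frame link: root -> human node of frame t.
                        graph.edges.append((root, ids[label], "root"))
            graph.edges.append((ids[subj], ids[obj], rel))
    return graph


# Two toy frames with hand-written scene-graph triples.
frames = [
    [("person", "holding", "cup")],
    [("person", "sitting_on", "chair"), ("person", "holding", "cup")],
]
g = build_video_graph(frames)
```

In this toy example the only path between the two frames runs through the root and the human nodes, which is one way to read the "cross-frame reasoning centered on human actors" described in the abstract.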
Abstract
We propose GHR-VQA, Graph-guided Hierarchical Relational Reasoning for Video Question Answering (Video QA), a novel human-centric framework that incorporates scene graphs to capture intricate human-object interactions within video sequences. Unlike traditional pixel-based methods, GHR-VQA represents each frame as a scene graph and links human nodes across frames to a global root, forming the video-level graph and enabling cross-frame reasoning centered on human actors. The video-level graphs are then processed by Graph Neural Networks (GNNs), transforming them into rich, context-aware embeddings for efficient processing. Finally, these embeddings are integrated with question features in a hierarchical network operating across different abstraction levels, enhancing both local and global understanding of video content. This explicit human-rooted structure enhances interpretability by…
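The abstract says the video-level graph is processed by GNNs but does not specify the layer type here. The sketch below is therefore a generic stand-in, not the paper's architecture: one round of mean-neighbour aggregation with no learned weights, written in plain Python. The function name `mean_message_pass` and the toy features are hypothetical. It shows how information from a human node in one frame can reach a human node in another frame through the shared root after two rounds.

```python
def mean_message_pass(features, edges):
    """One round of mean aggregation over a node and its neighbours.

    `features` is a list of equal-length float vectors, one per node.
    `edges` is a list of (src, dst) pairs, treated as undirected.
    """
    n = len(features)
    neighbours = [[] for _ in range(n)]
    for src, dst in edges:
        neighbours[src].append(dst)
        neighbours[dst].append(src)

    updated = []
    for i in range(n):
        # Average the node's own vector with those of its neighbours.
        group = [features[i]] + [features[j] for j in neighbours[i]]
        dim = len(features[i])
        updated.append(
            [sum(vec[d] for vec in group) / len(group) for d in range(dim)]
        )
    return updated


# Node 0 is the global root; nodes 1 and 2 are human nodes from two frames.
features = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
edges = [(0, 1), (0, 2)]

h1 = mean_message_pass(features, edges)  # root now mixes both frames
h2 = mean_message_pass(h1, edges)        # each human node sees the other frame
```

After the first round only the root combines both frames; after the second, the second component of node 1's vector becomes non-zero, meaning it has received signal that originated at node 2. A real GNN layer would add learned transformations and edge types on top of this aggregation step.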
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Graph Neural Networks · Human Pose and Action Recognition
