GHR-VQA: Graph-guided Hierarchical Relational Reasoning for Video Question Answering

Dionysia Danai Brilli; Dimitrios Mallis; Vassilis Pitsikalis; Petros Maragos

arXiv:2511.20201 · cs.CV · November 26, 2025

TL;DR

GHR-VQA introduces a graph-based hierarchical reasoning framework for video question answering that models human-object interactions across frames, improving interpretability and reasoning accuracy.

Contribution

The paper presents a novel scene graph-based hierarchical reasoning approach for Video QA, emphasizing human-centric interactions and cross-frame reasoning.

Findings

01. Achieves a 7.3% improvement in object-relation reasoning on the AGQA dataset.

02. Models human-object interactions across video frames using scene graphs.

03. Enhances interpretability through an explicit human-rooted graph structure.

Abstract

We propose GHR-VQA, Graph-guided Hierarchical Relational Reasoning for Video Question Answering (Video QA), a novel human-centric framework that incorporates scene graphs to capture intricate human-object interactions within video sequences. Unlike traditional pixel-based methods, each frame is represented as a scene graph and human nodes across frames are linked to a global root, forming the video-level graph and enabling cross-frame reasoning centered on human actors. The video-level graphs are then processed by Graph Neural Networks (GNNs), transforming them into rich, context-aware embeddings for efficient processing. Finally, these embeddings are integrated with question features in a hierarchical network operating across different abstraction levels, enhancing both local and global understanding of video content. This explicit human-rooted structure enhances interpretability by…
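
The graph construction described in the abstract (per-frame scene graphs whose human nodes are linked to a single global root) might be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `FrameGraph` class, the `build_video_graph` function, the `"person"` label, and the `"has_human"` relation name are all hypothetical, and the GNN encoding and question-fusion stages are omitted.

```python
from dataclasses import dataclass, field

ROOT = "root"  # hypothetical id for the global root node


@dataclass
class FrameGraph:
    """One frame's scene graph: node labels plus (subject, relation, object) triplets."""
    nodes: dict                                # local node id -> label, e.g. "person"
    edges: list = field(default_factory=list)  # (src_id, relation, dst_id)


def build_video_graph(frames, human_label="person"):
    """Merge per-frame scene graphs and link every human node to one global root."""
    nodes = {ROOT: "root"}
    edges = []
    for t, frame in enumerate(frames):
        for local_id, label in frame.nodes.items():
            # Prefix node ids with the frame index so frames do not collide.
            nodes[(t, local_id)] = label
            if label == human_label:
                # Assumed relation name; the root is what connects frames.
                edges.append((ROOT, "has_human", (t, local_id)))
        for src, rel, dst in frame.edges:
            edges.append(((t, src), rel, (t, dst)))
    return nodes, edges


# Toy two-frame video with made-up labels and relations.
frames = [
    FrameGraph({0: "person", 1: "cup"}, [(0, "holding", 1)]),
    FrameGraph({0: "person", 1: "table"}, [(0, "standing_next_to", 1)]),
]
nodes, edges = build_video_graph(frames)
root_edges = [e for e in edges if e[0] == ROOT]
```

In this sketch the only path between nodes of different frames runs through the root via human nodes, which is one way to read the "human-rooted" cross-frame structure the abstract describes.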

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Graph Neural Networks · Human Pose and Action Recognition