HyperGLM: HyperGraph for Video Scene Graph Generation and Anticipation

Trong-Thuan Nguyen; Pha Nguyen; Jackson Cothren; Alper Yilmaz; Khoa Luu

arXiv:2411.18042 · cs.CV · April 1, 2025

TL;DR

HyperGLM injects a unified scene HyperGraph into a multimodal LLM to improve reasoning about complex multi-object interactions in videos, and it outperforms state-of-the-art methods on five tasks.

Contribution

The paper presents HyperGLM, a novel HyperGraph framework that integrates spatial and causal relationships for improved video scene understanding and reasoning.

Findings

01. Outperforms state-of-the-art methods on five tasks

02. Effectively models complex multi-object interactions

03. Introduces a new large-scale Video Scene Graph Reasoning dataset

Abstract

Multimodal LLMs have advanced vision-language tasks but still struggle with understanding video scenes. To bridge this gap, Video Scene Graph Generation (VidSGG) has emerged to capture multi-object relationships across video frames. However, prior methods rely on pairwise connections, limiting their ability to handle complex multi-object interactions and reasoning. To this end, we propose Multimodal LLMs on a Scene HyperGraph (HyperGLM), promoting reasoning about multi-way interactions and higher-order relationships. Our approach uniquely integrates entity scene graphs, which capture spatial relationships between objects, with a procedural graph that models their causal transitions, forming a unified HyperGraph. Significantly, HyperGLM enables reasoning by injecting this unified HyperGraph into LLMs. Additionally, we introduce a new Video Scene Graph Reasoning (VSGR) dataset featuring…
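
To make the structure described in the abstract concrete, the following is a minimal Python sketch of a container that holds per-frame entity scene graph triplets, procedural transition edges between them, and multi-entity hyperedges. It is an illustration under stated assumptions, not the authors' implementation: the class names `Triplet` and `SceneHyperGraph`, the rule that a transition is recorded whenever the predicate of a subject-object pair changes, and the example labels are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Triplet:
    """One pairwise relationship in a single frame (hypothetical schema)."""
    frame: int
    subject: str
    predicate: str
    obj: str


@dataclass
class SceneHyperGraph:
    """Toy container: pairwise triplets, transitions, and hyperedges."""
    triplets: list = field(default_factory=list)     # entity scene graph edges
    transitions: list = field(default_factory=list)  # (from_index, to_index) pairs
    hyperedges: list = field(default_factory=list)   # (label, frozenset of entities)

    def add_triplet(self, frame, subject, predicate, obj):
        """Store a pairwise relationship and return its index."""
        self.triplets.append(Triplet(frame, subject, predicate, obj))
        return len(self.triplets) - 1

    def link_transitions(self):
        """Assumed rule: link two successive states of the same
        (subject, object) pair whenever the predicate changes."""
        self.transitions.clear()
        last_seen = {}
        order = sorted(range(len(self.triplets)),
                       key=lambda i: self.triplets[i].frame)
        for idx in order:
            t = self.triplets[idx]
            key = (t.subject, t.obj)
            prev = last_seen.get(key)
            if prev is not None and self.triplets[prev].predicate != t.predicate:
                self.transitions.append((prev, idx))
            last_seen[key] = idx

    def add_hyperedge(self, label, entities):
        """Group any number of entities under one multi-way relation."""
        self.hyperedges.append((label, frozenset(entities)))


# Illustrative usage with made-up labels.
graph = SceneHyperGraph()
graph.add_triplet(0, "person", "holding", "cup")
graph.add_triplet(0, "person", "sitting_at", "table")
graph.add_triplet(1, "person", "drinking_from", "cup")
graph.link_transitions()
graph.add_hyperedge("drinking at the table", ["person", "cup", "table"])
```

The hyperedge is the part a pairwise scene graph cannot express directly: it ties three entities together in one relation, whereas each triplet connects only two.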

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition