Understanding Long Videos via LLM-Powered Entity Relation Graphs
Meng Chu, Yicong Li, Tat-Seng Chua

TL;DR
This paper introduces GraphVideoAgent, a novel system that uses graph-based object tracking combined with large language models to improve understanding of long videos by capturing temporal relationships and interactions more effectively.
Contribution
The paper presents a dynamic graph framework integrated with LLMs for enhanced long video analysis, outperforming existing methods in accuracy and efficiency.
Findings
Achieved a +2.2 improvement over existing methods on the EgoSchema dataset
Achieved a +2.0 improvement over existing methods on the NExT-QA benchmark
Needed to analyze only about 8 frames on average
Abstract
The analysis of extended video content poses unique challenges in artificial intelligence, particularly when dealing with the complexity of tracking and understanding visual elements across time. Current methodologies that process video frames sequentially struggle to maintain coherent tracking of objects, especially when these objects temporarily vanish and later reappear in the footage. A critical limitation of these approaches is their inability to effectively identify crucial moments in the video, largely due to their limited grasp of temporal relationships. To overcome these obstacles, we present GraphVideoAgent, a cutting-edge system that leverages the power of graph-based object tracking in conjunction with large language model capabilities. At its core, our framework employs a dynamic graph structure that maps and monitors the evolving relationships between visual entities…
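To make the idea of a dynamic graph over visual entities more concrete, here is a minimal Python sketch. It is not the authors' implementation: the class `EntityGraph`, its methods `observe` and `gaps`, and the example labels are all hypothetical names chosen for illustration. It only shows how a graph keyed by entity can record relations over time and notice when an object vanishes and later reappears.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class EntityGraph:
    """Toy dynamic graph (hypothetical): nodes are entities, edges are relations."""

    # Frame indices that have been processed, in the order they were observed.
    frames: list = field(default_factory=list)
    # Entity name -> frame indices where the entity was seen.
    sightings: dict = field(default_factory=lambda: defaultdict(list))
    # (entity_a, entity_b) -> list of (frame index, relation label).
    relations: dict = field(default_factory=lambda: defaultdict(list))

    def observe(self, frame, entities, relations=()):
        """Record the entities and (a, label, b) relations seen in one frame."""
        self.frames.append(frame)
        for name in entities:
            self.sightings[name].append(frame)
        for a, label, b in relations:
            self.relations[(a, b)].append((frame, label))

    def gaps(self, name):
        """Return (last_seen, reappeared) pairs with an observed frame in between."""
        seen = self.sightings.get(name, [])
        pairs = []
        for prev, nxt in zip(seen, seen[1:]):
            # The entity was absent from at least one processed frame.
            if any(prev < f < nxt for f in self.frames):
                pairs.append((prev, nxt))
        return pairs


graph = EntityGraph()
graph.observe(0, ["person", "cup"], [("person", "holds", "cup")])
graph.observe(40, ["person"])  # the cup is out of view here
graph.observe(90, ["person", "cup"], [("person", "drinks_from", "cup")])

print(graph.gaps("cup"))     # [(0, 90)]
print(graph.gaps("person"))  # []
```

In a setup like this, the stored sightings and relation history could be summarized as text and handed to a language model when it decides which frames to inspect next. That hand-off is an assumption about how such a graph might be used, not a description of the paper's pipeline.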
Taxonomy
Topics: Video Analysis and Summarization · Machine Learning in Healthcare · Generative Adversarial Networks and Image Synthesis
