GraphThinker: Reinforcing Temporally Grounded Video Reasoning with Event Graph Thinking
Zixu Cheng, Da Li, Jian Hu, Yuhang Zang, Ziquan Liu, Shaogang Gong, Wei Li

TL;DR
GraphThinker enhances video reasoning by explicitly modeling event relations and reinforcing visual grounding, significantly reducing hallucinations and improving localization and question-answering accuracy.
Contribution
GraphThinker introduces a structured event representation and a visual attention reward for reinforcement finetuning of multimodal models, targeting weak visual grounding and hallucination.
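To make "structured event representation" concrete, here is a minimal sketch of what an event graph over a video could look like. This summary does not describe the paper's actual schema, so every name here (`Event`, `EventGraph`, the `"before"` and `"causes"` relation labels) is a hypothetical chosen for illustration, not GraphThinker's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """One event node: a short description anchored to a time span (seconds)."""
    event_id: str
    description: str
    start: float
    end: float


@dataclass
class EventGraph:
    """Hypothetical event graph: nodes are events, edges are labelled relations."""
    events: dict = field(default_factory=dict)
    relations: list = field(default_factory=list)  # (src_id, relation, dst_id)

    def add_event(self, event: Event) -> None:
        self.events[event.event_id] = event

    def add_relation(self, src: str, relation: str, dst: str) -> None:
        # Refuse edges that point at events the graph does not contain.
        if src not in self.events or dst not in self.events:
            raise KeyError("both endpoints must be added before relating them")
        self.relations.append((src, relation, dst))

    def temporally_consistent(self) -> bool:
        """True if every 'before' edge agrees with the events' time spans."""
        return all(
            self.events[src].end <= self.events[dst].start
            for src, relation, dst in self.relations
            if relation == "before"
        )


graph = EventGraph()
graph.add_event(Event("e1", "person picks up a cup", 2.0, 4.5))
graph.add_event(Event("e2", "person drinks from the cup", 5.0, 8.0))
graph.add_relation("e1", "before", "e2")
graph.add_relation("e1", "causes", "e2")
print(graph.temporally_consistent())
```

Unlike a free-form dense caption, a structure of this kind lets an ordering claim be checked against the time spans it refers to, which is the sort of explicit constraint the abstract says unstructured text lacks.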
Findings
Over 4% improvement at IoU=0.3 for moment localization on RexTime.
9.8% reduction in temporal sequence hallucination on VidHalluc.
7.6% improvement in Binary QA accuracy by reducing action hallucination.
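For readers unfamiliar with the moment-localization metric, the sketch below shows how temporal IoU between a predicted and a ground-truth segment is commonly computed, and how a score at a threshold such as 0.3 can be derived from it. It assumes the reported number is the fraction of predictions whose IoU with the ground truth is at least the threshold; the summary does not spell this out, and the function names are illustrative.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) intervals in seconds."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    # Overlap is zero when the intervals are disjoint.
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_iou(preds, gts, threshold=0.3):
    """Fraction of predictions whose IoU with the ground truth meets the threshold.

    Assumption: one prediction per ground-truth segment, paired by position.
    """
    if not gts:
        return 0.0
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)


predictions = [(0.0, 10.0), (0.0, 2.0)]
ground_truth = [(5.0, 15.0), (10.0, 12.0)]
print(recall_at_iou(predictions, ground_truth, threshold=0.3))
```

In this toy case the first prediction overlaps its target with IoU 1/3 and counts as a hit, while the second does not overlap at all, so half of the predictions pass the 0.3 threshold.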
Abstract
Video reasoning requires a fine-grained understanding of the temporal dependencies and event-level relations between objects and events in videos. Current Multimodal Large Language Models (MLLMs) are prone to severe temporal hallucinations in video reasoning. An underlying cause of these hallucinations is weak visual-temporal grounding and the lack of explicit structure for modelling event relations. Models often rely on auxiliary text, such as dense captions, rather than explicitly anchoring their reasoning in actual visual evidence. However, these textual representations are inherently unstructured and fail to provide explicit causal constraints needed to guide the model's reasoning. In this work, we propose GraphThinker, a reinforcement finetuning method that constructs a structured event representation of a video and enforces visual grounding to jointly reduce reasoning…
