ArrowGEV: Grounding Events in Video via Learning the Arrow of Time

Fangxu Yu; Ziyao Lu; Liqiang Niu; Fandong Meng; Jie Zhou

arXiv:2601.06559 · cs.CV · April 17, 2026

TL;DR

ArrowGEV is a reinforcement learning framework that explicitly models temporal directionality in events, an idea inspired by the arrow of time in physics, to improve both event grounding and temporal directionality understanding in Vision Language Models (VLMs).

Contribution

ArrowGEV trains VLMs to model the temporal directionality of events, rather than only associating events with timestamps in the forward video, with the aim of improving event grounding and reasoning in videos.

Findings

01  Improves grounding precision in videos.

02  Enhances recognition of temporal directionality.

03  Boosts overall video understanding and reasoning.

Abstract

Grounding events in videos serves as a fundamental capability in video analysis. While Vision Language Models (VLMs) are increasingly employed for this task, existing approaches predominantly train models to associate events with timestamps in the forward video only. This paradigm hinders VLMs from capturing the inherent temporal structure and directionality of events, thereby limiting robustness and generalization. To address this limitation, inspired by the arrow of time in physics, which characterizes the intrinsic directionality of temporal processes, we propose ArrowGEV, a reinforcement learning framework that explicitly models temporal directionality in events to improve both event grounding and temporal directionality understanding in VLMs. Specifically, we categorize events into time-sensitive (e.g., putting down a bag) and time-insensitive (e.g., holding a towel in the left…
