Seeing the Arrow of Time in Large Multimodal Models
Zihui Xue, Mi Luo, Kristen Grauman

TL;DR
This paper introduces ArrowRL, a reinforcement learning strategy that enhances large multimodal models' ability to understand the arrow of time in videos, significantly improving temporal comprehension and question answering accuracy.
Contribution
The paper proposes ArrowRL, a novel RL-based training method with reverse rewards, and introduces AoTBench, a new benchmark for evaluating temporal understanding in video models.
Findings
ArrowRL improves temporal perception in LMMs.
It yields significant accuracy gains on AoTBench and on VQA benchmarks.
The results highlight the importance of AoT understanding in video models.
Abstract
The Arrow of Time (AoT), time's irreversible flow shaping physical events, is fundamental to video comprehension, yet remains a significant challenge for modern large multimodal models (LMMs). Current LMMs struggle to perceive and utilize temporal directionality in video when responding to language queries, obstructing deeper temporal understanding. We tackle this deficiency by first providing a critical analysis of existing benchmarks and models. We then introduce ArrowRL, a reinforcement learning (RL)-based training strategy with an innovative reverse reward that instills AoT awareness by encouraging divergent video interpretations between forward and reversed visual frames. For rigorous evaluation, we additionally develop AoTBench, a new multi-faceted benchmark probing temporally challenging questions. Experiments show ArrowRL greatly advances temporal perception: it not only achieves…
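To make the reverse-reward idea concrete, here is a minimal sketch, not the authors' implementation. The abstract only says the reward encourages divergent interpretations of forward and reversed frames; it does not say how divergence is measured. Everything below is an assumption made for illustration: the function names `reverse_frames`, `token_jaccard`, and `reverse_reward` are hypothetical, and word-set Jaccard similarity stands in for whatever similarity measure the paper actually uses.

```python
from typing import List


def reverse_frames(frames: List[str]) -> List[str]:
    """Return the frames in reversed temporal order (frames are placeholders here)."""
    return frames[::-1]


def token_jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the lowercase word sets of two strings."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0  # two empty responses are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def reverse_reward(forward_response: str, reversed_response: str) -> float:
    """Hypothetical reward in [0, 1]: higher when the two responses diverge.

    Assumption: divergence is 1 minus word-set Jaccard similarity. This is a
    stand-in chosen for simplicity, not the measure from the paper.
    """
    return 1.0 - token_jaccard(forward_response, reversed_response)


if __name__ == "__main__":
    frames = ["f0", "f1", "f2"]
    print(reverse_frames(frames))  # ['f2', 'f1', 'f0']

    fwd = "a person opens the door"
    rev = "a person closes the door"
    print(reverse_reward(fwd, fwd))  # 0.0: identical answers earn no reward
    print(reverse_reward(fwd, rev))  # positive: the answers differ
```

In this sketch, a model that gives the same answer for the forward and reversed clip receives zero reward, while a model whose answers differ receives a positive one. In an actual RL setup this term would presumably be combined with a task-accuracy reward, but the abstract does not describe that combination.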
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
