Tracing the Arrow of Time: Diagnosing Temporal Information Flow in Video-LLMs
Peitao Han, Fei Cheng, Lis K. Pereira, Qianying Liu, Shigeru Kitazawa

TL;DR
This paper investigates how Video-LLMs process temporal information, identifies where that information is lost between the vision encoder, projector, and LLM, and proposes architectural improvements that enable the model to surpass human performance on the Arrow-of-Time (AoT) task.
Contribution
It isolates the sources of temporal information loss in Video-LLMs and introduces a new architecture with temporal-aware encoding and preserved information transfer, achieving state-of-the-art results.
Findings
Video-centric encoders encode strong temporal signals; frame-centric encoders do not.
Projector design critically affects temporal information transfer.
The proposed architecture surpasses human performance on the AoT task, reaching 98.1% accuracy.
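The encoder finding can be illustrated with a toy linear probe. The sketch below is not the paper's method or data: it uses synthetic "clips" whose frames drift along a fixed direction, and it stands in for a frame-centric encoder with order-invariant mean pooling and for a video-centric encoder with mean frame-to-frame differences. Both stand-ins, and every name in the code (`make_clips`, `fit_probe`, `probe_accuracy`), are assumptions made for illustration only.

```python
import numpy as np

def make_clips(n=2000, t=8, d=16, seed=0):
    """Synthetic clips (hypothetical data): frames drift along one fixed direction."""
    rng = np.random.default_rng(seed)
    drift = rng.normal(size=d)
    drift /= np.linalg.norm(drift)
    start = rng.normal(size=(n, 1, d))
    steps = np.arange(t).reshape(1, t, 1)
    clips = start + steps * drift + 0.1 * rng.normal(size=(n, t, d))
    labels = rng.integers(0, 2, size=n)  # 1 = forward, 0 = reversed
    clips[labels == 0] = clips[labels == 0][:, ::-1]  # play these clips backward
    return clips, labels

def fit_probe(x, y, steps=300, lr=0.1):
    """Logistic-regression probe trained with plain gradient descent."""
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        g = p - y
        w -= lr * (x.T @ g) / len(y)
        b -= lr * g.mean()
    return w, b

def probe_accuracy(feats, labels, n_train=1000):
    """Standardize with train statistics, fit on the first half, score on the rest."""
    mu = feats[:n_train].mean(axis=0)
    sd = feats[:n_train].std(axis=0) + 1e-8
    x = (feats - mu) / sd
    w, b = fit_probe(x[:n_train], labels[:n_train])
    pred = (x[n_train:] @ w + b) > 0
    return float((pred == labels[n_train:]).mean())

clips, labels = make_clips()

# Assumed "frame-centric" stand-in: mean pooling over frames discards frame order.
frame_feats = clips.mean(axis=1)
# Assumed "video-centric" stand-in: mean temporal difference keeps the direction of change.
video_feats = (clips[:, 1:] - clips[:, :-1]).mean(axis=1)

acc_frame = probe_accuracy(frame_feats, labels)
acc_video = probe_accuracy(video_feats, labels)
print(f"frame-centric stand-in: {acc_frame:.3f}")
print(f"video-centric stand-in: {acc_video:.3f}")
```

Mean pooling is exactly invariant to reversing the frames, so the first probe has no signal to learn from, whereas the sign of the temporal difference flips with playback direction. Real frame-centric encoders are not necessarily order-invariant in this strict sense; the toy only shows why a representation without explicit temporal modeling can leave a probe near chance.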
Abstract
The Arrow-of-Time (AoT) task, determining whether a video plays forward or backward by recognizing temporal irreversibility, is one humans solve with near-perfect accuracy, yet frontier Video Large Language Models (Video-LLMs) perform only modestly above chance. This gap raises a key question: do visual backbones fail to encode temporal information, or does the information bottleneck lie elsewhere in the Video-LLM architecture? We address this question by isolating the vision encoder from the Video-LLM and tracing temporal information across the encoder, projector, and LLM. We find that video-centric encoders with explicit temporal modeling encode strong temporal signals, whereas frame-centric encoders do not. However, when video-centric representations are passed through a standard Video-LLM architecture, performance often collapses, revealing a bottleneck in temporal information flow. We…
