Which Way Does Time Flow? A Psychophysics-Grounded Evaluation for Vision-Language Models
Shiho Matta, Lis Kanashiro Pereira, Peitao Han, Fei Cheng, Shigeru Kitazawa

TL;DR
This paper introduces AoT-PsyPhyBENCH, a benchmark to evaluate vision-language models' ability to understand temporal direction in videos, revealing significant gaps compared to human performance.
Contribution
The authors present a new psychophysics-based benchmark for assessing temporal reasoning in vision-language models, highlighting their deficiencies in understanding physical and causal temporal cues.
Findings
Most models perform near chance in judging temporal direction.
Even the best models lag behind humans on irreversible physical processes.
Models lack the inductive biases for temporal continuity and causal reasoning.
Abstract
Modern vision-language models (VLMs) excel at many multimodal tasks, yet their grasp of temporal information in video remains weak and has not been adequately evaluated. We probe this gap with a deceptively simple but revealing challenge: judging the arrow of time (AoT), that is, whether a short clip is played forward or backward. We introduce AoT-PsyPhyBENCH, a psychophysically validated benchmark that tests whether VLMs can infer temporal direction in natural videos using the same stimuli and behavioral baselines established for humans. Our comprehensive evaluation of open-weight and proprietary, reasoning and non-reasoning VLMs reveals that most models perform near chance, and even the best model lags far behind human accuracy on physically irreversible processes (e.g., free fall, diffusion/explosion) and causal manual actions (division/addition) that humans recognize almost instantly. These…
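To make "near chance" concrete, the following is a minimal sketch of how a forward/backward judgment task could be scored. It is not the authors' evaluation code: the function names, the toy labels, and the choice of a one-sided exact binomial test against a 50% chance level are all assumptions made for illustration.

```python
from math import comb


def accuracy(predictions, labels):
    """Fraction of clips whose predicted direction matches the true one."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal length")
    correct = sum(p == t for p, t in zip(predictions, labels))
    return correct / len(labels)


def binomial_p_value(correct, total, chance=0.5):
    """One-sided exact binomial test: P(X >= correct) if the model guessed at `chance`."""
    return sum(
        comb(total, k) * chance**k * (1 - chance) ** (total - k)
        for k in range(correct, total + 1)
    )


# Toy data invented for this example (not benchmark results).
labels = ["forward", "backward", "forward", "backward",
          "forward", "backward", "forward", "backward"]
predictions = ["forward", "forward", "forward", "backward",
               "forward", "forward", "backward", "backward"]

acc = accuracy(predictions, labels)
n_correct = sum(p == t for p, t in zip(predictions, labels))
p_val = binomial_p_value(n_correct, len(labels))
print(f"accuracy={acc:.3f}, p-value vs. chance={p_val:.3f}")
```

With this toy data the model gets 5 of 8 clips right, and the large p-value shows that such a score is not distinguishable from guessing. The same two functions could be applied per category (for example, free fall or diffusion/explosion) to compare a model against a human baseline, assuming per-clip labels are available.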
