Time Blindness: Why Video-Language Models Can't See What Humans Can?
Ujjwal Upadhyay, Mukul Ranjan, Zhiqiang Shen, Mohamed Elhoseiny

TL;DR
Video-language models excel at spatial understanding but struggle with purely temporal patterns, revealing a significant gap compared to human perception, especially in noise-like sequences.
Contribution
We introduce SpookyBench, a benchmark highlighting VLMs' inability to interpret temporal sequences without spatial cues, and analyze the limitations across models.
Findings
Humans recognize patterns in noise-like sequences with over 98% accuracy.
State-of-the-art VLMs achieve 0% accuracy on SpookyBench.
Temporal understanding in VLMs degrades faster than in humans under low spatial SNR.
Abstract
Recent advances in vision-language models (VLMs) have made impressive strides in understanding spatio-temporal relationships in videos. However, when spatial information is obscured, these models struggle to capture purely temporal patterns. We introduce SpookyBench, a benchmark where information is encoded solely in temporal sequences of noise-like frames, mirroring natural phenomena from biological signaling to covert communication. Interestingly, while humans can recognize shapes, text, and patterns in these sequences with over 98% accuracy, state-of-the-art VLMs achieve 0% accuracy. This performance gap highlights a critical limitation: an over-reliance on frame-level spatial features and an inability to extract meaning from temporal cues. Furthermore, when trained on datasets with low spatial signal-to-noise ratios (SNR), temporal understanding of models degrades more…
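To make the idea of "information encoded solely in temporal sequences of noise-like frames" concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the SpookyBench generator, whose construction is not described in this summary. The assumed scheme is that binary noise scrolls to the right inside a shape and to the left outside it, so every individual frame is unstructured noise while the motion between frames reveals the shape. The function names `make_temporal_noise_video` and `decode_by_motion` and all parameters are hypothetical.

```python
import numpy as np


def make_temporal_noise_video(mask, n_frames=16, shift=1, seed=0):
    """Hypothetical encoder (an assumption, not the paper's method).

    Binary noise scrolls right inside `mask` and left outside it.
    Each single frame is i.i.d. binary noise, so the shape is only
    recoverable by comparing frames over time.
    """
    rng = np.random.default_rng(seed)
    h, w = mask.shape
    base_fg = rng.integers(0, 2, size=(h, w), dtype=np.uint8)
    base_bg = rng.integers(0, 2, size=(h, w), dtype=np.uint8)
    frames = np.empty((n_frames, h, w), dtype=np.uint8)
    for t in range(n_frames):
        fg = np.roll(base_fg, t * shift, axis=1)   # foreground noise moves right
        bg = np.roll(base_bg, -t * shift, axis=1)  # background noise moves left
        frames[t] = np.where(mask, fg, bg)
    return frames


def decode_by_motion(frames, shift=1):
    """Recover the shape by asking, per pixel, which motion hypothesis
    (rightward or leftward) better predicts the next frame."""
    right = np.zeros(frames.shape[1:], dtype=float)
    left = np.zeros(frames.shape[1:], dtype=float)
    for prev, nxt in zip(frames[:-1], frames[1:]):
        right += (nxt == np.roll(prev, shift, axis=1))
        left += (nxt == np.roll(prev, -shift, axis=1))
    return right > left  # True where rightward motion explains the pixel better


if __name__ == "__main__":
    # A square "shape" on a 32x32 canvas.
    mask = np.zeros((32, 32), dtype=bool)
    mask[8:24, 8:24] = True

    frames = make_temporal_noise_video(mask)

    # Within one frame, the mean intensity inside and outside the shape
    # is statistically the same: there is no spatial cue to exploit.
    print("frame 0 mean inside :", frames[0][mask].mean())
    print("frame 0 mean outside:", frames[0][~mask].mean())

    # Across frames, a simple motion comparison recovers most of the shape
    # (pixels on the shape's vertical edges can be ambiguous).
    decoded = decode_by_motion(frames)
    print("pixel agreement with true mask:", (decoded == mask).mean())
```

In this toy setup a model that inspects frames one at a time has nothing to work with, whereas even a crude comparison of consecutive frames recovers the shape. That is the kind of gap between frame-level spatial features and temporal cues that the abstract describes.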
