Time Blindness: Why Video-Language Models Can't See What Humans Can?

Ujjwal Upadhyay; Mukul Ranjan; Zhiqiang Shen; Mohamed Elhoseiny

arXiv:2505.24867 · cs.CV · April 30, 2026

TL;DR

Video-language models excel at spatial understanding but struggle with purely temporal patterns, revealing a significant gap compared to human perception, especially in noise-like sequences.

Contribution

We introduce SpookyBench, a benchmark highlighting VLMs' inability to interpret temporal sequences without spatial cues, and analyze the limitations across models.

Findings

01. Humans recognize patterns in noise-like sequences with over 98% accuracy.

02. State-of-the-art VLMs achieve 0% accuracy on SpookyBench.

03. Temporal understanding in VLMs degrades faster than in humans under low spatial SNR.

Abstract

Recent advances in vision-language models (VLMs) have made impressive strides in understanding spatio-temporal relationships in videos. However, when spatial information is obscured, these models struggle to capture purely temporal patterns. We introduce SpookyBench, a benchmark where information is encoded solely in temporal sequences of noise-like frames, mirroring natural phenomena from biological signaling to covert communication. Interestingly, while humans can recognize shapes, text, and patterns in these sequences with over 98% accuracy, state-of-the-art VLMs achieve 0% accuracy. This performance gap highlights a critical limitation: an over-reliance on frame-level spatial features and an inability to extract meaning from temporal cues. Furthermore, when trained on datasets with low spatial signal-to-noise ratios (SNR), temporal understanding of models degrades more…
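To make the idea of information "encoded solely in temporal sequences of noise-like frames" concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the SpookyBench generation procedure: the encoding rule (masked pixels flip on every frame, all other pixels are redrawn at random), the function names `make_frames` and `decode`, and the 5x5 `MASK` are all hypothetical. Under this rule every individual frame is uniform random noise, so the shape cannot be read from any single frame, yet it is recoverable from how pixels change over time.

```python
import random

def make_frames(mask, num_frames, seed=0):
    """Return binary frames in which each single frame looks like uniform noise.

    Assumed encoding (illustrative only, not the paper's method): pixels inside
    `mask` flip on every frame, while pixels outside it are redrawn at random.
    """
    rng = random.Random(seed)
    height, width = len(mask), len(mask[0])
    # The first frame is pure noise, so it carries no spatial trace of the mask.
    frames = [[[rng.randint(0, 1) for _ in range(width)] for _ in range(height)]]
    for _ in range(num_frames - 1):
        prev = frames[-1]
        frame = [
            [1 - prev[y][x] if mask[y][x] else rng.randint(0, 1) for x in range(width)]
            for y in range(height)
        ]
        frames.append(frame)
    return frames

def decode(frames):
    """Recover the mask by marking pixels that changed at every frame transition."""
    height, width = len(frames[0]), len(frames[0][0])
    transitions = len(frames) - 1
    return [
        [
            int(sum(frames[t][y][x] != frames[t + 1][y][x]
                    for t in range(transitions)) == transitions)
            for x in range(width)
        ]
        for y in range(height)
    ]

# A tiny hypothetical 5x5 "plus" shape used as the hidden pattern.
MASK = [
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]

frames = make_frames(MASK, num_frames=32, seed=0)
recovered = decode(frames)
```

The decoder uses only per-pixel change statistics across frames and never inspects the spatial layout of a single frame, which is the kind of cue the abstract says current VLMs fail to exploit. A background pixel would be mistaken for foreground only if it happened to change on all 31 transitions, which is vanishingly unlikely at this sequence length.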
