Stress Tests REVEAL Fragile Temporal and Visual Grounding in Video-Language Models

Sethuraman T V; Savya Khosla; Aditi Tiwari; Vidya Ganesh; Rakshana Jayaprakash; Aditya Jain; Vignesh Srinivasakumar; Onkar Kishor Susladkar; Srinidhi Sunkara; Aditya Shanmugham; Rakesh Vaideeswaran; Abbaas Alif Mohamed Nishar; Simon Jenni; Derek Hoiem

arXiv:2602.11244·cs.CV·February 13, 2026

Open Access

TL;DR

This paper introduces REVEAL, a diagnostic benchmark whose stress tests expose significant weaknesses in current Video-Language Models (VidLMs): the models often fail to account for video content, temporal order, and motion.

Contribution

The paper presents a new diagnostic benchmark with automated data generation to evaluate and expose fundamental weaknesses in contemporary VidLMs.

Findings

01

Models often misinterpret reversed scenes as forward

02

Models answer questions while ignoring video content

03

Models struggle with camera motion and spatiotemporal occlusion

Abstract

This work investigates a fundamental question: Do Video-Language Models (VidLMs) robustly account for video content, temporal sequence, and motion? Our investigation shows that, surprisingly, they often do not. We introduce REVEAL, a diagnostic benchmark that probes fundamental weaknesses of contemporary VidLMs through five controlled stress tests: temporal expectation bias, reliance on language-only shortcuts, video sycophancy, camera motion sensitivity, and robustness to spatiotemporal occlusion. We test leading open- and closed-source VidLMs and find that these models confidently describe reversed scenes as forward, answer questions while neglecting video content, agree with false claims, struggle with basic camera motion, and fail to aggregate temporal information under simple spatiotemporal masking. Humans, on the other hand, succeed at these tasks with ease. Alongside…
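
To make two of the perturbations named in the abstract concrete, here is a minimal sketch of a reversed clip and a simple spatiotemporal mask. It is an illustration only: this page does not describe how REVEAL actually generates its data, so the function names (`reverse_clip`, `occlude_clip`), the toy frame representation, and the sliding-strip mask are all assumptions rather than the authors' pipeline.

```python
from typing import List

# Toy grayscale frame: a list of rows, each row a list of pixel intensities.
# (Assumed representation for illustration; real pipelines use image arrays.)
Frame = List[List[int]]


def reverse_clip(frames: List[Frame]) -> List[Frame]:
    """Return the frames in reverse temporal order (the input list is not modified)."""
    return frames[::-1]


def occlude_clip(frames: List[Frame], visible_width: int, fill: int = -1) -> List[Frame]:
    """Hide everything except a vertical strip that slides right by one column per frame.

    No single frame shows the whole scene, so a viewer has to combine
    information across time to recover it. The strip wraps around at the
    right edge. This masking scheme is hypothetical, not taken from the paper.
    """
    occluded = []
    for t, frame in enumerate(frames):
        width = len(frame[0])
        start = t % width
        visible = {(start + k) % width for k in range(visible_width)}
        occluded.append(
            [[px if x in visible else fill for x, px in enumerate(row)] for row in frame]
        )
    return occluded


# Three 2x4 frames; the pixel value encodes (frame index * 10 + column).
clip = [[[t * 10 + x for x in range(4)] for _ in range(2)] for t in range(3)]

reversed_clip = reverse_clip(clip)
masked_clip = occlude_clip(clip, visible_width=2)
print(masked_clip[1][0])  # first row of the second frame: [-1, 11, 12, -1]
```

A temporal-order stress test in this spirit would show a model `reversed_clip` and check whether its description still assumes forward playback; an occlusion test would check whether answers on `masked_clip` match those on the unmasked `clip`.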

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition