Assessing Situational and Spatial Awareness of VLMs with Synthetically Generated Video

Pascal Benschop; Justin Dauwels; Jan van Gemert

arXiv:2601.15780·cs.CV·January 23, 2026

Open Access

TL;DR

This paper introduces a synthetic video benchmark that evaluates the spatial and situational awareness of vision language models. The evaluated models perform only slightly above chance, and a simple aid, stable color cues, only partly reduces assailant role confusion.

Contribution

The paper presents a novel synthetic benchmark for assessing VLMs' spatial and situational reasoning, highlighting their weaknesses and providing tools for further research.

Findings

01

VLMs perform only slightly above chance on the benchmark.

02

Stable color cues can partly reduce role confusion.

03

The benchmark is applicable to any video classification model.

Abstract

Spatial reasoning in vision language models (VLMs) remains fragile when semantics hinge on subtle temporal or geometric cues. We introduce a synthetic benchmark that probes two complementary skills: situational awareness (recognizing whether an interaction is harmful or benign) and spatial awareness (tracking who does what to whom, and reasoning about relative positions and motion). Through minimal video pairs, we test three challenges: distinguishing violence from benign activity, binding assailant roles across viewpoints, and judging fine-grained trajectory alignment. While we evaluate recent VLMs in a training-free setting, the benchmark is applicable to any video classification model. Results show performance only slightly above chance across tasks. A simple aid, stable color cues, partly reduces assailant role confusions but does not resolve the underlying weakness. By releasing…
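
To make "slightly above chance" on minimal video pairs concrete, here is a minimal sketch of how such pairs could be scored for any video classifier. It is an illustration, not the paper's scoring protocol: the `MinimalPair` structure, the `score_pairs` helper, the label strings, and the clip identifiers are all hypothetical, and the paper may report different metrics. The sketch assumes a binary task in which the two clips of a pair carry opposite labels.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class MinimalPair:
    """Two clips that differ only in the cue under test (hypothetical structure)."""
    video_a: str  # identifier or path of the first clip
    label_a: str
    video_b: str  # identifier or path of the second clip
    label_b: str


def score_pairs(
    pairs: Sequence[MinimalPair],
    classify: Callable[[str], str],
) -> Tuple[float, float]:
    """Return (per-video accuracy, pair accuracy).

    Pair accuracy counts a pair as correct only if both clips are
    classified correctly, so a label-biased classifier cannot score well.
    """
    if not pairs:
        raise ValueError("need at least one pair")
    video_hits = 0
    pair_hits = 0
    for p in pairs:
        ok_a = classify(p.video_a) == p.label_a
        ok_b = classify(p.video_b) == p.label_b
        video_hits += int(ok_a) + int(ok_b)
        pair_hits += int(ok_a and ok_b)
    n = len(pairs)
    return video_hits / (2 * n), pair_hits / n


# Hypothetical data: each pair holds one violent and one benign clip.
pairs = [
    MinimalPair("clip_1a", "violent", "clip_1b", "benign"),
    MinimalPair("clip_2a", "violent", "clip_2b", "benign"),
]


def always_benign(video: str) -> str:
    """A degenerate classifier that ignores the video entirely."""
    return "benign"


def oracle(video: str) -> str:
    """A perfect classifier for the hypothetical naming scheme above."""
    return "violent" if video.endswith("a") else "benign"


video_acc, pair_acc = score_pairs(pairs, always_benign)
print(video_acc, pair_acc)  # 0.5 0.0
```

Under these assumptions, a classifier that always answers "benign" reaches 0.5 per-video accuracy but 0.0 pair accuracy, while independent uniform guessing gives 0.5 and 0.25 in expectation. Reporting both numbers is one way to separate genuine sensitivity to the manipulated cue from a bias toward one label.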

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Social Robot Interaction and HRI