Assessing Situational and Spatial Awareness of VLMs with Synthetically Generated Video
Pascal Benschop, Justin Dauwels, Jan van Gemert

TL;DR
This paper introduces a synthetic video benchmark that evaluates the spatial and situational awareness of vision language models (VLMs). Recent VLMs score only slightly above chance, and a simple aid, stable color cues, only partly reduces role confusion.
Contribution
The paper presents a novel synthetic benchmark for assessing VLMs' spatial and situational reasoning, highlighting their weaknesses and providing tools for further research.
Findings
Recent VLMs, evaluated in a training-free setting, perform only slightly above chance across tasks.
Stable color cues partly reduce assailant role confusion but do not resolve the underlying weakness.
The benchmark is applicable to any video classification model.
Abstract
Spatial reasoning in vision language models (VLMs) remains fragile when semantics hinge on subtle temporal or geometric cues. We introduce a synthetic benchmark that probes two complementary skills: situational awareness (recognizing whether an interaction is harmful or benign) and spatial awareness (tracking who does what to whom, and reasoning about relative positions and motion). Through minimal video pairs, we test three challenges: distinguishing violence from benign activity, binding assailant roles across viewpoints, and judging fine-grained trajectory alignment. While we evaluate recent VLMs in a training-free setting, the benchmark is applicable to any video classification model. Results show performance only slightly above chance across tasks. A simple aid, stable color cues, partly reduces assailant role confusions but does not resolve the underlying weakness. By releasing…
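The abstract's "minimal video pairs" and "slightly above chance" can be made concrete with a small scoring sketch. This is an illustration under stated assumptions, not the authors' evaluation code: it assumes binary labels (so chance is 0.5 per clip), and the `MinimalPair` schema, the `predict` callable, and the metric names are all hypothetical. It reports per-clip accuracy, the stricter pair accuracy (both clips of a pair correct), and an exact one-sided binomial p-value against chance.

```python
from dataclasses import dataclass
from math import comb
from typing import Callable, Dict, Sequence


@dataclass(frozen=True)
class MinimalPair:
    """Two clips that differ in one cue and carry opposite labels (hypothetical schema)."""
    video_a: str
    label_a: int
    video_b: str
    label_b: int


def binomial_tail(k: int, n: int, p: float = 0.5) -> float:
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def evaluate_pairs(
    pairs: Sequence[MinimalPair],
    predict: Callable[[str], int],
) -> Dict[str, float]:
    """Score any clip-level classifier on minimal pairs (assumes binary labels)."""
    if not pairs:
        raise ValueError("need at least one pair")
    correct_clips = 0
    correct_pairs = 0
    for pair in pairs:
        hit_a = predict(pair.video_a) == pair.label_a
        hit_b = predict(pair.video_b) == pair.label_b
        correct_clips += int(hit_a) + int(hit_b)
        correct_pairs += int(hit_a and hit_b)
    n_clips = 2 * len(pairs)
    return {
        "video_accuracy": correct_clips / n_clips,
        "pair_accuracy": correct_pairs / len(pairs),
        # One-sided test: how likely is this many correct clips under pure guessing?
        "p_value_vs_chance": binomial_tail(correct_clips, n_clips, 0.5),
    }
```

Pair accuracy is the more telling number in this sketch: a model that always answers with the same class still reaches 0.5 per-clip accuracy on balanced pairs, but its pair accuracy is 0.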
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Social Robot Interaction and HRI
