CVT-Bench: Counterfactual Viewpoint Transformations Reveal Unstable Spatial Representations in Multimodal LLMs
Shanmukha Vellamcheti, Uday Kiran Kothapalli, Disharee Bhowmick, Sathyanarayanan N. Aakur

TL;DR
This paper introduces CVT-Bench, a benchmark for evaluating how stable the spatial representations of multimodal large language models (MLLMs) remain under counterfactual viewpoint changes. It finds that these representations are unstable even when single-view accuracy is high.
Contribution
The paper presents a diagnostic benchmark for assessing relational consistency in MLLMs under hypothetical viewpoint transformations, and shows that the degree of structure in the input representation affects stability.
Findings
MLLMs show systematic degradation under viewpoint changes.
Increasing representational structure improves stability.
Single-view accuracy overestimates spatial robustness.
Abstract
Multimodal large language models (MLLMs) achieve strong performance on single-view spatial reasoning tasks, yet it remains unclear whether they maintain stable spatial state representations under counterfactual viewpoint changes. We introduce a controlled diagnostic benchmark that evaluates relational consistency under hypothetical camera orbit transformations without re-rendering images. Across 100 synthetic scenes and 6,000 relational queries, we measure viewpoint consistency, 360° cycle agreement, and relational stability over sequential transformations. Despite high single-view accuracy, state-of-the-art MLLMs exhibit systematic degradation under counterfactual viewpoint changes, with frequent violations of cycle consistency and rapid decay in relational stability. We further evaluate multiple input representations: visual input, textual bounding boxes, and structured scene…
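To make the idea of a counterfactual camera orbit and 360° cycle agreement concrete, here is a minimal Python sketch. It is not the benchmark's implementation. It assumes objects are points on a 2D ground plane, that the camera initially looks along +y with image-right along +x, and that a counterclockwise orbit of the camera is equivalent to rotating the scene the opposite way. The function names (`to_camera_frame`, `lateral_relation`, `cycle_agreement`) are hypothetical.

```python
import math

def to_camera_frame(point, orbit_deg):
    """Express a ground-plane point (x, y) in the frame of a camera that has
    orbited counterclockwise by orbit_deg around the scene origin.
    Assumption: orbiting the camera by +t equals rotating the scene by -t."""
    t = math.radians(orbit_deg)
    x, y = point
    return (x * math.cos(t) + y * math.sin(t),
            -x * math.sin(t) + y * math.cos(t))

def lateral_relation(a, b, orbit_deg, eps=1e-9):
    """Return 'left', 'right', or 'aligned' for object a relative to object b,
    as seen from the orbited camera (hypothetical ground-truth rule)."""
    ax, _ = to_camera_frame(a, orbit_deg)
    bx, _ = to_camera_frame(b, orbit_deg)
    if abs(ax - bx) < eps:
        return "aligned"
    return "left" if ax < bx else "right"

def cycle_agreement(answers_at_0, answers_at_360):
    """Fraction of queries answered identically at 0 and 360 degrees.
    A full orbit returns to the start, so a consistent model scores 1.0."""
    matches = sum(p == q for p, q in zip(answers_at_0, answers_at_360))
    return matches / len(answers_at_0)

# Two objects side by side: the relation flips at 180 degrees and
# returns to its original value after a full 360 degree orbit.
a, b = (-1.0, 0.0), (1.0, 0.0)
relations = {deg: lateral_relation(a, b, deg) for deg in (0, 180, 360)}
```

In this toy setup the geometric ground truth is cycle-consistent by construction, so any disagreement between a model's answers at 0° and 360° reflects the model's instability rather than ambiguity in the scene.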
Taxonomy
Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Domain Adaptation and Few-Shot Learning
