CVT-Bench: Counterfactual Viewpoint Transformations Reveal Unstable Spatial Representations in Multimodal LLMs

Shanmukha Vellamcheti; Uday Kiran Kothapalli; Disharee Bhowmick; Sathyanarayanan N. Aakur

arXiv:2603.21114·cs.CV·March 24, 2026

Open Access

TL;DR

This paper introduces CVT-Bench, a benchmark for evaluating the stability of spatial representations in multimodal large language models under counterfactual viewpoint changes, revealing their instability despite high single-view accuracy.

Contribution

The paper presents a diagnostic benchmark for assessing relational consistency in MLLMs under hypothetical viewpoint transformations, and shows that more structured input representations yield more stable spatial reasoning.

Findings

01. MLLMs show systematic degradation under viewpoint changes.

02. Increasing representational structure improves stability.

03. Single-view accuracy overestimates spatial robustness.

Abstract

Multimodal large language models (MLLMs) achieve strong performance on single-view spatial reasoning tasks, yet it remains unclear whether they maintain stable spatial state representations under counterfactual viewpoint changes. We introduce a controlled diagnostic benchmark that evaluates relational consistency under hypothetical camera orbit transformations without re-rendering images. Across 100 synthetic scenes and 6,000 relational queries, we measure viewpoint consistency, 360° cycle agreement, and relational stability over sequential transformations. Despite high single-view accuracy, state-of-the-art MLLMs exhibit systematic degradation under counterfactual viewpoint changes, with frequent violations of cycle consistency and rapid decay in relational stability. We further evaluate multiple input representations, visual input, textual bounding boxes, and structured scene…
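To make the idea of a counterfactual camera orbit and 360° cycle agreement concrete, here is a minimal sketch. It is not the paper's implementation: the ground-plane coordinate convention, the left/right relation, and the function names `orbit_view`, `left_right`, and `cycle_agreement` are all assumptions introduced for illustration. It shows that geometric ground truth returns to the same relation after a full orbit, which is the property a model's answers can be checked against.

```python
import math


def orbit_view(point, angle_deg):
    """Ground-plane (x, z) coordinates of a point after a hypothetical camera orbit.

    Assumption: the camera orbits the scene centre, so orbiting the camera by
    +angle is modelled as rotating the scene by -angle about the origin.
    """
    x, z = point
    t = math.radians(angle_deg)
    xr = x * math.cos(t) + z * math.sin(t)
    zr = -x * math.sin(t) + z * math.cos(t)
    return (xr, zr)


def left_right(a, b, angle_deg=0.0):
    """Return 'left' if object a appears left of object b from the orbited view."""
    ax, _ = orbit_view(a, angle_deg)
    bx, _ = orbit_view(b, angle_deg)
    return "left" if ax < bx else "right"


def cycle_agreement(answers_at_0, answers_at_360):
    """Fraction of queries answered identically at 0 and 360 degrees."""
    pairs = list(zip(answers_at_0, answers_at_360))
    return sum(p == q for p, q in pairs) / len(pairs)


# Hypothetical two-object scene: a sits to the left of b in the original view.
a, b = (-1.0, 0.0), (1.0, 0.0)
truth_0 = left_right(a, b, 0)      # original viewpoint
truth_180 = left_right(a, b, 180)  # relation flips from the opposite side
truth_360 = left_right(a, b, 360)  # a full orbit restores the original relation
```

Under these assumptions, a model would be queried for the same relation at each hypothetical angle, and its answers at 0° and 360° passed to `cycle_agreement`. Any score below 1.0 indicates a cycle-consistency violation, since the geometry itself is unchanged by a full orbit.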

Taxonomy

Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Domain Adaptation and Few-Shot Learning