ReVSI: Rebuilding Visual Spatial Intelligence Evaluation for Accurate Assessment of VLM 3D Reasoning

Yiming Zhang; Jiacheng Chen; Jiaqi Tan; Yongsen Mao; Wenhu Chen; Angel X. Chang

arXiv:2604.24300·cs.CV·May 7, 2026

TL;DR

ReVSI is a benchmark and evaluation protocol for visual spatial reasoning in vision-language models (VLMs). It makes evaluation more accurate and more diagnostic by re-annotating the underlying data and by accounting for the limited inputs a model actually receives.

Contribution

The paper introduces a re-annotated, bias-mitigated benchmark with multiple frame variants and object visibility metadata, making spatial intelligence evaluation more valid and more diagnostic.

Findings

01. VLMs exhibit systematic failure modes on ReVSI.

02. ReVSI yields a more accurate assessment of spatial reasoning.

03. The benchmark enables controlled analysis across different frame budgets.

Abstract

Current evaluations of spatial intelligence can be systematically invalid under modern vision-language model (VLM) settings. First, many benchmarks derive question-answer (QA) pairs from point-cloud-based 3D annotations originally curated for traditional 3D perception. When such annotations are treated as ground truth for video-based evaluation, reconstruction and annotation artifacts can miss objects that are clearly visible in the video, mislabel object identities, or corrupt geometry-dependent answers (e.g., size), yielding incorrect or ambiguous QA pairs. Second, evaluations often assume full-scene access, while many VLMs operate on sparsely sampled frames (e.g., 16-64), making many questions effectively unanswerable under the actual model inputs. We improve evaluation validity by introducing ReVSI, a benchmark and protocol that ensures each QA pair is answerable and correct under…
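To illustrate the second issue, here is a minimal sketch (not the ReVSI protocol itself) of how a frame budget can make a question unanswerable. The uniform sampler, the `is_answerable` helper, and the per-object visibility sets are all hypothetical; the summary above only states that the benchmark provides object visibility metadata, not what format it takes.

```python
from typing import Dict, Iterable, List, Set


def uniform_frame_indices(num_frames: int, budget: int) -> List[int]:
    """Pick `budget` evenly spaced frame indices from a video of `num_frames` frames."""
    if budget <= 0 or num_frames <= 0:
        return []
    if budget >= num_frames:
        return list(range(num_frames))
    step = num_frames / budget
    # Take the centre of each of the `budget` equal-length segments.
    return [int(step * i + step / 2) for i in range(budget)]


def is_answerable(
    required_objects: Iterable[str],
    visible_frames: Dict[str, Set[int]],  # hypothetical metadata: object -> frames where it is visible
    sampled: Iterable[int],
) -> bool:
    """A question counts as answerable only if every object it depends on
    appears in at least one of the frames the model actually receives."""
    sampled_set = set(sampled)
    return all(
        bool(sampled_set & visible_frames.get(obj, set()))
        for obj in required_objects
    )


# Made-up example: the lamp is visible in only six frames of a 1000-frame video.
visibility = {
    "chair": set(range(0, 40)),
    "lamp": set(range(490, 496)),
}

for budget in (16, 64):
    frames = uniform_frame_indices(1000, budget)
    print(budget, is_answerable(["chair", "lamp"], visibility, frames))
```

In this made-up case the same question about the chair and the lamp is unanswerable at a 16-frame budget and answerable at 64 frames, which is the kind of mismatch between full-scene ground truth and actual model inputs that the abstract describes.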

Code & Models

Repositories

3dlg-hcvc/revsi
github

Datasets

3dlg-hcvc/ReVSI
dataset · 1.7k dl