Hypo3D: Exploring Hypothetical Reasoning in 3D
Ye Mao, Weixun Luo, Junpeng Jing, Anlan Qiu, Krystian Mikolajczyk

TL;DR
Hypo3D is a benchmark that evaluates whether vision-language models can reason about hypothetical changes to 3D scenes without access to real-time scene data, and it highlights the limitations of current models on such tasks.
Contribution
This paper presents Hypo3D, the first benchmark for 3D hypothetical reasoning, and demonstrates a significant performance gap between state-of-the-art models and humans on this task.
Findings
State-of-the-art models perform poorly on Hypo3D tasks.
Models often fail to accurately reason about scene changes.
Humans outperform models significantly in hypothetical 3D reasoning.
Abstract
The rise of vision-language foundation models marks an advancement in bridging the gap between human and machine capabilities in 3D scene reasoning. Existing 3D reasoning benchmarks assume real-time scene accessibility, which is impractical due to the high cost of frequent scene updates. To this end, we introduce Hypothetical 3D Reasoning, namely Hypo3D, a benchmark designed to evaluate models' ability to reason without access to real-time scene data. Models need to imagine the scene state based on a provided change description before reasoning. Hypo3D is formulated as a 3D Visual Question Answering (VQA) benchmark, comprising 7,727 context changes across 700 indoor scenes, resulting in 14,885 question-answer pairs. An anchor-based world frame is established for all scenes, ensuring consistent reference to a global frame for directional terms in context changes and QAs. Extensive…
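The task formulation in the abstract (a scene, a described change, then a question about the imagined result) can be sketched as a small evaluation loop. This is a minimal illustration only, not the authors' data format or metric: the field names, the prompt wording, the `model` callable, and the use of exact-match scoring are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class HypoSample:
    """One hypothetical-reasoning item (field names are illustrative assumptions)."""
    scene_id: str   # identifier of the indoor scene
    change: str     # natural-language description of the context change
    question: str   # question to answer about the scene after the change
    answer: str     # reference answer


def build_prompt(sample: HypoSample) -> str:
    """Combine the change description and question into one text prompt.

    The scene itself (e.g. a point cloud or rendered view) would be passed
    to the model separately; that part is omitted in this sketch.
    """
    return (
        f"Scene: {sample.scene_id}\n"
        f"Assume this change has been made: {sample.change}\n"
        f"Question: {sample.question}"
    )


def exact_match(prediction: str, reference: str) -> bool:
    """Case- and whitespace-insensitive string equality (assumed metric)."""
    return prediction.strip().lower() == reference.strip().lower()


def evaluate(samples: Sequence[HypoSample], model: Callable[[str], str]) -> float:
    """Return the fraction of samples the model answers exactly right."""
    if not samples:
        return 0.0
    correct = sum(exact_match(model(build_prompt(s)), s.answer) for s in samples)
    return correct / len(samples)
```

Here `model` stands for any function that maps a prompt string to an answer string, so a real vision-language model could be wrapped behind it. Exact match is only one possible scoring choice and was picked here for simplicity.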
Taxonomy
Topics: Semantic Web and Ontologies
