GIQ: Benchmarking 3D Geometric Reasoning of Vision Foundation Models with Simulated and Real Polyhedra
Mateusz Michalkiewicz, Anekha Sokhal, Tadeusz Michalkiewicz, Piotr Pawlikowski, Mahsa Baktashmotlagh, Varun Jampani, Guha Balakrishnan

TL;DR
This paper introduces GIQ, a benchmark for evaluating the geometric reasoning of vision models using synthetic and real polyhedra, revealing significant gaps in current models' understanding of 3D geometry.
Contribution
The paper presents GIQ, a new comprehensive benchmark for assessing geometric reasoning in vision models, highlighting their shortcomings in reconstructing and understanding complex 3D shapes.
Findings
Current models struggle with basic geometric shape reconstruction.
Foundation models show limited ability in detailed geometric differentiation.
Vision-language models have low accuracy in interpreting shape properties.
Abstract
Modern monocular 3D reconstruction methods and vision-language models (VLMs) demonstrate impressive results on standard benchmarks, yet recent works cast doubt on their true understanding of geometric properties. We introduce GIQ, a comprehensive benchmark specifically designed to evaluate the geometric reasoning capabilities of vision and vision-language foundation models. GIQ comprises synthetic and real-world images and corresponding 3D meshes of diverse polyhedra covering varying levels of complexity and symmetry, from Platonic, Archimedean, Johnson, and Catalan solids to stellations and compound shapes. Through systematic experiments involving monocular 3D reconstruction, 3D symmetry detection, mental rotation tests, and zero-shot shape classification tasks, we reveal significant shortcomings in current models. State-of-the-art reconstruction algorithms trained on extensive 3D…
Peer Reviews
Decision: ICLR 2026 Poster
1. The paper is well-motivated and addresses an important gap. Foundation models based on language and/or 2D images are unlikely to develop 3D understanding. The proposed benchmark demonstrates these limitations through controlled experiments. 2. The combination of synthetic and real images provides a well-controlled domain-shift test for evaluating geometric invariance versus appearance sensitivity.
1. The benchmark primarily evaluates 2D-pretrained encoders (e.g., CLIP, DINOv2, MAE, ConvNeXt) and 2D VLMs (e.g., GPT-4o, Claude, Gemini, LLaVA), which were never designed to represent 3D structure, so their poor performance is somewhat expected. The study highlights a limitation of 2D pretraining rather than establishing a hierarchy of geometry-aware capabilities. 2. Including geometry-native or 3D-aware models, such as multi-view pretrained networks (VGGT, DUSt3R, MASt3R, etc.) or 3D-VLMs (Po
I believe this is an interesting and meaningful paper. It sets a geometrical IQ (G-IQ) test for modern vision models and VLMs in 3D reconstruction. GIQ fills a clear gap: existing 3D datasets (Objaverse, OmniObject3D, GSO) test recognition and reconstruction, but not reasoning about geometry or symmetry. Polyhedra are a brilliant choice since they offer mathematically clean ground truth, structured complexity, and interpretability, and the dataset design (Mitsuba renders + paper mo
Since the paper’s premise involves geometric reasoning, a simple human baseline (even small-scale) on the same tasks would help contextualise the G-IQ test and show how far modern vision models are from human-level performance. For the zero-shot classification, it’s unclear how prompts and outputs were standardised. Did all models get the same prompt verbatim? Were answers normalised (e.g., “cube” vs. “hexahedron”)?
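To make the normalisation question concrete, the following is a minimal sketch of how free-form VLM answers could be mapped to canonical labels before scoring. It is not the paper's protocol: the alias table, the function names `normalise_answer` and `accuracy`, and the matching rule are all assumptions introduced here for illustration.

```python
import re
from typing import Optional

# Hypothetical alias table (an assumption, not taken from the paper):
# each surface form maps to one canonical label.
ALIASES = {
    "cube": "cube",
    "hexahedron": "cube",
    "regular hexahedron": "cube",
    "tetrahedron": "tetrahedron",
    "regular tetrahedron": "tetrahedron",
    "triangular pyramid": "tetrahedron",
}


def normalise_answer(raw: str) -> Optional[str]:
    """Map a free-form model answer to a canonical label, or None if unknown."""
    # Lowercase and replace everything except letters and whitespace.
    text = re.sub(r"[^a-z\s]", " ", raw.lower())
    # Collapse runs of whitespace.
    text = " ".join(text.split())
    # Try longer aliases first so multi-word names win over their substrings.
    for alias in sorted(ALIASES, key=len, reverse=True):
        if re.search(rf"\b{re.escape(alias)}\b", text):
            return ALIASES[alias]
    return None


def accuracy(predictions, labels):
    """Fraction of predictions whose normalised form equals the label."""
    hits = sum(normalise_answer(p) == l for p, l in zip(predictions, labels))
    return hits / len(labels)


if __name__ == "__main__":
    preds = ["It is a Hexahedron.", "tetrahedron!", "a sphere"]
    gold = ["cube", "tetrahedron", "cube"]
    print(accuracy(preds, gold))
```

Under such a scheme, "cube" and "hexahedron" count as the same answer, while an answer outside the alias table is scored as wrong rather than silently dropped.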
1. It provides the critical insight that a model's implicit ability to encode a feature (demonstrated by successful linear probing for symmetry) does not translate into explicit, robust geometric reasoning in other tasks. The results are quite interesting. 2. For GIQ, it uses polyhedra with well-defined properties (symmetry groups, face types) to provide precise, unambiguous ground truth for fine-grained geometric evaluation, which is lacking in large, existing 3D datasets. The constructed datase
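As a small illustration of why polyhedra give unambiguous ground truth, the sketch below counts vertices, edges, and faces from a face list and checks Euler's formula V - E + F = 2, which holds for convex polyhedra. This is a standard fact used here as an example, not a metric from the paper; the helper names and the hand-written face lists are assumptions.

```python
def mesh_counts(faces):
    """Return (V, E, F) for a mesh given as lists of vertex indices per face."""
    vertices = {v for face in faces for v in face}
    # Each edge joins consecutive vertices of a face; frozenset ignores direction,
    # so an edge shared by two faces is counted once.
    edges = {
        frozenset((face[i], face[(i + 1) % len(face)]))
        for face in faces
        for i in range(len(face))
    }
    return len(vertices), len(edges), len(faces)


def euler_characteristic(faces):
    """Compute V - E + F for the mesh."""
    v, e, f = mesh_counts(faces)
    return v - e + f


# Hand-written example meshes (illustrative, not from the GIQ dataset).
TETRAHEDRON = [[0, 1, 2], [0, 1, 3], [0, 2, 3], [1, 2, 3]]
CUBE = [
    [0, 1, 3, 2], [4, 5, 7, 6],  # z = 0 and z = 1
    [0, 1, 5, 4], [2, 3, 7, 6],  # y = 0 and y = 1
    [0, 2, 6, 4], [1, 3, 7, 5],  # x = 0 and x = 1
]

if __name__ == "__main__":
    for name, faces in [("tetrahedron", TETRAHEDRON), ("cube", CUBE)]:
        print(name, mesh_counts(faces), euler_characteristic(faces))
```

A reconstructed mesh whose counts disagree with the known values for its target solid can be flagged without any subjective judgement, which is the kind of precision the reviewer is pointing to.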
1. One concern is the potential VLM prompt ambiguity. The exact zero-shot prompt methodology for testing VLMs is not detailed in the provided text. The reported low accuracy could potentially be influenced by sub-optimal or ambiguous prompting rather than a pure geometric failure of the models. 2. While polyhedra are rigorous, their highly regular and stylized nature may not fully capture the complexity and irregularity of general, arbitrary objects found in the real world. One potential concern
Taxonomy
Topics: 3D Shape Modeling and Analysis · Robotics and Sensor-Based Localization · Morphological variations and asymmetry
