GamiBench: Evaluating Spatial Reasoning and 2D-to-3D Planning Capabilities of MLLMs with Origami Folding Tasks

Ryan Spencer; Roey Yaari; Ritvik Vemavarapu; Joyce Yang; Steven Ngo; Utkarsh Sharma

arXiv:2512.22207·cs.AI·December 30, 2025

TL;DR

GamiBench is a comprehensive benchmark for evaluating multimodal large language models' spatial reasoning and 2D-to-3D planning abilities using origami folding tasks, highlighting current model limitations.

Contribution

Introduces GamiBench, a novel benchmark with new metrics for assessing spatial reasoning and 2D-to-3D planning in MLLMs through origami-inspired tasks.

Findings

01. Leading models like GPT-5 and Gemini-2.5-Pro struggle with spatial understanding.

02. GamiBench measures cross-view consistency and physical feasibility.

03. The benchmark provides a standardized framework for geometric reasoning evaluation.

Abstract

Multimodal large language models (MLLMs) are proficient in perception and instruction-following, but they still struggle with spatial reasoning: the ability to mentally track and manipulate objects across multiple views and over time. Spatial reasoning is a key component of human intelligence, but most existing benchmarks focus on static images or final outputs, failing to account for the sequential and viewpoint-dependent nature of this skill. To close this gap, we introduce GamiBench, a benchmark designed to evaluate spatial reasoning and 2D-to-3D planning in MLLMs through origami-inspired folding tasks. GamiBench includes 186 regular and 186 impossible 2D crease patterns paired with their corresponding 3D folded shapes, produced from six distinct viewpoints across three visual question-answering (VQA) tasks: predicting 3D fold configurations, distinguishing valid viewpoints, and…
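The abstract is truncated here, so the benchmark's actual metric definitions are not shown on this page. As a rough illustration only, the sketch below scores multiple-choice VQA responses in two ways: plain per-question accuracy, and the share of crease patterns answered correctly from every viewpoint. The second is one plausible reading of the "cross-view consistency" listed under Findings, not the paper's definition. The `Response` fields and both function names are hypothetical and do not come from the GamiBench dataset or code.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Response:
    pattern_id: str  # hypothetical identifier for one crease pattern
    view: int        # hypothetical viewpoint index (0..5 for six views)
    predicted: str   # option label chosen by the model
    expected: str    # ground-truth option label


def accuracy(responses):
    """Fraction of individual questions answered correctly."""
    if not responses:
        return 0.0
    return sum(r.predicted == r.expected for r in responses) / len(responses)


def all_views_correct_rate(responses):
    """Fraction of crease patterns answered correctly from every view seen.

    Assumption: this stands in for a cross-view consistency score; the
    benchmark may define its metric differently.
    """
    by_pattern = defaultdict(list)
    for r in responses:
        by_pattern[r.pattern_id].append(r.predicted == r.expected)
    if not by_pattern:
        return 0.0
    return sum(all(flags) for flags in by_pattern.values()) / len(by_pattern)


# Toy data: two patterns, two views each; one wrong answer on pattern "p2".
demo = [
    Response("p1", 0, "A", "A"),
    Response("p1", 1, "A", "A"),
    Response("p2", 0, "B", "C"),
    Response("p2", 1, "C", "C"),
]
print(accuracy(demo))                # 0.75
print(all_views_correct_rate(demo))  # 0.5
```

The gap between the two numbers on the toy data shows why a per-pattern score can be stricter than per-question accuracy: a single wrong view disqualifies the whole pattern.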

Code & Models

Datasets

stvngo/GamiBench
dataset · 757 dl

Videos

GamiBench: Evaluating Spatial Reasoning and 2D-to-3D Planning Capabilities of MLLMs with Origami Folding Tasks · underline

Taxonomy

Topics: Multimodal Machine Learning Applications · Artificial Intelligence in Games · Spatial Cognition and Navigation