SPhyR: Spatial-Physical Reasoning Benchmark on Material Distribution
Philipp D. Siedler

TL;DR
SPhyR is a new benchmark dataset designed to evaluate large language models' ability to perform spatial and physical reasoning tasks related to material distribution based on topology optimization in 2D structures.
Contribution
The paper introduces SPhyR, a novel dataset for benchmarking LLMs' reasoning about physical and spatial properties in material distribution tasks without simulation tools.
Findings
LLMs can partially infer material distributions from given conditions.
The dataset reveals strengths and limitations of current models in physical reasoning.
SPhyR provides a new avenue for evaluating reasoning capabilities in structural design.
Abstract
We introduce a novel dataset designed to benchmark the physical and spatial reasoning capabilities of Large Language Models (LLMs) based on topology optimization, a method for computing optimal material distributions within a design space under prescribed loads and supports. In this dataset, LLMs are provided with conditions such as a 2D boundary, applied forces, and supports, and must reason about the resulting optimal material distribution. The dataset includes a variety of tasks, ranging from filling in masked regions within partial structures to predicting complete material distributions. Solving these tasks requires understanding the flow of forces and the required material distribution under given constraints, without access to simulation tools or explicit physical models, challenging models to reason about structural stability and spatial organization. Our dataset targets the…
Peer Reviews
Decision·Submitted to ICLR 2026
The premise of the paper is intriguing and useful as a general benchmark, as it would certainly be beneficial to query whether models can reliably reason over physical forces, constraints, material interactions, and structural connectivity. Topology optimization seems like a great task through which to measure such reasoning capabilities for VLM/LLMs, especially because the pixel/voxelized format of the problem naturally translates to several modalities, including text serialization as posited i
1. My main concern is that the evaluation metrics seem far too simplistic.
- This is especially true in the "Score" metric for the harder continuous scenario, where, for a ground-truth cell with value 0.6, model predictions of 0 or 0.7 would be marked equally wrong. This sort of all-or-nothing scoring approach based on string matching seems to encourage pattern matching or memorization over physically-based reasoning -- which is precisely the issue this paper claims to address. Relative differe
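The reviewer's point about all-or-nothing scoring can be illustrated with a minimal sketch. It is not the paper's scoring code: the function names, the four-cell example, and the use of mean absolute error as the distance-aware alternative are all assumptions made for illustration. It compares numbers directly rather than strings, but the all-or-nothing behaviour is the same.

```python
# Illustrative only: these helpers are hypothetical, not SPhyR's actual metrics.

def exact_match_score(truth, pred):
    """Fraction of cells whose prediction equals the ground truth exactly."""
    if len(truth) != len(pred):
        raise ValueError("truth and pred must have the same length")
    return sum(t == p for t, p in zip(truth, pred)) / len(truth)


def mean_absolute_error(truth, pred):
    """Average per-cell distance; a near miss costs less than a far miss."""
    if len(truth) != len(pred):
        raise ValueError("truth and pred must have the same length")
    return sum(abs(t - p) for t, p in zip(truth, pred)) / len(truth)


# Four cells of a continuous-density structure; only the first cell differs.
truth = [0.6, 1.0, 0.0, 0.3]
far_miss = [0.0, 1.0, 0.0, 0.3]   # predicts 0 where the truth is 0.6
near_miss = [0.7, 1.0, 0.0, 0.3]  # predicts 0.7 where the truth is 0.6

# Exact matching cannot tell the two predictions apart (both score 0.75).
score_far = exact_match_score(truth, far_miss)
score_near = exact_match_score(truth, near_miss)

# A distance-based metric ranks the near miss as the better prediction.
mae_far = mean_absolute_error(truth, far_miss)
mae_near = mean_absolute_error(truth, near_miss)
```

Under exact matching both predictions receive the same score, while the distance-based metric separates them, which is the distinction the review asks the benchmark to capture.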
The main strength of the paper is that it is the first benchmark for this particular task of reasoning about optimal material distribution.
The main weaknesses in the paper are as follows:
- It is unclear why we would need to benchmark LLMs for this particular task. Are LLMs expected to provide more optimal answers than numerical solvers? Is it going to be faster?
- If we need to empower LLMs with physical reasoning ability, doesn't it make more sense to just give LLMs access to a numerical solver tool or a simulation tool? Why is there a need for providing LLMs with the intrinsic ability to solve a physics problem?
- There is
- Addresses an underexplored gap in LLM evaluation by testing physically-grounded spatial reasoning rather than just linguistic or visual tasks.
- Provides graduated difficulty levels (easy/hard) and multiple task variants (cell, row, column, full structure), enabling granular analysis of model capabilities.
- Evaluates 10 state-of-the-art models with multiple experimental setups (rotations, prompt variations) and well-defined metrics.
- Restricted to small 10×10 grids and relatively simple 2D scenarios, which may not fully reveal model limitations or generalize to realistic structural problems.
- Topology optimization can have multiple valid solutions; the paper doesn't address whether alternative plausible structures should be considered correct.
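The task variants named in the reviews (cell, row, column, full structure) on a 10×10 grid can be pictured with a small sketch. The mask symbol, the space-separated text serialization, and the function names below are assumptions for illustration; the dataset's actual encoding is not shown on this page.

```python
# Hypothetical sketch of masking variants; not the dataset's real format.

MASK = "?"  # assumed placeholder for cells the model must fill in


def mask_grid(grid, variant, row=0, col=0):
    """Return a copy of grid with one cell, row, column, or everything masked."""
    masked = [list(r) for r in grid]  # copy so the ground truth is untouched
    n_cols = len(grid[0])
    if variant == "cell":
        masked[row][col] = MASK
    elif variant == "row":
        masked[row] = [MASK] * n_cols
    elif variant == "column":
        for r in masked:
            r[col] = MASK
    elif variant == "full":
        masked = [[MASK] * n_cols for _ in grid]
    else:
        raise ValueError(f"unknown variant: {variant}")
    return masked


def serialize(grid):
    """Turn a grid into plain text, one row per line, cells space-separated."""
    return "\n".join(" ".join(str(c) for c in r) for r in grid)


# A toy 10x10 binary structure: material (1) along the bottom two rows.
truth = [[0] * 10 for _ in range(8)] + [[1] * 10 for _ in range(2)]

row_task = mask_grid(truth, "row", row=9)
prompt_text = serialize(row_task)
```

A model would receive `prompt_text` plus the load and support conditions and be asked to replace each masked cell with a material value. Restoring the masked cells from `truth` gives the reference answer for scoring.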
Taxonomy
Topics: Manufacturing Process and Optimization · Machine Learning in Materials Science · Image Processing and 3D Reconstruction
