SPhyR: Spatial-Physical Reasoning Benchmark on Material Distribution

Philipp D. Siedler

arXiv:2505.16048 · cs.AI · February 6, 2026

TL;DR

SPhyR is a new benchmark dataset designed to evaluate large language models' ability to perform spatial and physical reasoning tasks related to material distribution based on topology optimization in 2D structures.

Contribution

The paper introduces SPhyR, a novel dataset for benchmarking LLMs' reasoning about physical and spatial properties in material distribution tasks without simulation tools.

Findings

01. LLMs can partially infer material distributions from given conditions.

02. The dataset reveals strengths and limitations of current models in physical reasoning.

03. SPhyR provides a new avenue for evaluating reasoning capabilities in structural design.

Abstract

We introduce a novel dataset designed to benchmark the physical and spatial reasoning capabilities of Large Language Models (LLMs) based on topology optimization, a method for computing optimal material distributions within a design space under prescribed loads and supports. In this dataset, LLMs are provided with conditions such as 2D boundaries, applied forces, and supports, and must reason about the resulting optimal material distribution. The dataset includes a variety of tasks, ranging from filling in masked regions within partial structures to predicting complete material distributions. Solving these tasks requires understanding the flow of forces and the required material distribution under given constraints, without access to simulation tools or explicit physical models, challenging models to reason about structural stability and spatial organization. Our dataset targets the…
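To make the masked-region task from the abstract concrete, here is a minimal sketch of how a partial structure could be presented to a model as text. The grid values, the `MASK` token, and the helper names `mask_row` and `serialize` are all hypothetical illustrations, not the dataset's actual encoding.

```python
# Hypothetical sketch: mask one row of a material grid and serialize it as text.
# The token "V" and the space-separated layout are assumptions for illustration;
# they are not taken from the SPhyR dataset.

MASK = "V"  # assumed placeholder for a cell the model must fill in


def mask_row(grid, row_index):
    """Return a copy of the grid with every cell in one row replaced by MASK."""
    masked = [list(row) for row in grid]
    masked[row_index] = [MASK] * len(grid[row_index])
    return masked


def serialize(grid):
    """Join cells with spaces and rows with newlines to form a text prompt body."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)


# Toy 3x3 binary material distribution (1 = material, 0 = void).
grid = [
    [1, 1, 1],
    [0, 1, 0],
    [1, 1, 1],
]

masked = mask_row(grid, 1)
print(serialize(masked))
```

The model's task would then be to recover the hidden row from the visible cells and the stated loads and supports, which this toy example omits.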

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 3

Strengths

The premise of the paper is intriguing and useful as a general benchmark, as it would certainly be beneficial to query whether models can reliably reason over physical forces, constraints, material interactions, and structural connectivity. Topology optimization seems like a great task through which to measure such reasoning capabilities for VLM/LLMs, especially because the pixel/voxelized format of the problem naturally translates to several modalities, including text serialization as posited i

Weaknesses

1. My main concern is that the evaluation metrics seem far too simplistic.
- This is especially true of the "Score" metric for the harder continuous scenario, where, for a ground-truth cell with value 0.6, model predictions of 0 or 0.7 would be marked equally wrong. This sort of all-or-nothing scoring approach based on string matching seems to encourage pattern matching or memorization over physically-based reasoning, which is precisely the issue this paper claims to address. Relative differe
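The reviewer's example can be sketched in a few lines. The two functions below are illustrative assumptions rather than the paper's metric implementation: `exact_match_score` stands in for all-or-nothing scoring (using numeric equality in place of string matching), and `mean_absolute_error` is one possible distance-aware alternative.

```python
# Illustrative comparison of all-or-nothing scoring with a distance-aware metric.
# Both function names and definitions are assumptions made for this sketch.


def exact_match_score(truth, pred):
    """Fraction of cells where the prediction equals the ground truth exactly."""
    hits = sum(1 for t, p in zip(truth, pred) if t == p)
    return hits / len(truth)


def mean_absolute_error(truth, pred):
    """Average absolute difference per cell; lower is better."""
    return sum(abs(t - p) for t, p in zip(truth, pred)) / len(truth)


truth = [0.6]  # ground-truth cell value from the reviewer's example
far = [0.0]    # prediction far from the truth
near = [0.7]   # prediction close to the truth

# Exact matching cannot tell the two predictions apart.
print(exact_match_score(truth, far), exact_match_score(truth, near))

# A distance-aware metric ranks the near prediction as better.
print(mean_absolute_error(truth, far), mean_absolute_error(truth, near))
```

Under exact matching both predictions score zero, while the absolute error separates them, which is the distinction the reviewer asks the metric to capture.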

Reviewer 02 · Rating 2 · Confidence 3

Strengths

The main strength of the paper is that it is the first benchmark for this particular task of reasoning about optimal material distribution.

Weaknesses

The main weaknesses in the paper are as follows:
- It is unclear why we would need to benchmark LLMs for this particular task. Are LLMs expected to provide more optimal answers than numerical solvers? Is it going to be faster?
- If we need to empower LLMs with physical reasoning ability, doesn't it make more sense to just give LLMs access to a numerical solver or a simulation tool? Why is there a need to provide LLMs with the intrinsic ability to solve a physics problem?
- There is

Reviewer 03 · Rating 8 · Confidence 2

Strengths

- Addresses an underexplored gap in LLM evaluation by testing physically-grounded spatial reasoning rather than just linguistic or visual tasks.
- Provides graduated difficulty levels (easy/hard) and multiple task variants (cell, row, column, full structure), enabling granular analysis of model capabilities.
- Evaluates 10 state-of-the-art models with multiple experimental setups (rotations, prompt variations) and well-defined metrics.
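The rotation setups mentioned in the last point can be pictured as turning the same grid by quarter turns before it is shown to a model. The sketch below assumes a plain list-of-lists grid and a hypothetical helper `rotate_cw`; how the paper actually applies rotations is not described in this excerpt.

```python
# Hypothetical sketch: rotate a grid 90 degrees clockwise to build a rotated
# variant of the same task. The helper name and grid values are assumptions.


def rotate_cw(grid):
    """Rotate a rectangular grid 90 degrees clockwise."""
    # Reversing the row order and then transposing yields a clockwise turn.
    return [list(row) for row in zip(*grid[::-1])]


grid = [
    [1, 2],
    [3, 4],
]

rotated = rotate_cw(grid)
print(rotated)  # [[3, 1], [4, 2]]

# Four quarter turns bring the grid back to its starting orientation.
full_turn = grid
for _ in range(4):
    full_turn = rotate_cw(full_turn)
```

Because a physically consistent answer should rotate along with its inputs, comparing predictions across such variants is one way to probe whether a model relies on orientation-specific patterns.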

Weaknesses

- Restricted to small 10×10 grids and relatively simple 2D scenarios, which may not fully reveal model limitations or generalize to realistic structural problems.
- Topology optimization can have multiple valid solutions; the paper doesn't address whether alternative plausible structures should be considered correct.

Code & Models

Repositories

philippds/sphyr
none · Official

Datasets

philippds/SPhyR
dataset · 263 dl

Taxonomy

Topics: Manufacturing Process and Optimization · Machine Learning in Materials Science · Image Processing and 3D Reconstruction