FloorplanQA: A Benchmark for Spatial Reasoning in LLMs using Structured Representations
Fedor Rodionov, Abdelrahman Eldesokey, Michael Birsak, John Femiani, Bernard Ghanem, Peter Wonka

TL;DR
FloorplanQA is a benchmark that evaluates the spatial reasoning abilities of large language models using structured indoor scene representations, revealing their limitations in understanding physical constraints and spatial coherence.
Contribution
We introduce FloorplanQA, a novel benchmark with structured representations to assess spatial reasoning in LLMs, highlighting their shortcomings in physical and spatial understanding.
Findings
Models succeed in shallow queries but struggle with physical constraints.
Models remain mostly robust to small spatial perturbations.
FloorplanQA exposes a blind spot in LLM spatial reasoning.
Abstract
We introduce FloorplanQA, a diagnostic benchmark for evaluating spatial reasoning in large language models (LLMs). FloorplanQA is grounded in structured representations of indoor scenes (e.g., kitchens, living rooms, bedrooms, bathrooms, and others), encoded symbolically in JSON or XML layouts. The benchmark covers core spatial tasks, including distance measurement, visibility, path finding, and object placement within constrained spaces. Our results across a variety of frontier open-source and commercial LLMs reveal that while models may succeed in shallow queries, they often fail to respect physical constraints and preserve spatial coherence, though they remain mostly robust to small spatial perturbations. FloorplanQA uncovers a blind spot in today's LLMs: inconsistent reasoning about indoor layouts. We hope this benchmark inspires new work on language models that can accurately…
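To make the task types concrete, here is a minimal Python sketch of a distance query and a placement check over a JSON layout. The schema (`room`, `objects`, `x`, `y`, `w`, `d`) and all helper names are assumptions made for illustration; they are not taken from the FloorplanQA dataset, and objects are simplified to axis-aligned rectangles.

```python
import json
import math

# Hypothetical layout: field names are illustrative, not the benchmark's schema.
LAYOUT_JSON = """
{
  "room": {"type": "bedroom", "width": 6.0, "depth": 6.0},
  "objects": [
    {"name": "bed",  "x": 0.0, "y": 0.0, "w": 2.0, "d": 2.0},
    {"name": "desk", "x": 3.5, "y": 4.5, "w": 1.0, "d": 1.0}
  ]
}
"""

def center(obj):
    """Center of an axis-aligned rectangle given its corner and size."""
    return (obj["x"] + obj["w"] / 2, obj["y"] + obj["d"] / 2)

def center_distance(a, b):
    """Euclidean distance between the centers of two objects."""
    (ax, ay), (bx, by) = center(a), center(b)
    return math.hypot(ax - bx, ay - by)

def overlaps(a, b):
    """True if two axis-aligned rectangles share interior area."""
    return (a["x"] < b["x"] + b["w"] and b["x"] < a["x"] + a["w"]
            and a["y"] < b["y"] + b["d"] and b["y"] < a["y"] + a["d"])

def fits_in_room(obj, room):
    """True if the rectangle lies fully inside the rectangular room."""
    return (obj["x"] >= 0 and obj["y"] >= 0
            and obj["x"] + obj["w"] <= room["width"]
            and obj["y"] + obj["d"] <= room["depth"])

def can_place(candidate, layout):
    """Placement check: inside the room and colliding with no existing object."""
    if not fits_in_room(candidate, layout["room"]):
        return False
    return not any(overlaps(candidate, o) for o in layout["objects"])

layout = json.loads(LAYOUT_JSON)
objects = {o["name"]: o for o in layout["objects"]}
bed_desk_distance = center_distance(objects["bed"], objects["desk"])
chair_on_bed = {"name": "chair", "x": 1.0, "y": 1.0, "w": 0.5, "d": 0.5}
chair_free = {"name": "chair", "x": 2.5, "y": 0.0, "w": 1.0, "d": 1.0}
```

In this sketch the bed and desk centers are (1, 1) and (4, 5), so the distance is 5.0; `chair_on_bed` is rejected because it collides with the bed, while `chair_free` is accepted. The benchmark's actual tasks (visibility, shortest path, non-axis-aligned geometry) are richer than what is shown here.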
Peer Reviews
Decision: ICLR 2026 Conference Desk Rejected Submission
- A large dataset containing both synthetic and hand-designed samples of structural design schematics in a descriptive JSON/XML format.
- The dataset is well designed to cover many different cases, such as room shapes, object placement, and collisions.
- Robust evaluation that shows the current capabilities of LLMs for quantitative reasoning in spatial questions.
- Well-organized paper describing the data generation and evaluation process.
- The benchmark only evaluates the models' internal quantitative capabilities and doesn't consider tool use or agentic workflows. LLMs are bad at generating high-precision quantitative answers such as "distance between two points". It would be better to see how well the model can plan to reach the objective while performing the mathematical calculations with code or external tools, such as a calculator, for better precision.
- The authors haven't evaluated any VLLMs with their benchmark. Since this is a
1. Symbolic input, tool-free setup that isolates pure geometric/topological reasoning without visual noise or help from external solvers.
2. Comprehensive coverage across metric, topological, and action/path tasks, including Free Space, Max Box, Placement, Visibility, and Shortest Path.
3. Strong comparability via automation: strict output formats and tolerance thresholds; tailored scoring rules for numbers, sets, and sequences (e.g., 2–5% tolerances, set matching, Fréchet threshold with colli
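As a rough illustration of the numeric and set scoring rules this review mentions, the following Python sketch checks a predicted number against a relative tolerance and compares predicted and reference sets. The function names, the 2% default (the low end of the "2–5%" range quoted above), and the name normalization are assumptions; the benchmark's exact rules may differ, and the Fréchet-based path check is not shown.

```python
def within_tolerance(pred, truth, rel_tol=0.02):
    """Accept a numeric answer whose relative error is at most rel_tol.

    rel_tol=0.02 is an assumed default, not the benchmark's documented value.
    """
    if truth == 0:
        # Fall back to an absolute check when the reference value is zero.
        return abs(pred) <= rel_tol
    return abs(pred - truth) <= rel_tol * abs(truth)

def set_match(pred, truth):
    """Accept a set answer if it equals the reference, ignoring order and case."""
    normalize = lambda items: {item.strip().lower() for item in items}
    return normalize(pred) == normalize(truth)
```

For example, a predicted distance of 5.05 against a reference of 5.0 passes at 2%, while 5.2 does not.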
1. Gap to real-world perception/interaction: purely symbolic floor plans omit imagery, noise, and perception errors, limiting ecological validity for embodied/vision tasks.
2. Planar-geometry focus: limited coverage of richer functional metrics (e.g., door flow, dynamic crowds, reachability and behavior constraints).
3. Sensitive to long-context/token budgets: truncation and formatting issues materially affect outcomes.
- The paper proposes a publicly available dataset of geometric representations of room layouts, enabling the evaluation of LLMs' spatial and layout understanding.
- An automatic pipeline is introduced to generate synthetic layouts using LLMs, demonstrating strong potential for building scalable benchmarks.
- The proposed benchmark consists of both synthetic and realistic layouts, providing rich information to evaluate.
- The proposed benchmark reveals the limitations of current LLMs in co
- The discussion of experimental results appears somewhat shallow (based only on numerical results), with limited analysis of why the models fail on specific tasks. A qualitative analysis of some failures would give insight into model behavior on this task.
- No methods are proposed to improve model performance; even preliminary ideas or directions for enhancement would strengthen the contribution.
- A discussion regarding models with visual training could be valuable, as it may reveal whether such t
1. The benchmark isolates symbolic spatial reasoning using structured floorplans rather than images, offering a complementary diagnostic perspective to vision-language tasks.
2. A clear taxonomy spans metric, topological, and action-like tasks; answer formats and scoring rules are specified with tolerances for numeric and geometric checks.
3. Encoding ablation (JSON vs. XML) suggests limited sensitivity to layout serialization, at least for selected tasks and models.
1. 1800 layouts are generated by an LLM with rule-based filters, and synthetic objects are axis-aligned boxes; only 200 HSSD layouts introduce non–axis-aligned geometry. This raises concerns about realism, diversity, and potential generator biases.
2. Questions are posed on single-room layouts; multi-room reasoning and dynamic layout changes (e.g., moving objects and re-evaluating visibility across rooms) are not covered. Room types are mainly kitchens, living rooms, and bedrooms.
3. Numeric toler
Taxonomy
Topics: Semantic Web and Ontologies · Natural Language Processing Techniques · Data Mining Algorithms and Applications
