Reframing Spatial Reasoning Evaluation in Language Models: A Real-World Simulation Benchmark for Qualitative Reasoning

Fangjun Li; David C. Hogg; Anthony G. Cohn

arXiv:2405.15064 · cs.CL · May 27, 2024

TL;DR

This paper introduces a realistic 3D simulation benchmark for evaluating qualitative spatial reasoning in language models, addressing limitations of previous simplified tests and highlighting current models' challenges with complex spatial tasks.

Contribution

The paper presents a novel, simulation-based benchmark for qualitative spatial reasoning, including a logic-based consistency tool and diverse real-world scenarios for more effective evaluation.
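
To make the idea of a logic-based consistency check concrete, here is a minimal sketch in Python. It is not the authors' tool: the relation names (`left_of`, `right_of`, `in_front_of`, `behind`), the `AXIS` table, and the function `is_consistent` are all hypothetical, chosen only for illustration. The sketch assumes each directional relation imposes a strict ordering along one axis, so a set of facts is inconsistent exactly when some axis ordering contains a cycle.

```python
from collections import defaultdict

# Hypothetical vocabulary: relation -> (axis, flip).
# "a left_of b" is read as a < b on the x axis; "a right_of b" as b < a.
AXIS = {
    "left_of": ("x", False),
    "right_of": ("x", True),
    "in_front_of": ("y", False),
    "behind": ("y", True),
}


def _has_cycle(graph):
    """Depth-first search for a cycle in a directed graph (dict of sets)."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = defaultdict(int)  # every node starts WHITE

    def visit(node):
        color[node] = GREY
        for nxt in graph.get(node, ()):
            if color[nxt] == GREY:  # back edge: cycle found
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in list(graph))


def is_consistent(facts):
    """facts: iterable of (a, relation, b) triples.

    Returns True when the strict orderings implied on each axis are acyclic.
    """
    graphs = {"x": defaultdict(set), "y": defaultdict(set)}
    for a, rel, b in facts:
        axis, flip = AXIS[rel]
        lo, hi = (b, a) if flip else (a, b)
        graphs[axis][lo].add(hi)  # edge lo -> hi means lo < hi on that axis
    return all(not _has_cycle(g) for g in graphs.values())


facts = [("sofa", "left_of", "table"), ("table", "left_of", "lamp")]
print(is_consistent(facts))
print(is_consistent(facts + [("lamp", "left_of", "sofa")]))
```

The first call describes a satisfiable layout; the second adds a fact that closes a cycle on the x axis, so no arrangement of the three objects can satisfy it. A fuller checker would also need to cover the topological and distance relations the benchmark includes, which this sketch ignores.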

Findings

01. Advanced LMs struggle with multi-hop spatial reasoning.

02. Models have difficulty interpreting mixed-view descriptions.

03. The benchmark reveals specific strengths and limitations of current LMs.

Abstract

Spatial reasoning plays a vital role in both human cognition and machine intelligence, prompting new research into language models' (LMs) capabilities in this regard. However, existing benchmarks reveal shortcomings in evaluating qualitative spatial reasoning (QSR). These benchmarks typically present oversimplified scenarios or unclear natural language descriptions, hindering effective evaluation. We present a novel benchmark for assessing QSR in LMs, which is grounded in realistic 3D simulation data, offering a series of diverse room layouts with various objects and their spatial relationships. This approach provides a more detailed and context-rich narrative for spatial reasoning evaluation, diverging from traditional, toy-task-oriented scenarios. Our benchmark encompasses a broad spectrum of qualitative spatial relationships, including topological, directional, and distance…

Code & Models

Datasets

Fangjun/RoomSpace
dataset · 16 dl

Taxonomy

Topics: Natural Language Processing Techniques · Speech and dialogue systems · Semantic Web and Ontologies