TL;DR
LocationReasoner is a benchmark for evaluating large language models' reasoning skills in real-world site selection, revealing current models' limited performance and challenges in complex spatial reasoning tasks.
Contribution
The paper introduces LocationReasoner, a benchmark that pairs site selection queries with a sandbox of tools and a verification step to assess LLMs' reasoning in real-world spatial and logistical scenarios.
Findings
State-of-the-art models show limited improvement over predecessors.
The OpenAI o4 model fails on 30% of site selection tasks.
Agentic strategies like ReAct can worsen outcomes due to over-reasoning.
Abstract
Recent advances in large language models (LLMs), particularly those enhanced through reinforced post-training, have demonstrated impressive reasoning capabilities, as exemplified by models such as OpenAI o1 and DeepSeek-R1. However, these capabilities are predominantly benchmarked on domains like mathematical problem solving and code generation, leaving open the question of whether such reasoning skills generalize to complex real-world scenarios. In this paper, we introduce LocationReasoner, a benchmark designed to evaluate LLMs' reasoning abilities in the context of real-world site selection, where models must identify feasible locations by reasoning over diverse and complicated spatial, environmental, and logistic constraints. The benchmark covers carefully crafted queries of varying difficulty levels and is supported by a sandbox environment with in-house tools for constraint-based…
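To make the task concrete, here is a minimal sketch of what a constraint-based site selection query and its automatic check might look like. It is not the benchmark's actual tooling: the `Site` fields, the `feasible_sites` and `verify` helpers, the thresholds, and the sample data are all hypothetical, chosen only to show one spatial, one environmental, and one logistic constraint being combined.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt


@dataclass(frozen=True)
class Site:
    """A candidate location (all fields are illustrative assumptions)."""
    name: str
    lat: float
    lon: float
    flood_risk: float    # hypothetical environmental score in [0, 1]
    parking_spaces: int  # hypothetical logistic attribute


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def feasible_sites(sites, anchor, max_km, max_flood_risk, min_parking):
    """Keep only the sites that satisfy every constraint at once."""
    return [
        s for s in sites
        if haversine_km(s.lat, s.lon, anchor[0], anchor[1]) <= max_km  # spatial
        and s.flood_risk <= max_flood_risk                             # environmental
        and s.parking_spaces >= min_parking                            # logistic
    ]


def verify(predicted_names, sites, **constraints):
    """Check a model's answer against the set computed from the constraints."""
    expected = {s.name for s in feasible_sites(sites, **constraints)}
    return set(predicted_names) == expected


# Hypothetical sample data and query.
SITES = [
    Site("A", 42.370, -71.060, flood_risk=0.1, parking_spaces=50),
    Site("B", 42.500, -71.060, flood_risk=0.1, parking_spaces=50),  # too far away
    Site("C", 42.360, -71.070, flood_risk=0.8, parking_spaces=50),  # flood risk too high
    Site("D", 42.355, -71.065, flood_risk=0.2, parking_spaces=5),   # too little parking
]
QUERY = dict(anchor=(42.36, -71.06), max_km=5.0, max_flood_risk=0.3, min_parking=20)
```

In this toy setup, a model's answer counts as correct only if it names exactly the sites that pass all three filters, so `verify(["A"], SITES, **QUERY)` succeeds while an answer that also includes `"B"` does not.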
