Can Large Language Models be Good Path Planners? A Benchmark and Investigation on Spatial-temporal Reasoning
Mohamed Aghzal, Erion Plaku, Ziyu Yao

TL;DR
This paper introduces PPNL, a new benchmark for evaluating large language models' spatial-temporal reasoning in path planning tasks, revealing strengths and limitations of models like GPT-4 and fine-tuned LLMs.
Contribution
The paper presents PPNL, a novel benchmark for spatial-temporal reasoning in path planning, and systematically evaluates LLMs' performance, highlighting their capabilities and challenges.
Findings
Few-shot GPT-4 shows promise in spatial reasoning.
Fine-tuned LLMs excel on in-distribution tasks but struggle with larger environments.
GPT-4 still fails in long-term temporal reasoning.
Abstract
Large language models (LLMs) have achieved remarkable success across a wide spectrum of tasks; however, they still face limitations in scenarios that demand long-term planning and spatial reasoning. To facilitate this line of research, in this work, we propose a new benchmark, termed Path Planning from Natural Language (PPNL). Our benchmark evaluates LLMs' spatial-temporal reasoning by formulating "path planning" tasks that require an LLM to navigate to target locations while avoiding obstacles and adhering to constraints. Leveraging this benchmark, we systematically investigate LLMs including GPT-4 via different few-shot prompting methodologies as well as BART and T5 of various sizes via fine-tuning. Our experimental results show the promise of few-shot GPT-4 in spatial reasoning, when it is prompted to reason and act…
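To make the task formulation concrete, here is a minimal sketch of how a predicted plan might be checked against a grid environment. It is not the benchmark's own code: the square grid, the (row, column) coordinates, the four action names, and the function `follows_plan` are all assumptions introduced for illustration.

```python
from typing import Iterable, Set, Tuple

Cell = Tuple[int, int]

# Hypothetical action vocabulary, expressed as (row, col) offsets.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def follows_plan(
    size: int,
    start: Cell,
    goal: Cell,
    obstacles: Set[Cell],
    actions: Iterable[str],
) -> bool:
    """Return True if the actions lead from start to goal on a size x size
    grid without leaving the grid or entering an obstacle cell."""
    row, col = start
    for action in actions:
        delta = MOVES.get(action)
        if delta is None:
            return False  # unknown action name
        row, col = row + delta[0], col + delta[1]
        if not (0 <= row < size and 0 <= col < size):
            return False  # stepped off the grid
        if (row, col) in obstacles:
            return False  # collided with an obstacle
    return (row, col) == goal


# Illustrative environment: 3 x 3 grid with one obstacle in the center.
plan_ok = follows_plan(3, (0, 0), (2, 2), {(1, 1)},
                       ["right", "right", "down", "down"])
plan_blocked = follows_plan(3, (0, 0), (2, 2), {(1, 1)},
                            ["down", "right", "right", "down"])
```

In this sketch the first plan goes around the obstacle and reaches the goal, while the second steps into the obstacle cell and is rejected. A checker of this kind only captures the spatial side of the task; any additional constraints would need their own checks.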
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
Methods: Gated Linear Unit · Attention Is All You Need · Dropout · Attention Dropout · Dense Connections · Inverse Square Root Schedule · Linear Layer · Label Smoothing · SentencePiece · Adam
