Navigating the Labyrinth: Evaluating LLMs' Ability to Reason About Search Problems
Nasim Borazjanizadeh, Roei Herzig, Trevor Darrell, Rogerio Feris, Leonid Karlinsky

TL;DR
This paper introduces SearchBench, a new benchmark for evaluating LLMs on search and reasoning problems. It shows that language-only reasoning largely fails on these problems and that combining LLMs with explicit search algorithms significantly improves performance.
Contribution
The paper presents SearchBench, a novel benchmark for search reasoning, and demonstrates that combining LLMs with explicit search algorithms greatly enhances problem-solving accuracy.
Findings
GPT-4 solves only 1.4% of problems with step-by-step reasoning.
Prompting models to generate A* search algorithms improves performance.
Multi-Stage-Multi-Try (MSMT) inference boosts GPT-4's accuracy to over 57%.
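To make the second finding concrete, here is a minimal sketch of the kind of A* implementation a model might be prompted to write. It is not code from the paper: the 3x3 grid, the helper names (`a_star`, `grid_neighbors`, `manhattan`), and the unit step cost are all assumptions chosen for illustration, and the toy maze is not a SearchBench problem type.

```python
import heapq

def a_star(start, is_goal, neighbors, heuristic):
    """Generic A*: return (cost, path) for a cheapest path, or None if none exists.

    `neighbors(state)` yields (next_state, step_cost) pairs. The result is
    optimal when `heuristic` never overestimates the remaining cost.
    """
    counter = 0  # tie-breaker so the heap never has to compare states or paths
    frontier = [(heuristic(start), counter, 0, start, [start])]
    best_cost = {start: 0}
    while frontier:
        _, _, cost, state, path = heapq.heappop(frontier)
        if is_goal(state):
            return cost, path
        if cost > best_cost.get(state, float("inf")):
            continue  # stale entry: a cheaper route to this state was found later
        for nxt, step in neighbors(state):
            new_cost = cost + step
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                counter += 1
                priority = new_cost + heuristic(nxt)
                heapq.heappush(frontier, (priority, counter, new_cost, nxt, path + [nxt]))
    return None

# Hypothetical toy instance: 'S' start, 'G' goal, '#' wall, '.' free cell.
GRID = [
    "S.#",
    "..#",
    "#.G",
]
START, GOAL = (0, 0), (2, 2)

def grid_neighbors(cell):
    """Yield the 4-connected free cells next to `cell`, each at step cost 1."""
    r, c = cell
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < len(GRID) and 0 <= nc < len(GRID[0]) and GRID[nr][nc] != "#":
            yield (nr, nc), 1

def manhattan(cell):
    """Admissible heuristic for unit-cost 4-connected moves."""
    return abs(cell[0] - GOAL[0]) + abs(cell[1] - GOAL[1])

result = a_star(START, lambda s: s == GOAL, grid_neighbors, manhattan)
unreachable = a_star(START, lambda s: s == (0, 2), grid_neighbors, manhattan)
```

The priority queue lets the search abandon an unpromising branch and resume from a better one, which is the multi-path exploration and backtracking that step-by-step text reasoning struggles to carry out.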
Abstract
Large Language Models (LLMs) have recently achieved impressive performance in math and reasoning benchmarks. However, they often struggle with logic problems and puzzles that are relatively easy for humans. To further investigate this, we introduce a new benchmark, SearchBench, which contains 11 unique search problems inspired by intuitive puzzles. Each SearchBench problem type is equipped with automated pipelines to generate an arbitrary number of instances and analyze the feasibility, correctness, and optimality of LLM-generated solutions. We show that using step-by-step, language-only reasoning, even the most advanced LLMs fail to solve SearchBench; for example, OpenAI's frontier models GPT-4 and o1-preview solve only 1.4% and 18.6% of problems, respectively. The reason is that SearchBench problems require considering multiple pathways and performing backtracking, posing a…
Taxonomy
Topics: Open Education and E-Learning
