Navigating the Labyrinth: Evaluating LLMs' Ability to Reason About Search Problems

Nasim Borazjanizadeh; Roei Herzig; Trevor Darrell; Rogerio Feris; Leonid Karlinsky

arXiv:2406.12172·cs.AI·September 16, 2025·1 cites

TL;DR

This paper introduces SearchBench, a new benchmark for evaluating LLMs on search and reasoning problems, revealing their limitations and proposing a hybrid approach with search algorithms to significantly improve performance.

Contribution

The paper presents SearchBench, a novel benchmark for search reasoning, and demonstrates that combining LLMs with explicit search algorithms greatly enhances problem-solving accuracy.

Findings

01

GPT-4 solves only 1.4% of problems with step-by-step reasoning.

02

Prompting models to write code that implements an A* search, instead of reasoning purely in language, improves performance.

03

Multi-Stage-Multi-Try (MSMT) inference boosts GPT-4's accuracy to over 57%.
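
For readers unfamiliar with A*, the following is a minimal sketch of the kind of search program the findings refer to. It is not the paper's code or a SearchBench problem: the grid, the helper names (a_star, grid_neighbors, manhattan), and the unit step cost are all assumptions made for illustration. The search keeps a priority queue ordered by cost so far plus a heuristic estimate, so alternative pathways stay available whenever the current one hits a dead end.

```python
import heapq

def a_star(start, is_goal, neighbors, heuristic):
    """Return (cost, path) for a cheapest path, or None if no goal is reachable.

    Illustrative helper, not taken from the paper. `neighbors(state)` yields
    (next_state, step_cost) pairs; `heuristic(state)` must never overestimate
    the remaining cost for the result to be optimal.
    """
    counter = 0  # tie-breaker so the heap never compares states directly
    frontier = [(heuristic(start), counter, 0, start, [start])]
    best_g = {start: 0}  # cheapest known cost to reach each state
    while frontier:
        _, _, g, state, path = heapq.heappop(frontier)
        if g > best_g.get(state, float("inf")):
            continue  # stale entry: a cheaper route to this state was found later
        if is_goal(state):
            return g, path
        for nxt, step_cost in neighbors(state):
            new_g = g + step_cost
            if new_g < best_g.get(nxt, float("inf")):
                best_g[nxt] = new_g
                counter += 1
                heapq.heappush(
                    frontier,
                    (new_g + heuristic(nxt), counter, new_g, nxt, path + [nxt]),
                )
    return None

# Hypothetical toy puzzle: walk from S to G, '#' cells are walls.
GRID = [
    "S.#.",
    ".##.",
    "...G",
]
START, GOAL = (0, 0), (2, 3)

def grid_neighbors(state):
    """Yield the four-connected open cells next to `state`, each at cost 1."""
    r, c = state
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < len(GRID) and 0 <= nc < len(GRID[0]) and GRID[nr][nc] != "#":
            yield (nr, nc), 1

def manhattan(state):
    """Admissible heuristic for unit-cost four-connected moves."""
    return abs(state[0] - GOAL[0]) + abs(state[1] - GOAL[1])

result = a_star(START, lambda s: s == GOAL, grid_neighbors, manhattan)
print(result)
```

In this toy grid the cell to the right of S is a dead end, so the search has to fall back to the route going down and across the bottom row. That fallback is the backtracking behaviour that is hard to carry out reliably in step-by-step prose.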

Abstract

Large Language Models (LLMs) have recently achieved impressive performance in math and reasoning benchmarks. However, they often struggle with logic problems and puzzles that are relatively easy for humans. To further investigate this, we introduce a new benchmark, SearchBench, which contains 11 unique search problems inspired by intuitive puzzles. Each SearchBench problem type is equipped with automated pipelines to generate an arbitrary number of instances and analyze the feasibility, correctness, and optimality of LLM-generated solutions. We show that using step-by-step, language-only reasoning, even the most advanced LLMs fail to solve SearchBench; for example, OpenAI's frontier models GPT-4 and o1-preview solve only 1.4% and 18.6% of problems, respectively. The reason is that SearchBench problems require considering multiple pathways and performing backtracking, posing a…


Taxonomy

Topics: Open Education and E-Learning