LLM-BABYBENCH: Understanding and Evaluating Grounded Planning and Reasoning in LLMs

Omar Choukrani; Idriss Malek; Daniil Orel; Zhuohan Xie; Zangir Iklassov; Martin Takáč; Salem Lahlou

arXiv:2505.12135·cs.AI·May 20, 2025

TL;DR

LLM-BabyBench is a new benchmark suite designed to evaluate large language models on grounded planning and reasoning tasks within a text-based environment, addressing key aspects of grounded intelligence.

Contribution

This paper introduces LLM-BabyBench, a benchmark comprising datasets, evaluation metrics, and code for assessing LLMs' grounded reasoning in interactive environments, a resource that was previously lacking.

Findings

01 Baseline results reveal significant challenges for LLMs in grounded reasoning tasks.

02 The benchmark provides a standardized framework for reproducible evaluation.

03 Datasets and tools are publicly available for the research community.

Abstract

Assessing the capacity of Large Language Models (LLMs) to plan and reason within the constraints of interactive environments is crucial for developing capable AI agents. We introduce LLM-BabyBench, a new benchmark suite designed specifically for this purpose. Built upon a textual adaptation of the procedurally generated BabyAI grid world, this suite evaluates LLMs on three fundamental aspects of grounded intelligence: (1) predicting the consequences of actions on the environment state (Predict task), (2) generating sequences of low-level actions to achieve specified objectives (Plan task), and (3) decomposing high-level instructions into coherent subgoal sequences (Decompose task). We detail the methodology for generating the three corresponding datasets (LLM-BabyBench-Predict, -Plan, -Decompose) by extracting…
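To make the Predict task concrete, the following is a minimal sketch of what "predicting the consequences of actions on the environment state" can look like in a grid world. It is not the benchmark's code: the `AgentState` class, the `step` and `predict` functions, the heading encoding, and the restriction to three movement actions are all assumptions made for illustration. A reference simulator of this kind produces the ground-truth final state against which a model's predicted state could be compared.

```python
from dataclasses import dataclass

# Heading encoding is an assumption for this sketch:
# 0 = east, 1 = south, 2 = west, 3 = north, with y growing downward.
DIRS = [(1, 0), (0, 1), (-1, 0), (0, -1)]


@dataclass(frozen=True)
class AgentState:
    """Hypothetical agent state: grid position plus heading."""
    x: int
    y: int
    heading: int


def step(state, action, walls, width, height):
    """Apply one low-level action and return the resulting state."""
    if action == "left":
        return AgentState(state.x, state.y, (state.heading - 1) % 4)
    if action == "right":
        return AgentState(state.x, state.y, (state.heading + 1) % 4)
    if action == "forward":
        dx, dy = DIRS[state.heading]
        nx, ny = state.x + dx, state.y + dy
        in_bounds = 0 <= nx < width and 0 <= ny < height
        # Moving into a wall or off the grid leaves the agent in place.
        if not in_bounds or (nx, ny) in walls:
            return state
        return AgentState(nx, ny, state.heading)
    raise ValueError(f"unknown action: {action!r}")


def predict(state, actions, walls, width, height):
    """Roll a sequence of actions forward to get the final state."""
    for action in actions:
        state = step(state, action, walls, width, height)
    return state


if __name__ == "__main__":
    start = AgentState(x=1, y=1, heading=0)
    plan = ["forward", "right", "forward", "forward"]
    final = predict(start, plan, walls={(2, 3)}, width=5, height=5)
    print(final)
```

In this example the last `forward` is blocked by the wall at (2, 3), so the agent ends at (2, 2) facing south. The same rollout could, under the same assumptions, be used to check a Plan-style answer by testing whether the final state satisfies a goal condition.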

Code & Models

Repositories

choukrani/llm-babybench (Official)

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · AI-based Problem Solving and Planning