LLM-BABYBENCH: Understanding and Evaluating Grounded Planning and Reasoning in LLMs
Omar Choukrani, Idriss Malek, Daniil Orel, Zhuohan Xie, Zangir Iklassov, Martin Takáč, Salem Lahlou

TL;DR
LLM-BabyBench is a new benchmark suite designed to evaluate large language models on grounded planning and reasoning tasks within a text-based environment, addressing key aspects of grounded intelligence.
Contribution
This paper introduces LLM-BabyBench, a comprehensive benchmark with datasets, evaluation metrics, and code for assessing LLMs' grounded reasoning in interactive environments. A standardized resource of this kind was previously lacking.
Findings
Baseline results reveal significant challenges for LLMs in grounded reasoning tasks.
The benchmark provides a standardized framework for reproducible evaluation.
Datasets and tools are publicly available for the research community.
Abstract
Assessing the capacity of Large Language Models (LLMs) to plan and reason within the constraints of interactive environments is crucial for developing capable AI agents. We introduce LLM-BabyBench, a new benchmark suite designed specifically for this purpose. Built upon a textual adaptation of the procedurally generated BabyAI grid world, this suite evaluates LLMs on three fundamental aspects of grounded intelligence: (1) predicting the consequences of actions on the environment state (Predict task), (2) generating sequences of low-level actions to achieve specified objectives (Plan task), and (3) decomposing high-level instructions into coherent subgoal sequences (Decompose task). We detail the methodology for generating the three corresponding datasets (LLM-BabyBench-Predict, -Plan, -Decompose) by extracting…
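To make the first two abilities listed in the abstract concrete (predicting the effect of actions on the state, and checking that a sequence of low-level actions reaches an objective), here is a minimal toy sketch. It is not the benchmark's code: the grid size, the state tuple, the heading convention, and the function names `step`, `rollout`, and `plan_reaches` are all assumptions made for illustration, and only three movement actions are modeled.

```python
from typing import FrozenSet, List, Tuple

# Toy convention (an assumption, not taken from the paper):
# heading index 0=right, 1=down, 2=left, 3=up, with y growing downward.
DIRS = [(1, 0), (0, 1), (-1, 0), (0, -1)]

State = Tuple[int, int, int]   # (x, y, heading index)
Cell = Tuple[int, int]


def step(state: State, action: str, walls: FrozenSet[Cell], size: int) -> State:
    """Apply one action and return the resulting state."""
    x, y, d = state
    if action == "left":       # rotate 90 degrees counter-clockwise
        return (x, y, (d - 1) % 4)
    if action == "right":      # rotate 90 degrees clockwise
        return (x, y, (d + 1) % 4)
    if action == "forward":
        dx, dy = DIRS[d]
        nx, ny = x + dx, y + dy
        inside = 0 <= nx < size and 0 <= ny < size
        if inside and (nx, ny) not in walls:
            return (nx, ny, d)
        return state           # blocked moves leave the state unchanged
    raise ValueError(f"unknown action: {action}")


def rollout(state: State, actions: List[str],
            walls: FrozenSet[Cell] = frozenset(), size: int = 5) -> State:
    """Prediction-style query: the state after executing all actions."""
    for action in actions:
        state = step(state, action, walls, size)
    return state


def plan_reaches(start: State, actions: List[str], goal: Cell,
                 walls: FrozenSet[Cell] = frozenset(), size: int = 5) -> bool:
    """Planning-style check: does the action sequence end on the goal cell?"""
    x, y, _ = rollout(start, actions, walls, size)
    return (x, y) == goal


if __name__ == "__main__":
    plan = ["forward", "forward", "right", "forward"]
    print(rollout((0, 0, 0), plan))
    print(plan_reaches((0, 0, 0), plan, goal=(2, 1)))
```

In this framing, a prediction-style item would give a model the start state and action list and compare its answer with `rollout`, while a planning-style item would give the start state and goal and score the model's proposed action list with `plan_reaches`. How the actual benchmark encodes states and scores answers is not specified in the text above.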
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · AI-based Problem Solving and Planning
