Butter-Bench: Evaluating LLM Controlled Robots for Practical Intelligence

Callum Sharrock; Lukas Petersson; Hanna Petersson; Axel Backlund; Axel Wennström; Kristoffer Nordström; Elias Aronsson

arXiv:2510.21860·cs.RO·October 28, 2025

TL;DR

Butter-Bench is a benchmark that evaluates the practical intelligence of LLM-controlled robots in real-world scenarios. It reveals current limitations in multi-step spatial planning and social understanding.

Contribution

This paper introduces Butter-Bench, a new benchmark for assessing LLMs in robotic tasks. It evaluates the LLM's high-level reasoning in isolation from the low-level control handled by a Vision Language Action (VLA) model.

Findings

01

The best LLMs score 40% on Butter-Bench

02

The mean human score is 95% on Butter-Bench

03

Fine-tuning LLMs for embodied reasoning does not improve scores

Abstract

We present Butter-Bench, a benchmark evaluating large language model (LLM) controlled robots for practical intelligence, defined as the ability to navigate the messiness of the physical world. Current state-of-the-art robotic systems use a hierarchical architecture with LLMs in charge of high-level reasoning, and a Vision Language Action (VLA) model for low-level control. Butter-Bench evaluates the LLM part in isolation from the VLA. Although LLMs have repeatedly surpassed humans in evaluations requiring analytical intelligence, we find humans still outperform LLMs on Butter-Bench. The best LLMs score 40% on Butter-Bench, while the mean human score is 95%. LLMs struggled the most with multi-step spatial planning and social understanding. We also evaluate LLMs that are fine-tuned for embodied reasoning and conclude that this training does not improve their score on Butter-Bench.
