Evaluation of Habitat Robotics using Large Language Models
William Li, Lei Hamilton, Kaise Al-natour, Sanjeev Mohindra

TL;DR
This study evaluates several Large Language Models on embodied robotic tasks in simulated kitchen environments and finds that reasoning models such as OpenAI o3-mini outperform non-reasoning models such as GPT-4o and Llama 3.
Contribution
It uses the Meta PARTNR benchmark to assess LLMs on embodied robotic tasks and shows that reasoning models outperform non-reasoning models in these environments.
Findings
o3-mini outperforms GPT-4o and Llama 3 in robotic tasks
Reasoning models lead in both fully observable and partially observable environments, under centralized and decentralized configurations
Results suggest promising directions for embodied robotic development
Abstract
This paper evaluates the effectiveness of Large Language Models at solving embodied robotic tasks using the Meta PARTNR benchmark. Meta PARTNR provides simplified environments and robotic interactions within randomized indoor kitchen scenes. Each randomized kitchen scene is assigned a task that two robotic agents must solve cooperatively. We evaluated multiple frontier models on Meta PARTNR environments. Our results indicate that reasoning models like OpenAI o3-mini outperform non-reasoning models like OpenAI GPT-4o and Llama 3 when operating in PARTNR's embodied robotic environments. o3-mini displayed superior performance across centralized, decentralized, full observability, and partial observability configurations. This provides a promising avenue of research for embodied robotic development.
Taxonomy
Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Robot Manipulation and Learning
Methods: LLaMA
