MMRo: Are Multimodal LLMs Eligible as the Brain for In-Home Robotics?
Jinming Li, Yichen Zhu, Zhiyuan Xu, Jindong Gu, Minjie Zhu, Xin Liu, Ning Liu, Yaxin Peng, Feifei Feng, Jian Tang

TL;DR
This paper introduces the first benchmark to evaluate whether multimodal large language models can serve as the brain of in-home robots, assessing their perception, planning, reasoning, and safety capabilities.
Contribution
It presents a comprehensive benchmark with 14 metrics for evaluating MLLMs in robotic contexts and provides experimental results showing current models' limitations.
Findings
No single MLLM excels in all evaluated areas.
Current MLLMs are not yet reliable enough to be the robot's central processor.
The benchmark highlights key areas for improvement in multimodal LLMs for robotics.
Abstract
It is fundamentally challenging for robots to serve as useful assistants in human environments because this requires addressing a spectrum of sub-problems across robotics, including perception, language understanding, reasoning, and planning. Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated their exceptional abilities in solving complex mathematical problems and mastering commonsense and abstract reasoning. This has led to the recent use of MLLMs as the brain in robotic systems, enabling these models to conduct high-level planning prior to triggering low-level control actions for task execution. However, it remains uncertain whether existing MLLMs are reliable enough to serve as the brain of a robot. In this study, we introduce the Multimodal LLM for Robotic (MMRo) benchmark, the first benchmark to test the capability of MLLMs for…
Taxonomy
Topics: Robotics and Automated Systems · Speech and dialogue systems · AI in Service Interactions
