Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces

Gen Luo; Ganlin Yang; Ziyang Gong; Guanzhou Chen; Haonan Duan; Erfei Cui; Ronglei Tong; Zhi Hou; Tianyi Zhang; Zhe Chen; Shenglong Ye; Lewei Lu; Jingbo Wang; Wenhai Wang; Jifeng Dai; Yu Qiao; Rongrong Ji; Xizhou Zhu

arXiv:2506.00123 · cs.CV · June 3, 2025

Open Access

TL;DR

The paper introduces VeBrain, a unified multimodal framework that enables robots to perceive, reason, and control through a text-based approach, demonstrating superior performance and adaptability in real-world robotic tasks.

Contribution

VeBrain unifies multimodal understanding, reasoning, and control for robots by reformulating control as text-based tasks and introducing a new dataset and robotic adapter.

Findings

01

VeBrain outperforms existing MLLMs on multiple benchmarks.

02

VeBrain achieves significant performance gains in robotic control tasks, especially on legged robots.

03

VeBrain demonstrates strong adaptability and flexibility in real-world robotic applications.

Abstract

The remarkable progress of Multimodal Large Language Models (MLLMs) has attracted increasing attention to extending them to physical entities such as legged robots. This typically requires MLLMs not only to grasp multimodal understanding abilities, but also to integrate visual-spatial reasoning and physical interaction capabilities. Nevertheless, existing methods struggle to unify these capabilities due to their fundamental differences. In this paper, we present the Visual Embodied Brain (VeBrain), a unified framework for perception, reasoning, and control in the real world. VeBrain reformulates robotic control into common text-based MLLM tasks in the 2D visual space, thus unifying the objectives and mapping spaces of different tasks. Then, a novel robotic adapter is proposed to convert textual control signals from MLLMs to motion policies of real robots. From the data perspective, we further introduce…

Taxonomy

Topics: Multimodal Machine Learning Applications · Robotics and Automated Systems · Speech and dialogue systems

Methods: Softmax · Attention Is All You Need · ADaptive gradient method with the OPTimal convergence rate · Adapter