How Foundational Skills Influence VLM-based Embodied Agents: A Native Perspective
Bo Peng, Pi Bu, Keyu Pan, Xinrun Xu, Yinxiu Zhao, Miao Chen, Yang Du, Lin Li, Jun Song, Tong Xu

TL;DR
This paper introduces NativeEmbodied, a comprehensive benchmark with native low-level actions for evaluating VLM-driven embodied agents across multiple tasks and skills, revealing fundamental limitations in current models.
Contribution
The paper presents NativeEmbodied, a benchmark that evaluates embodied agents on both low-level and high-level tasks using a unified native action space, enabling detailed performance analysis.
Findings
State-of-the-art VLMs show deficiencies in fundamental embodied skills.
Fundamental skill bottlenecks limit high-level task performance.
NativeEmbodied provides insights to improve future embodied agent research.
Abstract
Recent advances in vision-language models (VLMs) have shown promise for human-level embodied intelligence. However, existing benchmarks for VLM-driven embodied agents often rely on high-level commands or discretized action spaces, which are non-native settings that differ markedly from real-world control. In addition, current benchmarks focus primarily on high-level tasks and lack joint evaluation and analysis at both low and high levels. To address these limitations, we present NativeEmbodied, a challenging benchmark for VLM-driven embodied agents that uses a unified, native low-level action space. Built on diverse simulated scenes, NativeEmbodied includes three representative high-level tasks in complex scenarios to evaluate overall performance. For more detailed analysis, we further decouple the skills required by complex tasks and construct four types of low-level tasks, each…
Taxonomy
Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Action Observation and Synchronization
