How Foundational Skills Influence VLM-based Embodied Agents: A Native Perspective

Bo Peng; Pi Bu; Keyu Pan; Xinrun Xu; Yinxiu Zhao; Miao Chen; Yang Du; Lin Li; Jun Song; Tong Xu

arXiv:2602.20687 · cs.AI · February 25, 2026

Open Access

TL;DR

This paper introduces NativeEmbodied, a comprehensive benchmark with native low-level actions for evaluating VLM-driven embodied agents across multiple tasks and skills, revealing fundamental limitations in current models.

Contribution

The paper presents NativeEmbodied, a benchmark that evaluates embodied agents on both low-level and high-level tasks within a unified native action space, enabling detailed performance analysis.

Findings

01 State-of-the-art VLMs show deficiencies in fundamental embodied skills.

02 Fundamental skill bottlenecks limit high-level task performance.

03 NativeEmbodied provides insights to improve future embodied agent research.

Abstract

Recent advances in vision-language models (VLMs) have shown promise for human-level embodied intelligence. However, existing benchmarks for VLM-driven embodied agents often rely on high-level commands or discretized action spaces, which are non-native settings that differ markedly from real-world control. In addition, current benchmarks focus primarily on high-level tasks and lack joint evaluation and analysis at both low and high levels. To address these limitations, we present NativeEmbodied, a challenging benchmark for VLM-driven embodied agents that uses a unified, native low-level action space. Built on diverse simulated scenes, NativeEmbodied includes three representative high-level tasks in complex scenarios to evaluate overall performance. For more detailed analysis, we further decouple the skills required by complex tasks and construct four types of low-level tasks, each…

Taxonomy

Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Action Observation and Synchronization