From Inference Efficiency to Embodied Efficiency: Revisiting Efficiency Metrics for Vision-Language-Action Models

Zhuofan Li; Hongkun Yang; Zhenyang Chen; Yangxuan Chen; Yingyan (Celine) Lin; Chaojian Li

arXiv:2603.19131·cs.LG·March 20, 2026

Open Access

TL;DR

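This paper argues that conventional efficiency metrics for vision-language-action (VLA) models, such as parameter count, FLOPs, and token decoding throughput, do not reflect efficiency on real robotic platforms. It proposes evaluating system-level embodied efficiency instead: task completion time, trajectory smoothness, cumulative joint rotation, and motion energy.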

Contribution

The paper introduces system-level embodied efficiency metrics and shows, through controlled studies, that they expose differences between VLA models that conventional metrics miss.

Findings

01 Conventional metrics can misrepresent real-world efficiency.

02 Embodied efficiency metrics reveal hidden performance differences.

03 Trade-offs exist between computational savings and motion quality.

Abstract

Vision-Language-Action (VLA) models have recently enabled embodied agents to perform increasingly complex tasks by jointly reasoning over visual, linguistic, and motor modalities. However, we find that the prevailing notion of "efficiency" in current VLA research, characterized by parameters, FLOPs, or token decoding throughput, does not reflect actual performance on robotic platforms. In real-world execution, efficiency is determined by system-level embodied behaviors such as task completion time, trajectory smoothness, cumulative joint rotation, and motion energy. Through controlled studies across model compression, token sparsification, and action sequence compression, we make several observations that challenge common assumptions. (1) Methods that reduce computation under conventional metrics often increase end-to-end execution cost or degrade motion quality, despite maintaining…
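
The four embodied quantities named in the abstract can be made concrete with a small sketch that computes them from a logged joint trajectory. The abstract does not give the paper's exact definitions, so everything below is an assumption for illustration: the function name `embodied_metrics`, the fixed sampling interval `dt`, mean squared jerk as the smoothness measure, and the integral of squared joint velocity as a motion-energy proxy are hypothetical choices, not the authors' formulas.

```python
from typing import Dict, List, Sequence


def _diff(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """First finite difference along time for a T x J trajectory."""
    return [[b - a for a, b in zip(r0, r1)] for r0, r1 in zip(rows, rows[1:])]


def embodied_metrics(q: Sequence[Sequence[float]], dt: float) -> Dict[str, float]:
    """Illustrative embodied-efficiency proxies for a joint trajectory.

    q  : T x J joint angles in radians, sampled every dt seconds.
    dt : sampling interval in seconds (assumed constant).

    All definitions here are assumptions, not the paper's formulas.
    """
    if len(q) < 4:
        raise ValueError("need at least 4 samples to estimate jerk")
    if dt <= 0:
        raise ValueError("dt must be positive")

    d1 = _diff(q)    # per-step joint displacement
    d2 = _diff(d1)   # second difference (acceleration * dt^2)
    d3 = _diff(d2)   # third difference (jerk * dt^3)

    # Wall-clock duration of the executed trajectory.
    completion_time = (len(q) - 1) * dt

    # Total absolute rotation summed over all joints and time steps.
    cumulative_rotation = sum(abs(x) for row in d1 for x in row)

    # Energy proxy: time integral of squared joint velocity (assumed).
    energy_proxy = sum((x / dt) ** 2 for row in d1 for x in row) * dt

    # Smoothness proxy: squared jerk summed over joints, averaged over
    # time steps; lower means smoother (assumed).
    mean_sq_jerk = sum((x / dt ** 3) ** 2 for row in d3 for x in row) / len(d3)

    return {
        "completion_time_s": completion_time,
        "cumulative_joint_rotation_rad": cumulative_rotation,
        "motion_energy_proxy": energy_proxy,
        "mean_squared_jerk": mean_sq_jerk,
    }
```

Comparing two policies that reach the same goal, one with a steady ramp and one that oscillates on the way, would show identical task success but different rotation, energy, and jerk values, which is the kind of difference the abstract says conventional metrics hide.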

Taxonomy

Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Reinforcement Learning in Robotics