Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models
Haiweng Xu, Sipeng Zheng, Hao Luo, Wanpeng Zhang, Ziheng Xi, Zongqing Lu

TL;DR
This paper introduces BeTTER, a diagnostic benchmark revealing that current vision-language-action models fail in dynamic scenarios due to architectural bottlenecks, despite high success rates on standard benchmarks.
Contribution
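The paper presents BeTTER (Benchmark for Testing True Embodied Reasoning), a diagnostic benchmark that applies targeted causal interventions under kinematic isolation. This separates high-level reasoning failures from low-level execution limits and exposes architectural limitations in current VLA models.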
Findings
State-of-the-art VLAs fail catastrophically in dynamic scenarios.
Architectural bottlenecks degrade semantic representations.
Static evaluation masks underlying model deficiencies.
Abstract
Recent Vision-Language-Action (VLA) models report impressive success rates on standard robotic benchmarks, fueling optimism about general-purpose physical intelligence. However, recent evidence suggests a systematic misalignment between standard benchmark success and true embodied reasoning, raising the question of whether these high scores reflect genuine cognitive capability. To address this gap, we introduce BeTTER, a diagnostic Benchmark for Testing True Embodied Reasoning in robotic policies. BeTTER applies targeted causal interventions (e.g., spatial layout shifts, temporal extrapolation) while enforcing kinematic isolation to explicitly decouple high-level reasoning failures from low-level execution limits. Through systematic evaluation, we reveal that state-of-the-art VLAs catastrophically fail in dynamic scenarios, exhibiting severe lexical-kinematic shortcuts, behavioral…
