Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models

Haiweng Xu; Sipeng Zheng; Hao Luo; Wanpeng Zhang; Ziheng Xi; Zongqing Lu

arXiv:2604.18000 · cs.RO · April 21, 2026

TL;DR

This paper introduces BeTTER, a diagnostic benchmark revealing that current vision-language-action models fail in dynamic scenarios due to architectural bottlenecks, despite high success rates on standard benchmarks.

Contribution

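The paper presents BeTTER (Benchmark for Testing True Embodied Reasoning), a diagnostic benchmark that applies targeted causal interventions under kinematic isolation to separate high-level reasoning failures from low-level execution limits, exposing architectural limitations in current VLA models.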

Findings

01. State-of-the-art VLAs fail catastrophically in dynamic scenarios.

02. Architectural bottlenecks degrade semantic representations.

03. Static evaluation masks underlying model deficiencies.

Abstract

Recent Vision-Language-Action (VLA) models report impressive success rates on standard robotic benchmarks, fueling optimism about general-purpose physical intelligence. However, recent evidence suggests a systematic misalignment between standard benchmark success and true embodied reasoning, raising the question of whether these high scores reflect genuine cognitive capability. To address this gap, we introduce BeTTER, a diagnostic Benchmark for Testing True Embodied Reasoning in robotic policies. BeTTER applies targeted causal interventions (e.g., spatial layout shifts, temporal extrapolation) while enforcing kinematic isolation to explicitly decouple high-level reasoning failures from low-level execution limits. Through systematic evaluation, we reveal that state-of-the-art VLAs catastrophically fail in dynamic scenarios, exhibiting severe lexical-kinematic shortcuts, behavioral…
