RADAR: Benchmarking Vision-Language-Action Generalization via Real-World Dynamics, Spatial-Physical Intelligence, and Autonomous Evaluation

Yuhao Chen; Zhihao Zhan; Xiaoxin Lin; Zijian Song; Hao Liu; Qinhan Lyu; Yubo Zu; Xiao Chen; Zhiyuan Liu; Tao Pu; Tianshui Chen; Keze Wang; Liang Lin; Guangrun Wang

arXiv:2602.10980·cs.RO·February 12, 2026

Open Access

TL;DR

RADAR is a comprehensive benchmark that evaluates vision-language-action models in realistic, dynamic, and autonomous settings, revealing significant gaps in current model generalization and reasoning abilities.

Contribution

This paper introduces RADAR, a novel benchmark with real-world dynamics, spatial reasoning tasks, and autonomous evaluation to better assess VLA models' real-world generalization.

Findings

01 Performance drops significantly under physical dynamics and sensor noise.

02 Models show limited spatial reasoning capabilities.

03 RADAR uncovers fragility in state-of-the-art VLA models.

Abstract

VLA models have achieved remarkable progress in embodied intelligence; however, their evaluation remains largely confined to simulations or highly constrained real-world settings. This mismatch creates a substantial reality gap, where strong benchmark performance often masks poor generalization in diverse physical environments. We identify three systemic shortcomings in current benchmarking practices that hinder fair and reliable model comparison. (1) Existing benchmarks fail to model real-world dynamics, overlooking critical factors such as dynamic object configurations, robot initial states, lighting changes, and sensor noise. (2) Current protocols neglect spatial-physical intelligence, reducing evaluation to rote manipulation tasks that do not probe geometric reasoning. (3) The field lacks scalable fully autonomous evaluation, instead relying on simplistic 2D metrics that miss 3D…

Taxonomy

Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Domain Adaptation and Few-Shot Learning