RationalVLA: A Rational Vision-Language-Action Model with Dual System

Wenxuan Song; Jiayi Chen; Wenxue Li; Xu He; Han Zhao; Can Cui; Pengxiang Ding; Shiyan Su; Feilong Tang; Xuelian Cheng; Donglin Wang; Zongyuan Ge; Xinhu Zheng; Zhe Liu; Hesheng Wang; Haoang Li

arXiv:2506.10826·cs.RO·June 16, 2025

Open Access

TL;DR

RationalVLA is a dual-system model that makes robotic manipulation more robust by understanding natural language instructions, reasoning about them, and rejecting those that are infeasible. It is evaluated on a new, challenging benchmark of diverse defective commands.

Contribution

The paper introduces RAMA, a new benchmark whose dataset contains over 14,000 samples, including defective instructions spanning six dimensions, and proposes RationalVLA, a dual-system vision-language-action model that handles both executable instructions and defective ones that should be rejected.

Findings

01. RationalVLA achieves a 14.5% higher success rate than state-of-the-art baselines on RAMA.

02. It effectively rejects infeasible instructions.

03. It maintains competitive performance on standard manipulation tasks.

Abstract

A fundamental requirement for real-world robotic deployment is the ability to understand and respond to natural language instructions. Existing language-conditioned manipulation tasks typically assume that instructions are perfectly aligned with the environment. This assumption limits robustness and generalization in realistic scenarios where instructions may be ambiguous, irrelevant, or infeasible. To address this problem, we introduce RAtional MAnipulation (RAMA), a new benchmark that challenges models with both unseen executable instructions and defective ones that should be rejected. In RAMA, we construct a dataset with over 14,000 samples, including diverse defective instructions spanning six dimensions: visual, physical, semantic, motion, safety, and out-of-context. We further propose the Rational Vision-Language-Action model (RationalVLA). It is a dual system for robotic arms…

Taxonomy

Topics: Semantic Web and Ontologies · Cognitive Science and Mapping · Multi-Agent Systems and Negotiation