Do What? Teaching Vision-Language-Action Models to Reject the Impossible
Wen-Han Hsieh, Elvis Hsieh, Dantong Niu, Trevor Darrell, Roei Herzig, David M. Chan

TL;DR
This paper introduces IVA, a framework for vision-language-action models to detect, clarify, and respond to false-premise instructions in robotic tasks, improving robustness and understanding of user intent.
Contribution
The paper presents Instruct-Verify-and-Act (IVA), a unified framework that detects false-premise instructions and responds with language-based clarification or correction, with the aim of making VLA models more robust to impossible requests.
Findings
IVA improves false-premise detection accuracy by 97.56% over baselines.
IVA increases successful responses in false-premise scenarios by 50.78%.
The authors construct a large-scale instruction-tuning dataset for training and evaluation.
Abstract
Recently, Vision-Language-Action (VLA) models have demonstrated strong performance on a range of robotic tasks. These models rely on multimodal inputs, with language instructions playing a crucial role -- not only in predicting actions, but also in robustly interpreting user intent, even when the requests are impossible to fulfill. In this work, we investigate how VLAs can recognize, interpret, and respond to false-premise instructions: natural language commands that reference objects or conditions absent from the environment. We propose Instruct-Verify-and-Act (IVA), a unified framework that (i) detects when an instruction cannot be executed due to a false premise, (ii) engages in language-based clarification or correction, and (iii) grounds plausible alternatives in perception and action. Towards this end, we construct a large-scale instruction tuning setup with structured language…
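The three stages listed in the abstract (detect, clarify, ground an alternative) can be illustrated with a toy control flow. The sketch below is not the authors' method: IVA is a trained VLA model, whereas this is a hand-written symbolic stand-in. It assumes the scene is already reduced to a list of object names, and the names `instruct_verify_act`, `Response`, and the `alternatives` mapping are all hypothetical.

```python
from dataclasses import dataclass
from typing import Mapping, Optional, Sequence


@dataclass
class Response:
    """Outcome of one step (hypothetical structure, not from the paper)."""
    executable: bool        # True if the instruction's premise holds
    message: str            # language reply returned to the user
    target: Optional[str]   # object to act on or to propose, if any


def instruct_verify_act(
    referenced_object: str,
    visible_objects: Sequence[str],
    alternatives: Mapping[str, str],
) -> Response:
    """Toy symbolic illustration of detect / clarify / ground.

    Assumption: perception has already produced `visible_objects`, and
    `alternatives` maps a missing object to a plausible substitute.
    """
    # (i) Detect: the premise is false if the referenced object is absent.
    if referenced_object in visible_objects:
        return Response(True, f"Acting on the {referenced_object}.", referenced_object)

    # (ii) Clarify and (iii) ground: propose a substitute only if it is
    # actually present in the scene.
    substitute = alternatives.get(referenced_object)
    if substitute in visible_objects:
        return Response(
            False,
            f"I can't see a {referenced_object}. Did you mean the {substitute}?",
            substitute,
        )

    # No grounded alternative: refuse and explain why.
    return Response(False, f"I can't see a {referenced_object}, so I can't do that.", None)
```

For example, `instruct_verify_act("cup", ["mug", "plate"], {"cup": "mug"})` returns a non-executable response that proposes the mug, while an instruction about an object with no visible substitute is refused with an explanation.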
