TL;DR
This paper introduces RoboAbstention, a framework that benchmarks abstention in embodied robotic agents by generating instructions grounded in images from robotics datasets. The benchmark reveals significant weaknesses in current vision-language models.
Contribution
It presents a novel taxonomy and dataset for evaluating abstention in embodied robotics, along with methods to improve abstention performance.
Findings
All evaluated models show weaknesses in abstention; the best-performing model scores only 39%.
Interventions like prompting and in-context learning significantly improve abstention rates.
No current approach fully addresses the abstention challenge in embodied robotic agents.
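To make a figure like the 39% above concrete, here is a minimal sketch of how an abstention rate could be tallied overall and per category. It is an illustration under stated assumptions, not the paper's evaluation code: the `Trial` record, the function names, and the category labels (loosely borrowed from the abstract's wording) are all hypothetical, and the paper's actual metric may be defined differently.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Trial:
    """One benchmark instruction where abstaining is the desired behavior.

    Hypothetical record format; not taken from the paper.
    """
    category: str    # assumed label, e.g. "ambiguous" or "infeasible"
    abstained: bool  # True if the model declined to produce an action plan


def abstention_rates(trials):
    """Return {category: fraction of trials on which the model abstained}."""
    counts = defaultdict(lambda: [0, 0])  # category -> [abstained, total]
    for t in trials:
        counts[t.category][0] += int(t.abstained)
        counts[t.category][1] += 1
    return {c: a / n for c, (a, n) in counts.items()}


def overall_rate(trials):
    """Fraction of all trials on which the model abstained (0.0 if empty)."""
    trials = list(trials)
    if not trials:
        return 0.0
    return sum(t.abstained for t in trials) / len(trials)


# Toy data, invented purely for illustration.
trials = [
    Trial("ambiguous", True),
    Trial("ambiguous", False),
    Trial("infeasible", False),
    Trial("false_premise", True),
    Trial("false_premise", False),
]

print(abstention_rates(trials))
print(overall_rate(trials))
```

Reporting the rate per category as well as overall is a design choice for the sketch: a single aggregate can hide a model that abstains well on one kind of instruction and never on another.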
Abstract
Vision-language models (VLMs) are used as high-level planners for embodied agents, translating natural language instructions and visual observations into action plans. While prior work has studied abstention in LLMs, existing benchmarks are largely text-only and do not capture the perceptual grounding and physical constraints inherent to embodied robotics environments. In such settings, abstention requires recognizing when instructions are ambiguous, physically infeasible, based on false premises, or otherwise unresolvable given the available sensory modalities and context. To address this gap, we introduce a taxonomy to categorize abstention in the context of embodied robotics and present RoboAbstention, a scalable and auditable framework for generating abstention instructions grounded in images gathered from five robotics datasets. RoboAbstention instantiates the taxonomy through a…
