StageCraft: Execution Aware Mitigation of Distractor and Obstruction Failures in VLA Models
Kartikay Milind Pangaonkar, Prabin Rath, Omkar Patil, Nakul Gopalan

TL;DR
StageCraft is a training-free, VLM-based method that improves vision-language-action (VLA) policies by manipulating the environment's initial state to mitigate failures caused by distractors and obstructions, with a reported 40% performance improvement on real-world tasks.
Contribution
We introduce StageCraft, a novel environment manipulation approach leveraging large vision-language models to improve policy robustness without additional training.
Findings
40% performance improvement in real-world tasks
Effective in diverse distractor and obstruction scenarios
Adapts intervention based on policy strength
Abstract
Large-scale pretraining on text and image data, along with diverse robot demonstrations, has helped Vision-Language-Action models (VLAs) generalize to novel tasks, objects, and scenes. However, these models remain susceptible to failure in the presence of execution-time impediments such as distractors and physical obstructions in the robot's workspace. Existing policy improvement methods finetune base VLAs to improve generalization, yet they still struggle in unseen distractor settings. To address this problem, we investigate whether the internet-scale pretraining of large vision-language models (VLMs) can be leveraged to reason about these impediments and mitigate policy failures. To this end, we propose StageCraft, a training-free approach that improves pretrained VLA policy performance by manipulating the environment's initial state using VLM-based in-context reasoning. StageCraft takes…
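The idea of staging the initial scene before the policy runs can be illustrated with a minimal Python sketch. Everything named here is hypothetical and not taken from the paper: `SceneObject`, `stage_scene`, and `stub_vlm` are illustrative names, the scene is a plain list of labels rather than camera images, and `stub_vlm` is a keyword heuristic standing in for a real VLM query. It shows only the control flow of "ask a reasoner which objects impede the task, clear them, then hand the scene to an unchanged policy".

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SceneObject:
    """Hypothetical scene entry: an object label and whether it was cleared."""
    name: str
    removed: bool = False


def stage_scene(
    objects: List[SceneObject],
    task: str,
    propose_removals: Callable[[str, List[str]], List[str]],
) -> List[str]:
    """Clear the objects the reasoner flags and return the names that remain.

    `propose_removals` stands in for a VLM call; the policy itself is never
    retrained, only the initial scene changes.
    """
    visible = [obj.name for obj in objects if not obj.removed]
    flagged = set(propose_removals(task, visible))
    for obj in objects:
        if obj.name in flagged:
            obj.removed = True
    return [obj.name for obj in objects if not obj.removed]


def stub_vlm(task: str, names: List[str]) -> List[str]:
    """Assumed stand-in for a VLM: flag every object the task never mentions."""
    return [name for name in names if name not in task]


scene = [SceneObject("cup"), SceneObject("bowl"), SceneObject("plate")]
remaining = stage_scene(scene, "put the cup on the plate", stub_vlm)
print(remaining)
```

In this toy run the bowl is the only object the task does not mention, so it is the only one cleared. A real system would replace `stub_vlm` with a model query over images and would move objects physically rather than flipping a flag.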
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Explainable Artificial Intelligence (XAI)
