Run-time Observation Interventions Make Vision-Language-Action Models More Visually Robust
Asher J. Hancock, Allen Z. Ren, Anirudha Majumdar

TL;DR
This paper introduces BYOVLA, a run-time intervention method that enhances the visual robustness of vision-language-action models by dynamically editing input images to mitigate distractor effects without needing model fine-tuning.
Contribution
BYOVLA is a novel run-time intervention scheme that improves VLA model robustness to visual distractors through automated image editing, compatible with existing models without retraining.
Findings
BYOVLA maintains near-nominal performance despite distractors.
Without the intervention, distractors degrade task success rates by up to 40%.
The method works with off-the-shelf VLA models without fine-tuning.
Abstract
Vision-language-action (VLA) models trained on large-scale internet data and robot demonstrations have the potential to serve as generalist robot policies. However, despite their large-scale training, VLAs are often brittle to task-irrelevant visual details such as distractor objects or background colors. We introduce Bring Your Own VLA (BYOVLA): a run-time intervention scheme that (1) dynamically identifies regions of the input image that the model is sensitive to, and (2) minimally alters task-irrelevant regions to reduce the model's sensitivity using automated image editing tools. Our approach is compatible with any off-the-shelf VLA without model fine-tuning or access to the model's weights. Hardware experiments on language-instructed manipulation tasks demonstrate that BYOVLA enables state-of-the-art VLA models to nearly retain their nominal performance in the presence of…
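The two-step loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `fill_region` and `intervene`, the mean-color fill used in place of a real image editing tool, the action-difference sensitivity measure, and the `threshold` value are all assumptions made for illustration. The policy is treated as a black-box function from image to action, and the masks for task-irrelevant regions are assumed to be supplied by some upstream step.

```python
import numpy as np


def fill_region(image, mask):
    """Crude stand-in for an image editor (an assumption, not the paper's tool):
    paint the masked pixels with the mean color of the unmasked pixels."""
    edited = image.copy()
    edited[mask] = image[~mask].mean(axis=0)
    return edited


def intervene(policy, image, irrelevant_masks, threshold=0.1):
    """Edit only those task-irrelevant regions the policy is sensitive to.

    policy           -- black-box callable: image (H, W, 3) -> action vector
    irrelevant_masks -- list of boolean (H, W) masks, assumed given
    threshold        -- hypothetical cutoff on the change in action
    """
    base_action = policy(image)
    out = image.copy()
    edited_regions = []
    for i, mask in enumerate(irrelevant_masks):
        # Step 1: perturb one region and measure how far the action moves.
        perturbed = fill_region(image, mask)
        sensitivity = np.linalg.norm(policy(perturbed) - base_action)
        # Step 2: alter the region only if the policy reacted to it.
        if sensitivity > threshold:
            out = fill_region(out, mask)
            edited_regions.append(i)
    return out, edited_regions
```

Because the sketch only queries `policy` for outputs, it needs neither fine-tuning nor weight access, which mirrors the compatibility claim in the abstract. Regions whose perturbation barely changes the action are left untouched, so the image is altered as little as possible.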
Taxonomy
Topics: Multimodal Machine Learning Applications
