Run-time Observation Interventions Make Vision-Language-Action Models More Visually Robust

Asher J. Hancock; Allen Z. Ren; Anirudha Majumdar

arXiv:2410.01971·cs.RO·October 4, 2024

TL;DR

This paper introduces BYOVLA, a run-time intervention method that enhances the visual robustness of vision-language-action models by dynamically editing input images to mitigate distractor effects without needing model fine-tuning.

Contribution

BYOVLA is a novel run-time intervention scheme that improves VLA model robustness to visual distractors through automated image editing, compatible with existing models without retraining.

Findings

01. BYOVLA maintains near-nominal performance despite distractors.

02. It reduces task failure rates caused by distractors by up to 40%.

03. The method works with off-the-shelf VLA models without fine-tuning.

Abstract

Vision-language-action (VLA) models trained on large-scale internet data and robot demonstrations have the potential to serve as generalist robot policies. However, despite their large-scale training, VLAs are often brittle to task-irrelevant visual details such as distractor objects or background colors. We introduce Bring Your Own VLA (BYOVLA): a run-time intervention scheme that (1) dynamically identifies regions of the input image that the model is sensitive to, and (2) minimally alters task-irrelevant regions to reduce the model's sensitivity using automated image editing tools. Our approach is compatible with any off-the-shelf VLA without model fine-tuning or access to the model's weights. Hardware experiments on language-instructed manipulation tasks demonstrate that BYOVLA enables state-of-the-art VLA models to nearly retain their nominal performance in the presence of…
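The two steps named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`fill_region`, `action_distance`, `intervene`), the perturb-and-compare sensitivity probe, the distance threshold, and the flat background fill standing in for "automated image editing tools" are all assumptions made for illustration. The policy is treated as a black box that maps an image to an action vector, which matches the claim that no weights or fine-tuning are needed.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Image = List[List[float]]                    # grayscale image as rows of pixels
Region = List[Tuple[int, int]]               # (row, col) coordinates of one region
Policy = Callable[[Image], Sequence[float]]  # black box: image -> action vector


def fill_region(image: Image, region: Region, value: float) -> Image:
    """Return a copy of `image` with every pixel of `region` set to `value`.

    Hypothetical stand-in for a real image-editing tool such as inpainting.
    """
    edited = [row[:] for row in image]
    for r, c in region:
        edited[r][c] = value
    return edited


def action_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two action vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def intervene(
    image: Image,
    policy: Policy,
    irrelevant_regions: Dict[str, Region],
    background: float,
    threshold: float,
) -> Tuple[Image, List[str]]:
    """Edit only those task-irrelevant regions the policy is sensitive to."""
    nominal = policy(image)
    edited = image
    changed: List[str] = []
    for name, region in irrelevant_regions.items():
        # Step 1 (assumed probe): perturb one region, measure the action shift.
        probe = fill_region(image, region, background)
        if action_distance(policy(probe), nominal) > threshold:
            # Step 2: alter only the regions that actually move the action.
            edited = fill_region(edited, region, background)
            changed.append(name)
    return edited, changed


# Toy usage: a "policy" whose single action value is the sum of all pixels.
def toy_policy(img: Image) -> List[float]:
    return [sum(sum(row) for row in img)]


observation = [[5.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
regions = {"distractor": [(0, 0)], "table": [(2, 2)]}
cleaned, edited_names = intervene(observation, toy_policy, regions, 0.0, 1.0)
```

In this toy run only the region that shifts the action is edited, and regions the policy ignores are left untouched, which is the sense in which the alteration is minimal. The task-irrelevant regions are supplied by hand here; how they are obtained in practice is outside this sketch.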

Taxonomy

Topics: Multimodal Machine Learning Applications