STRONG-VLA: Decoupled Robustness Learning for Vision-Language-Action Models under Multimodal Perturbations
Yuhan Xie, Yuping Yan, Yunqi Zhao, Handing Wang, Yaochu Jin

TL;DR
STRONG-VLA introduces a decoupled fine-tuning framework that enhances robustness of vision-language-action models against multimodal perturbations, improving task success rates without sacrificing execution fidelity.
Contribution
The paper proposes a two-stage decoupled training approach and establishes a comprehensive benchmark for robustness under diverse multimodal perturbations.
Findings
Significant improvements in task success rates across multiple architectures.
Up to 16.49% gains on benchmark datasets under various perturbations.
Effective real-world validation on a robotic platform.
Abstract
Despite their strong performance in embodied tasks, recent Vision-Language-Action (VLA) models remain highly fragile under multimodal perturbations, where visual corruption and linguistic noise jointly induce distribution shifts that degrade task-level execution. Existing robustness approaches typically rely on joint training with perturbed data, treating robustness as a static objective, which leads to conflicting optimization between robustness and task fidelity. In this work, we propose STRONG-VLA, a decoupled fine-tuning framework that explicitly separates robustness acquisition from task-aligned refinement. In Stage I, the model is exposed to a curriculum of multimodal perturbations with increasing difficulty, enabling progressive robustness learning under controlled distribution shifts. In Stage II, the model is re-aligned with clean task distributions to recover execution…
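The two-stage idea described in the abstract can be illustrated with a small schedule sketch. This is not the authors' implementation: the linear severity ramp, the function names (`perturbation_severity`, `perturb_instruction`, `perturb_image`, `two_stage_schedule`), and the toy noise models are all assumptions made for illustration, and the abstract does not specify the actual curriculum or perturbation types at this level of detail.

```python
import random


def perturbation_severity(step, stage1_steps, max_severity=1.0):
    """Hypothetical linear curriculum: severity rises during Stage I,
    then drops to zero for Stage II (clean re-alignment)."""
    if step >= stage1_steps:
        return 0.0  # Stage II: clean task data only
    return max_severity * (step + 1) / stage1_steps


def perturb_instruction(text, severity, rng):
    """Toy linguistic noise (assumption): drop characters at a rate
    that scales with severity."""
    drop_prob = 0.2 * severity
    return "".join(ch for ch in text if rng.random() >= drop_prob)


def perturb_image(pixels, severity, rng):
    """Toy visual corruption (assumption): additive Gaussian noise on
    pixel values in [0, 1], clipped back into range."""
    sigma = 0.1 * severity
    return [min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in pixels]


def two_stage_schedule(stage1_steps, stage2_steps):
    """Return (stage, severity) for every training step."""
    schedule = []
    for step in range(stage1_steps + stage2_steps):
        stage = 1 if step < stage1_steps else 2
        schedule.append((stage, perturbation_severity(step, stage1_steps)))
    return schedule


if __name__ == "__main__":
    rng = random.Random(0)
    for stage, severity in two_stage_schedule(stage1_steps=4, stage2_steps=2):
        noisy_text = perturb_instruction("pick up the red block", severity, rng)
        noisy_image = perturb_image([0.2, 0.5, 0.9], severity, rng)
        print(stage, severity, noisy_text, noisy_image)
```

In a real fine-tuning loop, the perturbed instruction and image would be fed to the model in Stage I, and the unperturbed pair in Stage II. Keeping the severity schedule in a separate function makes the decoupling explicit: the two stages never mix perturbed and clean objectives in the same step.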
