STRONG-VLA: Decoupled Robustness Learning for Vision-Language-Action Models under Multimodal Perturbations

Yuhan Xie; Yuping Yan; Yunqi Zhao; Handing Wang; Yaochu Jin

arXiv:2604.10055 · cs.RO · April 15, 2026

TL;DR

STRONG-VLA introduces a decoupled fine-tuning framework that enhances robustness of vision-language-action models against multimodal perturbations, improving task success rates without sacrificing execution fidelity.

Contribution

The paper proposes a two-stage decoupled training approach and establishes a comprehensive benchmark for evaluating robustness under diverse multimodal perturbations.

Findings

01 Significant improvements in task success rates across multiple architectures.

02 Gains of up to 16.49% on benchmark datasets under various perturbations.

03 Effective real-world validation on a robotic platform.

Abstract

Despite their strong performance in embodied tasks, recent Vision-Language-Action (VLA) models remain highly fragile under multimodal perturbations, where visual corruption and linguistic noise jointly induce distribution shifts that degrade task-level execution. Existing robustness approaches typically rely on joint training with perturbed data, treating robustness as a static objective, which leads to conflicting optimization between robustness and task fidelity. In this work, we propose STRONG-VLA, a decoupled fine-tuning framework that explicitly separates robustness acquisition from task-aligned refinement. In Stage I, the model is exposed to a curriculum of multimodal perturbations with increasing difficulty, enabling progressive robustness learning under controlled distribution shifts. In Stage II, the model is re-aligned with clean task distributions to recover execution…
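
The two-stage schedule described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the linear difficulty ramp, the Gaussian pixel noise, the character-drop language noise, and every name here (`Sample`, `perturbation_level`, `perturb`, `two_stage_schedule`, `train_step`) are assumptions made for the example, since the abstract specifies neither the curriculum schedule nor the exact perturbation types.

```python
import random
from dataclasses import dataclass


@dataclass
class Sample:
    image: list[float]   # flattened pixel intensities in [0, 1] (toy stand-in)
    instruction: str     # natural-language task instruction


def perturbation_level(step, stage1_steps, max_level=1.0):
    """Assumed schedule: linear ramp from 0 to max_level across Stage I."""
    if stage1_steps <= 1:
        return max_level
    return max_level * min(step, stage1_steps - 1) / (stage1_steps - 1)


def perturb(sample, level, rng):
    """Assumed perturbations: Gaussian pixel noise and random character
    drops, both scaled by the current difficulty level."""
    noisy = [min(1.0, max(0.0, p + rng.gauss(0.0, 0.2 * level)))
             for p in sample.image]
    kept = [c for c in sample.instruction if rng.random() >= 0.3 * level]
    return Sample(noisy, "".join(kept))


def two_stage_schedule(dataset, stage1_steps, stage2_steps, train_step, seed=0):
    """Stage I: train on perturbed samples of increasing difficulty.
    Stage II: train on clean samples only, to re-align with the task."""
    rng = random.Random(seed)
    for step in range(stage1_steps):
        level = perturbation_level(step, stage1_steps)
        train_step(perturb(rng.choice(dataset), level, rng), stage=1, level=level)
    for _ in range(stage2_steps):
        train_step(rng.choice(dataset), stage=2, level=0.0)
```

Here `train_step` is a hypothetical callback standing in for one optimizer update of the model. The point of the sketch is only the ordering: perturbed data is confined to the first stage, so the second stage optimizes against clean data without a competing robustness objective.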
