Improving Robustness of Vision-Language-Action Models by Restoring Corrupted Visual Inputs

Daniel Yezid Guarnizo Orjuela; Leonardo Scappatura; Veronica Di Gennaro; Riccardo Andrea Izzo; Gianluca Bardaro; Matteo Matteucci

arXiv:2602.01158 · cs.CV · February 3, 2026

Open Access

TL;DR

This paper identifies the vulnerability of vision-language-action models to visual corruptions and introduces CRT, a transformer-based restoration method, to improve their robustness without retraining the entire model.

Contribution

The paper proposes CRT, a plug-and-play vision transformer that restores corrupted visual inputs, significantly enhancing VLA models' robustness against sensor artifacts.

Findings

01. CRT restores visual inputs effectively under severe corruption.

02. VLA models maintain near-baseline success rates with CRT.

03. CRT does not require fine-tuning of the original models.
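
The plug-and-play idea in these findings (restore the image first, leave the policy untouched) can be illustrated with a small wrapper. This page does not describe CRT's actual interface, so everything below is an assumption made for illustration: the `Image` and `Action` types, `with_restoration`, and the toy stand-ins `toy_policy` and `toy_restore` are hypothetical names, and the clamp-based restorer is only a placeholder, not CRT.

```python
from typing import Callable, List

# Hypothetical types: a grayscale image as nested lists with values
# nominally in [0, 1], and an action as a flat list of floats.
Image = List[List[float]]
Action = List[float]
Policy = Callable[[Image, str], Action]


def with_restoration(policy: Policy, restore: Callable[[Image], Image]) -> Policy:
    """Wrap a frozen policy so each frame is restored before the policy sees it.

    The policy itself is never modified, which mirrors the
    "no fine-tuning of the original model" property in spirit only.
    """
    def wrapped(image: Image, instruction: str) -> Action:
        return policy(restore(image), instruction)
    return wrapped


def toy_policy(image: Image, instruction: str) -> Action:
    # Stand-in for a VLA: returns mean brightness as a 1-D "action".
    flat = [p for row in image for p in row]
    return [sum(flat) / len(flat)]


def toy_restore(image: Image) -> Image:
    # Placeholder restorer (NOT CRT): clamps out-of-range pixels to [0, 1].
    return [[min(1.0, max(0.0, p)) for p in row] for row in image]


robust_policy = with_restoration(toy_policy, toy_restore)
```

Calling `robust_policy(frame, "pick up the cup")` has the same signature as calling the original policy, so under this assumed interface the restorer could be swapped in or out without touching the policy code.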

Abstract

Vision-Language-Action (VLA) models have emerged as a dominant paradigm for generalist robotic manipulation, unifying perception and control within a single end-to-end architecture. However, despite their success in controlled environments, reliable real-world deployment is severely hindered by their fragility to visual disturbances. While existing literature extensively addresses physical occlusions caused by scene geometry, a critical mode remains largely unexplored: image corruptions. These sensor-level artifacts, ranging from electronic noise and dead pixels to lens contaminants, directly compromise the integrity of the visual signal prior to interpretation. In this work, we quantify this vulnerability, demonstrating that state-of-the-art VLAs such as $π_{0.5}$ and SmolVLA suffer catastrophic performance degradation, dropping from 90% success rates to as low as 2%, under common…
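
To make the sensor-level artifacts named in the abstract concrete, here is a minimal sketch of two of them: additive electronic noise and dead pixels. The paper's actual corruption definitions and severity levels are not given on this page, so the function names, the Gaussian noise model, the `sigma` and `fraction` parameters, and the nested-list image format are all assumptions chosen for illustration.

```python
import random
from typing import List

# Hypothetical image format: grayscale, nested lists, values in [0, 1].
Image = List[List[float]]


def add_gaussian_noise(image: Image, sigma: float, rng: random.Random) -> Image:
    """Return a copy with zero-mean Gaussian noise added, clamped to [0, 1]."""
    return [
        [min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in row]
        for row in image
    ]


def add_dead_pixels(image: Image, fraction: float, rng: random.Random) -> Image:
    """Return a copy in which a random fraction of pixels is stuck at 0.0."""
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]  # copy rows so the input stays intact
    n_dead = int(fraction * height * width)
    # Sample distinct flat indices, then map each back to (row, column).
    for idx in rng.sample(range(height * width), n_dead):
        out[idx // width][idx % width] = 0.0
    return out
```

Passing an explicit `random.Random` instance keeps the corruption reproducible, which is useful when comparing a policy's behaviour on the same corrupted frames with and without a restoration step.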

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Multimodal Machine Learning Applications · Robot Manipulation and Learning