What to Ignore, What to React: Visually Robust RL Fine-Tuning of VLA Models

Yuanfang Peng; Jingjing Fu; Chuheng Zhang; Li Zhao; Jiang Bian; Mingyu Liu; Ling Zhang; Jun Zhang; Rui Wang

arXiv:2605.13105 · cs.RO · May 14, 2026

TL;DR

This paper introduces PAIR-VLA, a reinforcement learning fine-tuning framework that enhances visual robustness of VLA models in robotic manipulation by using paired visual variants to guide policy responses.

Contribution

The authors propose a novel RL fine-tuning method with auxiliary objectives that improve policy robustness to visual shifts by leveraging paired visual variants during training.

Findings

01. PAIR-VLA consistently improves over standard PPO across diverse visual shift scenarios.

02. It achieves average improvements of 16.62% and 9.10% on two VLA architectures.

03. The invariance and sensitivity guidance transfers across different visual shifts.

Abstract

Reinforcement learning (RL) fine-tuning has shown promise for Vision-Language-Action (VLA) models in robotic manipulation, but deployment-time visual shifts pose practical challenges. A key difficulty is that standard task rewards supervise task success, but offer limited guidance on whether a visual change is task-irrelevant or changes the behavior required for manipulation. We propose PAIR-VLA (Paired Action Invariance & Sensitivity for Visually Robust VLA), an RL fine-tuning framework to address this difficulty by adding two auxiliary objectives over paired visual variants during PPO optimization: an invariance term that reduces the discrepancy between action distributions for a task-preserving pair (e.g., different distractors), and a sensitivity objective that encourages separable action distributions for a task-altering pair (e.g., target object in a different pose). Together,…
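To make the two auxiliary objectives concrete, here is a minimal sketch of how an invariance term and a sensitivity term over paired visual variants might be combined with a PPO loss. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the divergence measure, the form of the sensitivity objective, or the loss weights, so the symmetric KL divergence, the hinge with a `margin`, the weights `lambda_inv` and `lambda_sens`, and all function names below are hypothetical. Action distributions are assumed to be discrete probability vectors.

```python
import math


def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def symmetric_kl(p, q):
    """Symmetrised KL divergence (an assumed choice of discrepancy measure)."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


def invariance_loss(p_orig, p_preserving):
    """Penalise discrepancy between action distributions for a task-preserving pair
    (e.g. the same scene with different distractors)."""
    return symmetric_kl(p_orig, p_preserving)


def sensitivity_loss(p_orig, p_altering, margin=1.0):
    """Encourage separable action distributions for a task-altering pair
    (e.g. the target object in a different pose).
    Assumed form: a hinge that is zero once the divergence exceeds `margin`."""
    return max(0.0, margin - symmetric_kl(p_orig, p_altering))


def total_loss(ppo_loss, p_orig, p_preserving, p_altering,
               lambda_inv=0.1, lambda_sens=0.1, margin=1.0):
    """PPO loss plus the two weighted auxiliary terms (weights are hypothetical)."""
    return (ppo_loss
            + lambda_inv * invariance_loss(p_orig, p_preserving)
            + lambda_sens * sensitivity_loss(p_orig, p_altering, margin))


if __name__ == "__main__":
    p_orig = [0.7, 0.2, 0.1]        # policy output on the original observation
    p_distractor = [0.6, 0.3, 0.1]  # same task, different distractors
    p_new_pose = [0.1, 0.2, 0.7]    # target object moved, so the action should change
    print(total_loss(1.0, p_orig, p_distractor, p_new_pose))
```

In this sketch the invariance term pulls the two distributions of a task-preserving pair together, while the hinge stops pushing a task-altering pair apart once the distributions are already separated by the margin, which keeps the sensitivity term bounded.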
