TL;DR
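This study systematically evaluates how reinforcement learning (RL) fine-tuning, especially with PPO, affects the generalization of large Vision-Language Action (VLA) models across visual, semantic, and execution dimensions, and finds that it outperforms supervised fine-tuning (SFT) in semantic understanding and execution robustness.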
Contribution
The paper introduces a comprehensive benchmark for VLA generalization and shows that RL fine-tuning, particularly with PPO, generalizes better than supervised fine-tuning in semantic understanding and execution robustness.
Findings
RL fine-tuning with PPO improves generalization in semantic understanding and execution robustness over SFT.
PPO outperforms DPO and GRPO for VLAs.
A simple PPO training recipe enhances VLA generalization.
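For readers unfamiliar with PPO, the clipped surrogate objective it optimizes can be sketched as below. This is a minimal illustration of the standard PPO-clip loss only, not the paper's training recipe: the function name `ppo_clipped_loss`, the default `clip_eps=0.2`, and the per-sample list inputs are assumptions made for the example.

```python
import math

def ppo_clipped_loss(new_logps, old_logps, advantages, clip_eps=0.2):
    """Mean negative PPO-clip surrogate over a batch of samples.

    new_logps / old_logps: log-probabilities of the taken actions under the
    current and the data-collecting policy (hypothetical inputs).
    advantages: advantage estimates for the same samples.
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        # Probability ratio between the current and the old policy.
        ratio = math.exp(new_lp - old_lp)
        # Keep the ratio inside [1 - eps, 1 + eps].
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Pessimistic (lower) of the unclipped and clipped objectives.
        total += min(ratio * adv, clipped * adv)
    # Negate so that minimizing the loss maximizes the surrogate.
    return -total / len(advantages)
```

Clipping removes the incentive to move the policy far from the one that collected the data within a single update, which is the usual motivation for choosing PPO over an unclipped policy gradient.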
Abstract
Large Vision-Language Action (VLA) models have shown significant potential for embodied AI. However, their predominant training via supervised fine-tuning (SFT) limits generalization due to susceptibility to compounding errors under distribution shifts. Reinforcement learning (RL) offers a path to overcome these limitations by optimizing for task objectives via trial-and-error, yet a systematic understanding of its specific generalization benefits for VLAs compared to SFT is lacking. To address this, our study introduces a comprehensive benchmark for evaluating VLA generalization and systematically investigates the impact of RL fine-tuning across diverse visual, semantic, and execution dimensions. Our extensive experiments reveal that RL fine-tuning, particularly with PPO, significantly enhances generalization in semantic understanding and execution robustness over SFT, while…
Taxonomy
Methods: Shrink and Fine-Tune · Entropy Regularization · Proximal Policy Optimization · Direct Preference Optimization
