Self-Improving Vision-Language-Action Models with Data Generation via Residual RL
Wenli Xiao, Haotian Lin, Andy Peng, Haoru Xue, Tairan He, Yuqi Xie, Fengyuan Hu, Jimmy Wu, Zhengyi Luo, Linxi "Jim" Fan, Guanya Shi, Yuke Zhu

TL;DR
This paper introduces PLD (Probe, Learn, Distill), a three-stage framework that improves vision-language-action (VLA) models through residual reinforcement learning and distribution-aware data collection, reporting 99% task success on LIBERO and gains of over 50% in SimplerEnv.
Contribution
The paper presents a novel residual RL-based framework with distribution-aware data collection for scalable self-improvement of VLA models, addressing the scalability limits that costly human demonstrations impose on supervised fine-tuning.
Findings
Achieves near-saturated 99% task success on LIBERO.
Improves performance by over 50% in SimplerEnv.
Reaches 100% success on real-world Franka and YAM arm manipulation tasks.
Abstract
Supervised fine-tuning (SFT) has become the de facto post-training strategy for large vision-language-action (VLA) models, but its reliance on costly human demonstrations limits scalability and generalization. We propose Probe, Learn, Distill (PLD), a three-stage plug-and-play framework that improves VLAs through residual reinforcement learning (RL) and distribution-aware data collection. In Stage 1, we train lightweight residual actors to probe failure regions of the VLA generalist. In Stage 2, we use a hybrid rollout scheme that aligns collected trajectories with the generalist's deployment distribution while capturing recovery behaviors. In Stage 3, we distill the curated trajectories back into the generalist with standard SFT. PLD achieves near-saturated 99% task success on LIBERO, over 50% gains in SimplerEnv, and 100% success on real-world Franka and YAM arm manipulation tasks.…
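The three stages described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the 1-D environment, the hand-written `base_policy` and `residual_actor`, the clipping limit, and every function name and parameter below are hypothetical stand-ins chosen only to show the shape of the idea. A frozen base policy (standing in for the VLA generalist) is rolled out for a random number of "probe" steps, a bounded residual correction is then added on top of the base action to recover, and only successful trajectories are kept as data for later SFT.

```python
import random

def base_policy(state, goal):
    # Hypothetical frozen generalist: it steps toward a point 0.3 short of
    # the goal, so on its own it stalls and never succeeds.
    return 0.5 * ((goal - 0.3) - state)

def residual_actor(state, base_action, goal, limit=0.2):
    # Hypothetical stand-in for a learned residual: a bounded correction
    # added on top of the base action (hand-written here, not trained).
    desired = 0.5 * (goal - state)
    return max(-limit, min(limit, desired - base_action))

def hybrid_rollout(goal, probe_steps, horizon=20, tol=0.05):
    # Run the base policy alone for `probe_steps` steps, then base + residual.
    state, trajectory = 0.0, []
    for t in range(horizon):
        action = base_policy(state, goal)
        if t >= probe_steps:
            action += residual_actor(state, action, goal)
        trajectory.append((state, action))
        state += action
        if abs(goal - state) < tol:
            return trajectory, True
    return trajectory, False

def collect_dataset(num_episodes, goal=1.0, max_probe=10, seed=0):
    # Keep only successful trajectories; their (state, action) pairs would
    # serve as supervised targets in a later SFT-style distillation step.
    rng = random.Random(seed)
    dataset = []
    for _ in range(num_episodes):
        probe_steps = rng.randint(0, max_probe)
        trajectory, success = hybrid_rollout(goal, probe_steps)
        if success:
            dataset.append(trajectory)
    return dataset

if __name__ == "__main__":
    _, base_only_ok = hybrid_rollout(goal=1.0, probe_steps=20)
    data = collect_dataset(num_episodes=8)
    print("base policy alone succeeds:", base_only_ok)
    print("successful hybrid trajectories kept:", len(data))
```

Because the early part of each trajectory comes from the base policy alone, the recorded states include ones the base policy reaches by itself, which is the intuition behind aligning collected data with the generalist's deployment distribution. How the actual method trains the residual, schedules the switch, and filters data is not specified in this summary.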
Peer Reviews
Decision: ICLR 2026 Poster
1. This paper addresses two core pain points of VLA models in SFT: "data dependence" and "distribution mismatch". PLD combines residual RL with distribution-aware data collection. It avoids the high resource consumption of directly fine-tuning large VLA models and solves the poor generalization issue of pure RL expert data, presenting a novel and logically consistent approach. 2. The effectiveness of PLD is well demonstrated through multi-environment (LIBERO, SimplerEnv, real robots), multi-mod
1. Direct comparisons with recent residual RL methods (e.g., EXPO[1], ResiP[2]) are missing. It is recommended to supplement comparative experiments or discussions on this aspect. 2. Current tasks (e.g., object grasping, peg insertion) are relatively simple. It is suggested to validate the method on more complex long-horizon tasks such as cloth folding. 3. The paper mentions that PLD trains lightweight residual policies but does not discuss costs such as training time and GPU memory usage. It
1. Using residual RL to enrich the data for VLA finetuning is promising. 2. The proposed pipeline is well designed and includes base policy probing, warm-start, a success classifier, and SFT for policy distillation. 3. The experiments on the LIBERO benchmark and real-world RL are comprehensive.
1. Using residual RL to finetune and resolve different gaps is not a particularly novel idea, and this paper does not discuss the related work across different areas in detail. For example, residual RL has been used in data-efficient learning [1], bridging the human2robot embodiment gap [2, 3], real2sim2real transfer for data scaling [4], peg insertion [2, 5], and dexterous manipulation [3, 4]. Discussing this related work and highlighting the contribution might be important. 2. Only binary reward h
- Presents a novel and practical three-stage framework that combines RL and SFT for post-training VLAs without relying on costly human demonstrations. - Effectively addresses scalability challenges in robot learning by improving data diversity (failure recovery trajectories) and reducing dependence on human demonstration data. - Provides robust evidence through extensive experiments on LIBERO, SimplerEnv, real-world Franka setups, and systematic ablation studies. - The paper is well-written a
- The real-world experiments focus on short-horizon tabletop manipulation where the base policy is already strong. It remains unclear how PLD performs when initial success rates are low or when recovery requires multi-step planning. Including a few such cases or a discussion of failure modes would better support claims of scalability. - When applying PLD in the real world, the trade-off between safe operation and sufficient state coverage raises concerns about the method’s ability to handle mor
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Reinforcement Learning in Robotics
