VLA-OPD: Bridging Offline SFT and Online RL for Vision-Language-Action Models via On-Policy Distillation
Zhide Zhong, Haodong Yan, Junfeng Li, Junjie He, Tianran Zhang, Haoang Li

TL;DR
VLA-OPD is an on-policy distillation framework that bridges offline supervised fine-tuning (SFT) and online reinforcement learning (RL), improving sample efficiency and robustness when post-training vision-language-action (VLA) models for robotic manipulation.
Contribution
The paper proposes a reverse-KL-based on-policy distillation method that stabilizes training and preserves pre-trained capabilities while improving learning efficiency.
Findings
VLA-OPD outperforms pure RL in sample efficiency.
VLA-OPD maintains pre-trained capabilities better than standard SFT.
VLA-OPD shows robustness and reduced catastrophic forgetting on benchmarks.
Abstract
Although pre-trained Vision-Language-Action (VLA) models exhibit impressive generalization in robotic manipulation, post-training remains crucial to ensure reliable performance during deployment. However, standard offline Supervised Fine-Tuning (SFT) suffers from distribution shifts and catastrophic forgetting of pre-trained capabilities, while online Reinforcement Learning (RL) struggles with sparse rewards and poor sample efficiency. In this paper, we propose On-Policy VLA Distillation (VLA-OPD), a framework bridging the efficiency of SFT with the robustness of RL. Instead of relying on sparse environmental rewards, VLA-OPD leverages an expert teacher to provide dense, token-level supervision on the student's self-generated trajectories. This enables active error correction on policy-induced states while preserving pre-trained general capabilities through gentle alignment. Crucially,…
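To make the idea of dense, token-level teacher supervision on student-generated trajectories concrete, here is a minimal sketch of a per-token reverse-KL loss, KL(student || teacher), averaged over the action tokens of one rollout. It is an illustration under stated assumptions, not the authors' implementation: the function names (`log_softmax`, `reverse_kl`, `trajectory_loss`), the use of raw logits over a small action vocabulary, and the plain averaging are all hypothetical choices, and the paper's exact objective may differ.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one token's logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at a single action-token position.

    The expectation is taken under the student's own distribution,
    which is what makes this the "reverse" direction.
    """
    log_s = log_softmax(student_logits)
    log_t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(log_s, log_t))

def trajectory_loss(student_logits_seq, teacher_logits_seq):
    """Average per-token reverse KL over one student-generated rollout.

    Assumption: both sequences hold logits for the same states, i.e. the
    teacher is queried on the states the student actually visited.
    """
    per_token = [
        reverse_kl(s, t)
        for s, t in zip(student_logits_seq, teacher_logits_seq)
    ]
    return sum(per_token) / len(per_token)

# Hypothetical rollout: 2 action tokens, vocabulary of 3 action bins.
student_seq = [[2.0, 0.0, 0.0], [0.5, 0.5, 0.5]]
teacher_seq = [[0.0, 0.0, 0.0], [0.5, 0.5, 0.5]]
loss = trajectory_loss(student_seq, teacher_seq)
```

In this sketch every token position contributes a loss term, in contrast to a sparse environmental reward that arrives only at the end of an episode. Because the expectation is taken under the student, the loss is not symmetric in its two arguments; swapping the student and teacher logits generally gives a different value.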
