VLA-OPD: Bridging Offline SFT and Online RL for Vision-Language-Action Models via On-Policy Distillation

Zhide Zhong; Haodong Yan; Junfeng Li; Junjie He; Tianran Zhang; Haoang Li

arXiv:2603.26666·cs.RO·March 30, 2026

TL;DR

VLA-OPD is an on-policy distillation framework for post-training vision-language-action models in robotic manipulation. An expert teacher gives dense, token-level supervision on the student's own trajectories, aiming to pair the efficiency of offline supervised fine-tuning with the robustness of online reinforcement learning.

Contribution

It proposes a reverse-KL based on-policy distillation method that stabilizes training and preserves pre-trained capabilities while improving learning efficiency.
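
To make the reverse-KL idea concrete, here is a minimal sketch of a per-token reverse KL divergence, KL(student || teacher), averaged over a sequence the student generated itself. This is an illustration under stated assumptions, not the paper's implementation: the function names (`log_softmax`, `reverse_kl`, `sequence_reverse_kl`), the use of raw logits over a small action-token vocabulary, and the simple mean over tokens are all hypothetical choices, and the exact objective used by VLA-OPD is not given on this page.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one token's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at a single token position.

    The expectation is taken under the student's distribution, which is
    what makes this the "reverse" direction relative to standard SFT-style
    distillation, KL(teacher || student).
    """
    log_p = log_softmax(student_logits)   # student log-probs
    log_q = log_softmax(teacher_logits)   # teacher log-probs
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def sequence_reverse_kl(student_seq, teacher_seq):
    """Mean per-token reverse KL over one student-generated sequence.

    Assumption: both inputs are lists of per-token logit vectors, with the
    teacher evaluated on the same states/prefixes the student visited.
    """
    per_token = [reverse_kl(s, t) for s, t in zip(student_seq, teacher_seq)]
    return sum(per_token) / len(per_token)


if __name__ == "__main__":
    student = [[2.0, 0.0], [0.0, 0.0]]
    teacher = [[0.0, 2.0], [0.0, 0.0]]
    print(sequence_reverse_kl(student, teacher))
```

Because the divergence is weighted by the student's own probabilities, tokens the student rarely produces contribute little to the loss, which is one common intuition for why reverse KL is described as mode-seeking and comparatively gentle on existing behavior.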

Findings

01. VLA-OPD outperforms pure RL in sample efficiency.

02. VLA-OPD maintains pre-trained capabilities better than standard SFT.

03. VLA-OPD shows robustness and reduced catastrophic forgetting on benchmarks.

Abstract

Although pre-trained Vision-Language-Action (VLA) models exhibit impressive generalization in robotic manipulation, post-training remains crucial to ensure reliable performance during deployment. However, standard offline Supervised Fine-Tuning (SFT) suffers from distribution shifts and catastrophic forgetting of pre-trained capabilities, while online Reinforcement Learning (RL) struggles with sparse rewards and poor sample efficiency. In this paper, we propose On-Policy VLA Distillation (VLA-OPD), a framework bridging the efficiency of SFT with the robustness of RL. Instead of relying on sparse environmental rewards, VLA-OPD leverages an expert teacher to provide dense, token-level supervision on the student's self-generated trajectories. This enables active error correction on policy-induced states while preserving pre-trained general capabilities through gentle alignment. Crucially,…
