OGPO: Sample Efficient Full-Finetuning of Generative Control Policies

Sarvesh Patil; Mitsuhiko Nakamoto; Manan Agarwal; Shashwat Saxena; Jesse Zhang; Giri Anantharaman; Cleah Winston; Chaoyi Pan; Douglas Chen; Nai-Chieh Huang; Zeynep Temel; Oliver Kroemer; Sergey Levine; Abhishek Gupta; Hongkai Da; Paarth Shah; Max Simchowitz

arXiv:2605.03065 · cs.LG · May 6, 2026

TL;DR

OGPO is a sample-efficient off-policy algorithm that fine-tunes generative control policies for robot learning, achieving state-of-the-art results on manipulation tasks with little task-specific hyperparameter tuning.

Contribution

Introduces OGPO, an off-policy finetuning method for generative control policies (GCPs) that outperforms existing approaches on manipulation tasks and can improve poorly initialized policies without expert data in the replay buffer.

Findings

01

OGPO achieves state-of-the-art performance on manipulation tasks.

02

OGPO can fine-tune policies with no expert data in the replay buffer.

03

Proposed stabilizers improve training stability across settings.

Abstract

Generative control policies (GCPs), such as diffusion- and flow-based control policies, have emerged as effective parameterizations for robot learning. This work introduces Off-policy Generative Policy Optimization (OGPO), a sample-efficient algorithm for finetuning GCPs. OGPO maintains off-policy critic networks to maximize data reuse, and it propagates policy gradients through the full generative process of the policy via a modified PPO objective that uses the critics as the terminal reward. OGPO achieves state-of-the-art performance on manipulation tasks spanning multi-task settings, high-precision insertion, and dexterous control. To our knowledge, it is also the only method that can fine-tune poorly initialized behavior cloning policies to near-full task success with no expert data in the online replay buffer, and it does so with little task-specific hyperparameter tuning. Through extensive empirical…
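One way to read the objective described in the abstract is as a PPO-style clipped surrogate applied to every denoising step of the policy, with the critic's value of the final action as the only reward. The sketch below is a minimal NumPy illustration under that assumption, not the paper's implementation: the function names, the Gaussian denoising transitions, the batch-mean baseline, and the `clip_eps` default are all hypothetical choices made for this example.

```python
import numpy as np


def gaussian_logp(x, mean, std):
    """Log-density of an isotropic Gaussian, summed over the last axis.

    Assumption: each denoising step is a Gaussian transition, so its
    log-probability can be evaluated in closed form.
    """
    z = (x - mean) / std
    return np.sum(-0.5 * z**2 - np.log(std) - 0.5 * np.log(2.0 * np.pi), axis=-1)


def clipped_surrogate(logp_new, logp_old, q_terminal, clip_eps=0.2):
    """PPO-style clipped loss over the denoising steps of a generative policy.

    logp_new, logp_old: arrays of shape (batch, K) holding per-step
        log-probabilities of the same sampled denoising chain under the
        current policy and the policy that generated the chain.
    q_terminal: array of shape (batch,) with the critic's value of the final
        action; it is treated as the only (terminal) reward.
    """
    # Hypothetical baseline: subtract the batch mean of the critic values.
    adv = q_terminal - q_terminal.mean()
    # Every denoising step of a chain shares that chain's advantage.
    adv = adv[:, None]
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    # Negate because optimizers minimize, while the surrogate is maximized.
    return -np.mean(np.minimum(unclipped, clipped))
```

In this sketch, `logp_new` and `logp_old` would come from evaluating something like `gaussian_logp` at each of the K denoising steps. In practice the loss would be differentiated with an autodiff library rather than evaluated in NumPy, and `q_terminal` would come from a critic trained off-policy on replay data.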
