TL;DR
POCO is a novel RL framework that improves policy training stability and efficiency by formulating policy improvement as a posterior inference problem, effectively scaling to large models and real-world tasks.
Contribution
The paper introduces POCO, an RL method based on posterior inference that follows an offline-to-online paradigm, enabling stable, efficient fine-tuning of expressive generative policies for robotics.
Findings
POCO prevents catastrophic policy collapse in complex tasks.
It outperforms state-of-the-art baselines across benchmarks.
It achieves a 96.7% success rate on real-world contact-rich tasks.
Abstract
Expressive generative models have advanced robotic manipulation by capturing complex, multi-modal action distributions over temporally extended trajectories. However, fine-tuning these policies via RL remains challenging due to instability and sample inefficiency. We introduce Posterior Optimization with Clipped Objective (POCO), a principled RL framework that formulates policy improvement as a posterior inference problem tailored for temporal action chunks. Through an Expectation-Maximization procedure, POCO distills a reward-weighted implicit posterior into the policy without likelihood estimation. Furthermore, POCO adopts an offline-to-online paradigm that anchors online exploration to pre-trained priors, and its model-agnostic design scales to fine-tune large VLA models without architectural modifications. Evaluations across 7 simulation benchmarks and 4 contact-rich real-world…
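To make the Expectation-Maximization idea in the abstract concrete, here is a minimal NumPy sketch of generic reward-weighted posterior distillation over action chunks. It is an illustration under stated assumptions, not the paper's algorithm: the function names `posterior_weights` and `m_step_mean`, the `temperature` parameter, and the exponential weighting are all hypothetical choices, and the sketch does not attempt to reproduce POCO's clipped objective, whose exact form is not given in this summary.

```python
import numpy as np

def posterior_weights(returns, temperature=1.0):
    """E-step (assumed form): self-normalized weights proportional to exp(return / temperature)."""
    returns = np.asarray(returns, dtype=np.float64)
    # Subtract the max before exponentiating for numerical stability.
    logits = (returns - returns.max()) / temperature
    w = np.exp(logits)
    return w / w.sum()

def m_step_mean(action_chunks, weights):
    """M-step (toy version): weighted average of sampled action chunks.

    action_chunks has shape (N, H, A): N sampled chunks, horizon H, action dim A.
    A real generative policy would instead minimize a weighted training loss
    on these samples; the weighted mean is only the simplest stand-in.
    """
    chunks = np.asarray(action_chunks, dtype=np.float64)
    return np.tensordot(weights, chunks, axes=1)  # shape (H, A)

# Usage: three sampled chunks, where higher-return chunks pull the target harder.
rng = np.random.default_rng(0)
chunks = rng.normal(size=(3, 4, 2))
w = posterior_weights([1.0, 2.0, 3.0], temperature=0.5)
target = m_step_mean(chunks, w)
```

Because the weights depend only on sampled returns, this style of update never evaluates the policy's likelihood of an action, which is the property the abstract highlights for expressive generative policies.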
