Opinion: Towards Unified Expressive Policy Optimization for Robust Robot Learning
Haidong Huang, Haiyue Zhu, Jiayu Song, Xixin Zhao, Yaohua Zhou, Jiayi Zhang, Yuze Zhai, Xiaocong Li

TL;DR
This paper introduces UEPO, a unified generative framework for offline-to-online reinforcement learning in robotics. It uses diffusion-based techniques to address two problems: limited coverage of multimodal behaviors and distributional shift during online adaptation.
Contribution
The paper presents three components for robust policy optimization in robot learning: a multi-seed dynamics-aware diffusion policy, a dynamic divergence regularization mechanism, and a diffusion-based data augmentation module.
Findings
Achieves +5.9% absolute improvement over Uni-O4 on D4RL locomotion tasks
Achieves +12.4% over Uni-O4 on D4RL dexterous manipulation
Demonstrates strong generalization and scalability
Abstract
Offline-to-online reinforcement learning (O2O-RL) has emerged as a promising paradigm for safe and efficient robotic policy deployment but suffers from two fundamental challenges: limited coverage of multimodal behaviors and distributional shifts during online adaptation. We propose UEPO, a unified generative framework inspired by large language model pretraining and fine-tuning strategies. Our contributions are threefold: (1) a multi-seed dynamics-aware diffusion policy that efficiently captures diverse modalities without training multiple models; (2) a dynamic divergence regularization mechanism that enforces physically meaningful policy diversity; and (3) a diffusion-based data augmentation module that enhances dynamics model generalization. On the D4RL benchmark, UEPO achieves +5.9% absolute improvement over Uni-O4 on locomotion tasks and +12.4% on dexterous manipulation,…
Taxonomy
Topics: Reinforcement Learning in Robotics · Robot Manipulation and Learning · Domain Adaptation and Few-Shot Learning
