TL;DR
This paper introduces EXPO, a reinforcement learning method that improves training stability and sample efficiency for expressive policies by maximizing value through an on-the-fly policy that edits sampled actions, instead of optimizing the expressive policy directly against a value function.
Contribution
The paper proposes a new algorithm, EXPO, which improves stability and efficiency in training expressive policies by combining a base policy with an action editing policy.
Findings
Achieves 2-3x better sample efficiency than prior methods.
Effective in fine-tuning pretrained policies with offline data.
Improves training stability for expressive policies such as diffusion and flow-matching policies.
Abstract
We study the problem of training and fine-tuning expressive policies with online reinforcement learning (RL) given an offline dataset. Training expressive policy classes with online RL presents a unique challenge of stable value maximization. Unlike simpler Gaussian policies commonly used in online RL, expressive policies like diffusion and flow-matching policies are parameterized by a long denoising chain, which hinders stable gradient propagation from actions to policy parameters when optimizing against some value function. Our key insight is that we can address stable value maximization by avoiding direct optimization over value with the expressive policy and instead constructing an on-the-fly RL policy to maximize Q-value. We propose Expressive Policy Optimization (EXPO), a sample-efficient online RL algorithm that utilizes an on-the-fly policy to maximize value with two parameterized…
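The idea of an on-the-fly policy built from a base policy and an action editing policy can be illustrated with a toy sketch. This is not the authors' implementation: every function below (q_value, base_policy, edit_policy, on_the_fly_action) is a hypothetical stand-in, the scalar state and action are a simplification, and the rule of taking the highest-Q action among base and edited candidates is an assumption about how the pieces might fit together. The point it shows is that value is maximized at action-selection time, so no gradient has to flow through the base policy's sampling procedure.

```python
import random


def q_value(state, action):
    # Hypothetical toy critic: Q is highest when the action equals the state.
    return -(action - state) ** 2


def base_policy(state, rng):
    # Stand-in for an expressive policy (e.g. diffusion or flow-matching).
    # Here it is just a noisy sampler; it is never differentiated through.
    return state + rng.gauss(0.0, 1.0)


def edit_policy(state, action):
    # Stand-in for a learned edit policy: a small, bounded adjustment
    # that moves the sampled action toward higher Q (assumed behavior).
    delta = max(-0.2, min(0.2, state - action))
    return action + delta


def on_the_fly_action(state, rng, num_samples=4):
    # Sample from the base policy, edit each sample, then return the
    # candidate with the highest Q-value among base and edited actions.
    candidates = []
    for _ in range(num_samples):
        action = base_policy(state, rng)
        candidates.append(action)
        candidates.append(edit_policy(state, action))
    return max(candidates, key=lambda a: q_value(state, a))


if __name__ == "__main__":
    rng = random.Random(0)
    chosen = on_the_fly_action(1.0, rng)
    print("chosen action:", chosen, "Q:", q_value(1.0, chosen))
```

In this sketch the selected action can never have a lower Q-value than the best unedited base sample, because the base samples stay in the candidate set.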