Policy Agnostic RL: Offline RL and Online RL Fine-Tuning of Any Class and Backbone
Max Sobol Mark, Tian Gao, Georgia Gabriela Sampaio, Mohan Kumar Srirama, Archit Sharma, Chelsea Finn, and Aviral Kumar

TL;DR
This paper introduces policy-agnostic RL (PA-RL), a versatile offline and online RL method that effectively trains and fine-tunes diverse policy architectures, including diffusion and transformer models, with improved performance and efficiency.
Contribution
PA-RL replaces traditional policy improvement with a universal supervised learning loss, enabling training of various policy classes via action optimization, and demonstrates significant performance gains.
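The idea of "action optimization followed by a supervised loss" can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper's code: the toy critic `q_value`, the sample-then-refine procedure in `optimize_action`, and the linear policy fitted by least squares are all hypothetical stand-ins. The point is only that the policy is never differentiated through the critic; it is fit to improved actions with an ordinary regression step, so any policy class with a supervised training loss could take its place.

```python
import random


def q_value(state, action):
    """Toy critic (hypothetical): the best action for a state is 2 * state."""
    return -(action - 2.0 * state) ** 2


def q_grad_action(state, action):
    """Analytic dQ/da for the toy critic above."""
    return -2.0 * (action - 2.0 * state)


def optimize_action(state, sample_action, num_samples=8, grad_steps=5, step_size=0.1):
    """Sample candidate actions, keep the best under Q, then refine it by gradient ascent on Q."""
    candidates = [sample_action(state) for _ in range(num_samples)]
    best = max(candidates, key=lambda a: q_value(state, a))
    for _ in range(grad_steps):
        best += step_size * q_grad_action(state, best)
    return best


def fit_linear_policy(states, targets):
    """Supervised step: least-squares fit of action = w * state + b."""
    n = len(states)
    mean_s = sum(states) / n
    mean_a = sum(targets) / n
    cov = sum((s - mean_s) * (a - mean_a) for s, a in zip(states, targets))
    var = sum((s - mean_s) ** 2 for s in states)
    w = cov / var
    b = mean_a - w * mean_s
    return w, b


def run(iterations=20, seed=0):
    """Alternate action optimization and supervised fitting on a fixed batch of states."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    states = [rng.uniform(-1.0, 1.0) for _ in range(64)]
    for _ in range(iterations):
        # Stochastic policy: current linear mean plus Gaussian exploration noise.
        def sample_action(s, w=w, b=b):
            return w * s + b + rng.gauss(0.0, 0.5)

        targets = [optimize_action(s, sample_action) for s in states]
        w, b = fit_linear_policy(states, targets)
    return w, b
```

On this toy problem, calling `run()` should move `w` toward 2 and `b` toward 0, since the optimized actions are pulled toward the critic's maximizer before each regression step. Swapping `fit_linear_policy` for the training loss of a diffusion or autoregressive policy would leave the rest of the loop unchanged, which is the design property the summary above describes.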
Findings
PA-RL improves performance and sample efficiency by up to 2x compared to existing offline RL and online fine-tuning methods.
Successfully fine-tuned a 7B-parameter generalist robot policy in the real world in 40 minutes.
Enables training and fine-tuning of diverse policy architectures with a unified approach.
Abstract
Recent advances in learning decision-making policies can largely be attributed to training expressive policy models, largely via imitation learning. While imitation learning discards non-expert data, reinforcement learning (RL) can still learn from suboptimal data. However, instantiating RL training of a new policy class often presents a different challenge: most deep RL machinery is co-developed with assumptions on the policy class and backbone, resulting in poor performance when the policy class changes. For instance, SAC utilizes a low-variance reparameterization policy gradient for Gaussian policies, but this is unstable for diffusion policies and intractable for autoregressive categorical policies. To address this issue, we develop an offline RL and online fine-tuning approach called policy-agnostic RL (PA-RL) that can effectively train multiple policy classes, with varying…
Taxonomy
Topics: Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning · Robot Manipulation and Learning
Methods: Dilated Convolution · Average Pooling · Convolution · 1x1 Convolution · Global Average Pooling · Balanced Selection · Switchable Atrous Convolution · Diffusion
