Policy Agnostic RL: Offline RL and Online RL Fine-Tuning of Any Class and Backbone

Max Sobol Mark; Tian Gao; Georgia Gabriela Sampaio; Mohan Kumar Srirama; Archit Sharma; Chelsea Finn; Aviral Kumar

arXiv:2412.06685 · cs.LG · December 10, 2024

TL;DR

This paper introduces policy-agnostic RL (PA-RL), a versatile offline and online RL method that effectively trains and fine-tunes diverse policy architectures, including diffusion and transformer models, with improved performance and efficiency.

Contribution

PA-RL replaces the traditional policy improvement step with a universal supervised learning loss applied to optimized actions. Because the policy is only ever trained with a supervised loss, the same procedure can train many policy classes, and the paper reports performance gains over existing methods.
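The idea of pairing action optimization with a supervised loss can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the paper's implementation: the critic `q_value`, its gradient `q_grad`, the helper names, the rank-then-refine procedure, and all hyperparameters are hypothetical choices made for illustration.

```python
import numpy as np

def q_value(state, action):
    """Hypothetical critic: highest when the action equals 2 * state."""
    return -np.sum((action - 2.0 * state) ** 2, axis=-1)

def q_grad(state, action):
    """Analytic gradient of the toy critic with respect to the action."""
    return -2.0 * (action - 2.0 * state)

def optimize_actions(state, candidates, keep=4, steps=10, lr=0.1):
    """Keep the highest-Q candidates, then take gradient ascent steps on Q."""
    scores = q_value(state, candidates)
    best = candidates[np.argsort(scores)[-keep:]]  # ascending order of Q
    for _ in range(steps):
        best = best + lr * q_grad(state, best)
    return best

def supervised_loss(predicted, targets):
    """Mean squared error between policy outputs and optimized actions.
    Any policy class could substitute its own supervised objective here."""
    return float(np.mean((predicted - targets) ** 2))

rng = np.random.default_rng(0)
state = np.array([0.5, -1.0])
# Stand-in for actions sampled from an arbitrary policy (assumption).
candidates = rng.normal(size=(32, 2))
targets = optimize_actions(state, candidates)
```

In this sketch the policy never needs to be differentiated through the critic: the critic is used only to produce `targets`, and the policy would then be fit to those targets with whatever supervised loss its class already supports.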

Findings

01. PA-RL doubles sample efficiency compared to existing methods.

02. PA-RL fine-tuned a 7B-parameter generalist robot policy in the real world in 40 minutes.

03. PA-RL enables training and fine-tuning of diverse policy architectures with a unified approach.

Abstract

Recent advances in learning decision-making policies can largely be attributed to training expressive policy models, largely via imitation learning. While imitation learning discards non-expert data, reinforcement learning (RL) can still learn from suboptimal data. However, instantiating RL training of a new policy class often presents a different challenge: most deep RL machinery is co-developed with assumptions on the policy class and backbone, resulting in poor performance when the policy class changes. For instance, SAC utilizes a low-variance reparameterization policy gradient for Gaussian policies, but this is unstable for diffusion policies and intractable for autoregressive categorical policies. To address this issue, we develop an offline RL and online fine-tuning approach called policy-agnostic RL (PA-RL) that can effectively train multiple policy classes, with varying…
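To make the abstract's SAC example concrete, the following is a minimal sketch of a reparameterized policy gradient for a one-dimensional Gaussian policy. The critic `q_value`, its derivative `dq_daction`, and the helper `reparam_gradients` are hypothetical and chosen so the exact gradients are easy to check by hand; this is not code from SAC or from the paper.

```python
import numpy as np

def q_value(action):
    """Hypothetical critic for one fixed state, maximized at action = 1."""
    return -(action - 1.0) ** 2

def dq_daction(action):
    """Derivative of the toy critic with respect to the action."""
    return -2.0 * (action - 1.0)

def reparam_gradients(mu, log_std, noise):
    """Estimate gradients of E[Q(a)] using a = mu + std * noise."""
    std = np.exp(log_std)
    action = mu + std * noise                    # differentiable sampling path
    dq = dq_daction(action)
    grad_mu = np.mean(dq)                        # chain rule: da/dmu = 1
    grad_log_std = np.mean(dq * std * noise)     # da/dlog_std = std * noise
    return grad_mu, grad_log_std

rng = np.random.default_rng(0)
noise = rng.standard_normal(10_000)
grad_mu, grad_log_std = reparam_gradients(mu=0.0, log_std=0.0, noise=noise)
```

For this toy critic, E[Q] = -((mu - 1)^2 + std^2), so the exact gradients at mu = 0 and log_std = 0 are 2 and -2, and the estimates should land close to those values. The estimator depends on the action being a differentiable function of the policy parameters, which is the assumption that does not carry over to autoregressive categorical policies.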

Taxonomy

Topics: Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning · Robot Manipulation and Learning

Methods: Dilated Convolution · Average Pooling · Convolution · 1x1 Convolution · Global Average Pooling · Balanced Selection · Switchable Atrous Convolution · Diffusion