TD3 with Reverse KL Regularizer for Offline Reinforcement Learning from Mixed Datasets

Yuanying Cai; Chuheng Zhang; Li Zhao; Wei Shen; Xuyun Zhang; Lei Song; Jiang Bian; Tao Qin; Tieyan Liu

arXiv:2212.02125·stat.ML·December 6, 2022

TL;DR

This paper introduces a novel offline RL method using adaptively weighted reverse KL divergence as a behavior cloning regularizer, effectively balancing RL and BC signals and avoiding OOD actions in mixed datasets.

Contribution

It proposes a new approach with per-state adaptive weighting and mode-seeking reverse KL regularization to improve offline RL from mixed datasets.

Findings

01. Outperforms existing offline RL algorithms on MuJoCo tasks.

02. Effectively balances RL and BC signals with per-state weights.

03. Avoids out-of-distribution actions through mode-seeking regularization.

Abstract

We consider an offline reinforcement learning (RL) setting where the agent needs to learn from a dataset collected by rolling out multiple behavior policies. This setting poses two challenges: 1) The optimal trade-off between optimizing the RL signal and the behavior cloning (BC) signal varies across states, because the action coverage induced by the different behavior policies varies. Previous methods control only a global trade-off and therefore cannot handle this. 2) For a given state, the action distribution generated by the different behavior policies may have multiple modes. The BC regularizers in many previous methods are mean-seeking, resulting in policies that select out-of-distribution (OOD) actions in the middle of the modes. In this paper, we address both challenges by using adaptively weighted reverse Kullback-Leibler (KL) divergence as the BC regularizer based on…
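The second challenge in the abstract (mean-seeking versus mode-seeking regularizers) can be illustrated with a toy one-dimensional example. The sketch below is not the paper's method or code: it omits the adaptive per-state weighting and the RL term entirely, and the mixture centers, standard deviations, and grid sizes are all assumed values chosen for illustration. It fits the mean of a single Gaussian policy to a bimodal behavior distribution, once by minimizing the forward KL and once by minimizing the reverse KL.

```python
import numpy as np

def gaussian_logpdf(a, mean, std):
    """Log-density of N(mean, std^2) evaluated at a."""
    return -0.5 * ((a - mean) / std) ** 2 - np.log(std) - 0.5 * np.log(2.0 * np.pi)

# Assumed toy setup: two behavior policies that prefer opposite actions.
MODES = (-0.8, 0.8)   # hypothetical mode centers
STD = 0.1             # hypothetical std, shared by both modes and the policy

def behavior_logpdf(a):
    """Log-density of the equal-weight two-component behavior mixture."""
    return np.logaddexp(
        np.log(0.5) + gaussian_logpdf(a, MODES[0], STD),
        np.log(0.5) + gaussian_logpdf(a, MODES[1], STD),
    )

# Discretize the action space so both KL integrals become Riemann sums.
actions = np.linspace(-2.5, 2.5, 5001)
da = actions[1] - actions[0]
log_beta = behavior_logpdf(actions)

def kl(log_p, log_q):
    """Numerical KL(p || q) on the action grid, computed from log-densities."""
    return float(np.sum(np.exp(log_p) * (log_p - log_q)) * da)

# Grid search over the policy mean (policy std is held fixed at STD).
candidate_means = np.linspace(-1.2, 1.2, 241)
forward = [kl(log_beta, gaussian_logpdf(actions, m, STD)) for m in candidate_means]
reverse = [kl(gaussian_logpdf(actions, m, STD), log_beta) for m in candidate_means]

mu_forward = float(candidate_means[int(np.argmin(forward))])  # argmin KL(beta || pi)
mu_reverse = float(candidate_means[int(np.argmin(reverse))])  # argmin KL(pi || beta)

print("forward-KL mean:", mu_forward,
      "behavior density there:", float(np.exp(behavior_logpdf(mu_forward))))
print("reverse-KL mean:", mu_reverse,
      "behavior density there:", float(np.exp(behavior_logpdf(mu_reverse))))
```

Under these assumptions, the forward-KL fit places the policy mean between the two modes, where the behavior density is close to zero (an OOD action), while the reverse-KL fit settles on one of the modes. Because the toy problem is symmetric, which of the two modes the reverse-KL search returns is arbitrary.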

Code & Models

Repositories

yuanying-cc/td3-rkl (Official)

Taxonomy

Topics: Reinforcement Learning in Robotics · Robotic Locomotion and Control

Methods: Experience Replay · Clipped Double Q-learning · Target Policy Smoothing · Dense Connections · Adam · Twin Delayed Deep Deterministic