TD3 with Reverse KL Regularizer for Offline Reinforcement Learning from Mixed Datasets
Yuanying Cai, Chuheng Zhang, Li Zhao, Wei Shen, Xuyun Zhang, Lei Song, Jiang Bian, Tao Qin, Tieyan Liu

TL;DR
This paper introduces a novel offline RL method using adaptively weighted reverse KL divergence as a behavior cloning regularizer, effectively balancing RL and BC signals and avoiding OOD actions in mixed datasets.
Contribution
It proposes a new approach with per-state adaptive weighting and mode-seeking reverse KL regularization to improve offline RL from mixed datasets.
Findings
Outperforms existing offline RL algorithms on MuJoCo tasks.
Effectively balances RL and BC signals with per-state weights.
Avoids out-of-distribution actions through mode-seeking regularization.
Abstract
We consider an offline reinforcement learning (RL) setting where the agent needs to learn from a dataset collected by rolling out multiple behavior policies. This setting poses two challenges: 1) The optimal trade-off between optimizing the RL signal and the behavior cloning (BC) signal varies across states, because the action coverage induced by the different behavior policies varies. Previous methods fail to handle this because they control only a global trade-off. 2) For a given state, the action distribution generated by different behavior policies may have multiple modes. The BC regularizers in many previous methods are mean-seeking, resulting in policies that select out-of-distribution (OOD) actions in the middle of the modes. In this paper, we address both challenges by using adaptively weighted reverse Kullback-Leibler (KL) divergence as the BC regularizer based on…
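To make the mean-seeking versus mode-seeking distinction concrete, here is a minimal numerical sketch. It is not the paper's algorithm. It assumes a hypothetical one-dimensional bimodal behavior distribution with modes at -1 and +1, and fits a single Gaussian to it by grid search under the forward KL and under the reverse KL. All names (`log_normal`, `grid_argmin`, the grid ranges) are illustrative choices, not taken from the paper.

```python
import numpy as np

def log_normal(a, mu, sigma):
    """Log-density of a 1-D Gaussian N(mu, sigma^2) evaluated at a."""
    return -0.5 * ((a - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

# Action grid used for numerical integration (assumed range, wide enough for p).
a = np.linspace(-4.0, 4.0, 4001)
da = a[1] - a[0]

# Hypothetical mixed-dataset action distribution: two behavior modes at -1 and +1.
log_p = np.log(0.5) + np.logaddexp(log_normal(a, -1.0, 0.2),
                                   log_normal(a, 1.0, 0.2))
p = np.exp(log_p)

def forward_kl(mu, sigma):
    """KL(p || q): expectation under the data distribution p (mean-seeking)."""
    return np.sum(p * (log_p - log_normal(a, mu, sigma))) * da

def reverse_kl(mu, sigma):
    """KL(q || p): expectation under the fitted policy q (mode-seeking)."""
    log_q = log_normal(a, mu, sigma)
    return np.sum(np.exp(log_q) * (log_q - log_p)) * da

def grid_argmin(objective):
    """Brute-force search over a small (mu, sigma) grid."""
    candidates = [(mu, sigma)
                  for mu in np.linspace(-2.0, 2.0, 81)
                  for sigma in np.linspace(0.1, 1.5, 29)]
    return min(candidates, key=lambda ms: objective(*ms))

mu_fwd, sigma_fwd = grid_argmin(forward_kl)
mu_rev, sigma_rev = grid_argmin(reverse_kl)

print(f"forward KL fit: mu={mu_fwd:.2f}, sigma={sigma_fwd:.2f}")
print(f"reverse KL fit: mu={mu_rev:.2f}, sigma={sigma_rev:.2f}")
```

Under these assumptions, the forward-KL fit centers between the two modes, where the data has almost no mass, while the reverse-KL fit collapses onto one of the modes. Which of the two symmetric modes the reverse-KL search returns depends on tie-breaking in the grid search.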
Taxonomy
Topics: Reinforcement Learning in Robotics · Robotic Locomotion and Control
Methods: Experience Replay · Clipped Double Q-learning · Target Policy Smoothing · Dense Connections · Adam · Twin Delayed Deep Deterministic
