Adaptive Advantage-Guided Policy Regularization for Offline Reinforcement Learning
Tenglong Liu, Yang Li, Yixing Lan, Hao Gao, Wei Pan, Xin Xu

TL;DR
This paper introduces A2PR, a novel offline reinforcement learning method that adaptively guides policy regularization using advantage estimates and VAE-generated actions, improving performance on suboptimal datasets.
Contribution
A2PR is the first method to adaptively select high-advantage actions for policy regularization, balancing conservatism and policy improvement in offline RL.
Findings
Achieves state-of-the-art results on D4RL benchmarks.
Effectively mitigates value overestimation issues.
Performs well on suboptimal mixed datasets.
Abstract
In offline reinforcement learning, the challenge posed by out-of-distribution (OOD) actions is pronounced. To address this, existing methods often constrain the learned policy through policy regularization. However, these methods often suffer from unnecessary conservativeness, which hampers policy improvement. This occurs because they indiscriminately use all actions from the behavior policy that generated the offline dataset as constraints. The problem becomes particularly noticeable when the quality of the dataset is suboptimal. Thus, we propose Adaptive Advantage-guided Policy Regularization (A2PR), which obtains high-advantage actions from an augmented behavior policy combined with a VAE to guide the learned policy. A2PR can select high-advantage actions that differ from those present in the dataset, while still effectively maintaining conservatism toward OOD actions. This is achieved by…
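As a rough illustration of the selection idea described in the abstract, the sketch below compares a dataset action with a VAE-generated candidate for each state, keeps whichever has the higher advantage A = Q - V, and penalizes the policy's distance to that action. Everything named here is an assumption made for illustration and not the authors' implementation: the function names, the positive-advantage mask, the squared-error penalty, and the use of precomputed Q and V arrays in place of learned critics.

```python
import numpy as np

def select_guidance_actions(q_data, q_vae, v, a_data, a_vae):
    """Per state, keep the candidate action with the larger advantage A = Q - V.

    q_data, q_vae, v: shape (batch,) value estimates (assumed given).
    a_data, a_vae:    shape (batch, action_dim) candidate actions.
    """
    adv_data = q_data - v
    adv_vae = q_vae - v
    use_vae = adv_vae > adv_data
    chosen = np.where(use_vae[:, None], a_vae, a_data)
    chosen_adv = np.where(use_vae, adv_vae, adv_data)
    # Assumption: only regularize toward actions that beat the state value.
    mask = chosen_adv > 0
    return chosen, mask

def regularization_loss(policy_actions, chosen, mask):
    """Mean squared distance to the chosen actions over the masked states."""
    sq_err = np.sum((policy_actions - chosen) ** 2, axis=1)
    if not mask.any():
        return 0.0
    return float(sq_err[mask].mean())

# Toy batch of three states with two-dimensional actions.
q_data = np.array([1.0, 0.5, 2.0])
q_vae = np.array([1.5, 0.2, 1.0])
v = np.array([1.2, 0.6, 1.5])
a_data = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
a_vae = np.array([[9.0, 9.0], [8.0, 8.0], [7.0, 7.0]])
policy_actions = np.array([[9.0, 9.0], [0.0, 0.0], [2.0, 3.0]])

chosen, mask = select_guidance_actions(q_data, q_vae, v, a_data, a_vae)
loss = regularization_loss(policy_actions, chosen, mask)
```

In the toy batch, the VAE candidate is kept only for the first state, and the second state is excluded from the penalty because neither candidate has positive advantage there. In a full actor-critic setup, a term of this kind would typically be added to the usual Q-maximizing actor objective with a weighting coefficient.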
Taxonomy
Topics: Adaptive Dynamic Programming Control · Elevator Systems and Control · Reinforcement Learning in Robotics
