Proximal Action Replacement for Behavior Cloning Actor-Critic in Offline Reinforcement Learning
Jinzong Dong, Wei Huang, Jianshu Zhang, Zhuo Chen, Xinzhe Yuan, Qinying Gu, Zhaohui Jiang, Nanyang Ye

TL;DR
This paper introduces Proximal Action Replacement (PAR), a method that improves offline RL by replacing suboptimal dataset actions with better ones, overcoming the performance ceiling of behavior cloning.
Contribution
The paper proposes PAR, a simple plug-and-play technique that enhances BC-regularized actor-critic methods by substituting actions based on value estimates, leading to improved offline RL performance.
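The idea of substituting actions based on value estimates can be illustrated with a minimal sketch. The rule below is an assumption made for illustration only and is not the paper's exact criterion: it swaps a dataset action for the policy's action when the critic rates the policy action higher and the two actions lie within a hypothetical distance `radius`. The function name `proximal_replace` and all of its parameters are invented here.

```python
import numpy as np

def proximal_replace(data_actions, policy_actions, q_data, q_policy, radius=0.1):
    """Hypothetical replacement rule (illustrative, not the paper's criterion).

    data_actions, policy_actions: arrays of shape (batch, action_dim).
    q_data, q_policy: critic values of shape (batch,) for each action set.
    Returns the imitation targets and a boolean mask of replaced entries.
    """
    data_actions = np.asarray(data_actions, dtype=float)
    policy_actions = np.asarray(policy_actions, dtype=float)
    # Distance between the policy's action and the dataset action.
    dist = np.linalg.norm(policy_actions - data_actions, axis=-1)
    # Replace only when the critic prefers the policy action AND it is nearby.
    better = np.asarray(q_policy) > np.asarray(q_data)
    mask = better & (dist <= radius)
    targets = np.where(mask[:, None], policy_actions, data_actions)
    return targets, mask

data_actions = [[0.0], [0.0], [0.0]]
policy_actions = [[0.05], [0.5], [0.05]]
targets, mask = proximal_replace(
    data_actions, policy_actions,
    q_data=[1.0, 1.0, 1.0], q_policy=[2.0, 2.0, 0.5], radius=0.1,
)
```

In this toy batch only the first action is replaced: the second is rated higher but lies too far from the dataset action, and the third is nearby but rated lower.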
Findings
PAR consistently improves performance across benchmarks.
TD3+BC combined with PAR approaches state-of-the-art performance.
PAR effectively breaks the imitation ceiling imposed by suboptimal data.
Abstract
Offline reinforcement learning (RL), which optimizes policies using a previously collected static dataset, is an important branch of RL. A popular and promising approach is to regularize actor-critic methods with behavior cloning (BC), which quickly yields realistic policies and mitigates bias from out-of-distribution actions, but it can impose an often-overlooked performance ceiling: when dataset actions are suboptimal, indiscriminate imitation structurally prevents the actor from fully exploiting better actions suggested by the value function, especially in later training when imitation is already dominant. We formally analyzed this limitation by investigating convergence properties of BC-regularized actor-critic optimization and verified it on a controlled continuous bandit task. To break this ceiling, we propose proximal action replacement (PAR), an easy-to-use plug-and-play…
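The performance ceiling described in the abstract can be seen in a toy one-dimensional bandit. This is a sketch under stated assumptions, not the paper's controlled task: the critic is taken to be the known quadratic Q(a) = -(a - a_star)^2, the dataset holds a single suboptimal action `a_data`, and the actor maximizes lam * Q(a) - (a - a_data)^2, an objective in the style of TD3+BC. The names `a_star`, `a_data`, and `lam` are hypothetical.

```python
def bc_regularized_optimum(a_star, a_data, lam, lr=0.05, steps=2000):
    """Gradient ascent on lam * Q(a) - (a - a_data)**2 with Q(a) = -(a - a_star)**2.

    Toy setting assumed for illustration; not the paper's bandit task.
    """
    a = float(a_data)  # start from the dataset action
    for _ in range(steps):
        # d/da [ -lam * (a - a_star)**2 - (a - a_data)**2 ]
        grad = -2.0 * lam * (a - a_star) - 2.0 * (a - a_data)
        a += lr * grad
    return a

a_star, a_data, lam = 1.0, 0.0, 1.0
a_learned = bc_regularized_optimum(a_star, a_data, lam)
# Setting the gradient to zero gives the stationary point in closed form.
closed_form = (lam * a_star + a_data) / (1.0 + lam)
```

For any finite `lam`, the stationary point is a weighted average of `a_data` and `a_star`, so in this toy setting the actor stops short of the best action whenever the dataset action is suboptimal.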
