Penalizing Infeasible Actions and Reward Scaling in Reinforcement Learning with Offline Data
Jeonghye Kim, Yongjae Shin, Whiyoung Jung, Sunghoon Hong, Deunsol Yoon, Youngchul Sung, Kanghoon Lee, Woohyung Lim

TL;DR
This paper introduces PARS, an offline reinforcement learning algorithm that reduces Q-value extrapolation errors by combining reward scaling with layer normalization (RS-LN) and penalization of infeasible actions (PA), and that outperforms state-of-the-art algorithms on the D4RL benchmark.
Contribution
The paper proposes a novel combination of reward scaling with layer normalization and penalization for infeasible actions to address Q-value extrapolation errors in offline reinforcement learning.
Findings
PARS outperforms state-of-the-art algorithms on D4RL benchmarks.
PARS achieves notable success in the challenging AntMaze Ultra task.
The approach effectively mitigates Q-value extrapolation errors in offline RL.
Abstract
Reinforcement learning with offline data suffers from Q-value extrapolation errors. To address this issue, we first demonstrate that linear extrapolation of the Q-function beyond the data range is particularly problematic. To mitigate this, we propose guiding the gradual decrease of Q-values outside the data range, which is achieved through reward scaling with layer normalization (RS-LN) and a penalization mechanism for infeasible actions (PA). By combining RS-LN and PA, we develop a new algorithm called PARS. We evaluate PARS across a range of tasks, demonstrating superior performance compared to state-of-the-art algorithms in both offline training and online fine-tuning on the D4RL benchmark, with notable success in the challenging AntMaze Ultra task.
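To make the two ingredients named in the abstract more concrete, the following is a minimal pure-Python sketch of what reward scaling with layer normalization and a penalty on infeasible actions might look like. It is not the authors' implementation: the function names, the `reward_scale`, `margin`, and `q_min` parameters, the action bounds of [-1, 1], and the choice of a squared-error penalty are all assumptions made for illustration.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    """Normalize a feature vector to zero mean and unit variance (no learned gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def scaled_td_target(reward, next_q, done, reward_scale=10.0, gamma=0.99):
    """One-step TD target built from a scaled reward.

    reward_scale is a hypothetical hyperparameter; the abstract does not give its value.
    """
    bootstrap = 0.0 if done else next_q
    return reward_scale * reward + gamma * bootstrap


def sample_infeasible_action(dim, low=-1.0, high=1.0, margin=1.0, rng=random):
    """Sample an action whose every coordinate lies outside [low, high].

    Assumption: the feasible action range is a box, and infeasible actions are
    drawn within `margin` of its boundary. The paper may sample differently.
    """
    action = []
    for _ in range(dim):
        offset = rng.uniform(0.0, margin) + 1e-6  # keeps the value strictly outside the box
        action.append(high + offset if rng.random() < 0.5 else low - offset)
    return action


def infeasible_penalty(q_values, q_min):
    """Mean squared error pulling Q-values of infeasible actions toward a low target q_min."""
    return sum((q - q_min) ** 2 for q in q_values) / len(q_values)


if __name__ == "__main__":
    rng = random.Random(0)
    features = layer_norm([1.0, 2.0, 3.0])          # hidden features after normalization
    target = scaled_td_target(1.0, 5.0, done=False)  # scaled-reward TD target
    bad_action = sample_infeasible_action(4, rng=rng)
    penalty = infeasible_penalty([0.0, 2.0], q_min=-1.0)
    print(features, target, bad_action, penalty)
```

In a full critic update, a term like `infeasible_penalty` would presumably be added to the usual TD loss so that Q-values decrease outside the feasible action range rather than extrapolating linearly, which is the behavior the abstract describes as the goal.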
