Penalizing Infeasible Actions and Reward Scaling in Reinforcement Learning with Offline Data

Jeonghye Kim; Yongjae Shin; Whiyoung Jung; Sunghoon Hong; Deunsol Yoon; Youngchul Sung; Kanghoon Lee; Woohyung Lim

arXiv:2507.08761 · cs.LG · August 20, 2025

TL;DR

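This paper introduces PARS, a reinforcement learning algorithm that mitigates Q-value extrapolation errors when learning from offline data. It combines reward scaling with layer normalization (RS-LN) and a penalty on infeasible actions (PA), and it outperforms state-of-the-art algorithms on the D4RL benchmark in both offline training and online fine-tuning.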

Contribution

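The paper shows that linear extrapolation of the Q-function beyond the data range is particularly problematic. To counter it, the paper combines reward scaling with layer normalization (RS-LN) and a penalization mechanism for infeasible actions (PA), so that Q-values decrease gradually outside the data range.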

Findings

01

PARS outperforms state-of-the-art algorithms on the D4RL benchmark in both offline training and online fine-tuning.

02

PARS achieves notable success in the challenging AntMaze Ultra task.

03

The approach effectively mitigates Q-value extrapolation errors in offline RL.

Abstract

Reinforcement learning with offline data suffers from Q-value extrapolation errors. To address this issue, we first demonstrate that linear extrapolation of the Q-function beyond the data range is particularly problematic. To mitigate this, we propose guiding the gradual decrease of Q-values outside the data range, which is achieved through reward scaling with layer normalization (RS-LN) and a penalization mechanism for infeasible actions (PA). By combining RS-LN and PA, we develop a new algorithm called PARS. We evaluate PARS across a range of tasks, demonstrating superior performance compared to state-of-the-art algorithms in both offline training and online fine-tuning on the D4RL benchmark, with notable success in the challenging AntMaze Ultra task.
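
The two ingredients named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes, purely for illustration, that feasible actions lie in the box [-1, 1] per dimension, that infeasible actions are sampled from a band some distance outside that box, and that the penalty is a squared error pulling Q toward a fixed low target `q_min`. All function names and parameters (`scale_reward`, `c_reward`, `gap`, `width`, `pa_loss`) are hypothetical.

```python
import random


def scale_reward(reward, c_reward):
    """Multiply the reward by a constant factor (the 'RS' part of RS-LN)."""
    return c_reward * reward


def layer_norm(x, eps=1e-5):
    """Normalize a feature vector to zero mean and unit variance.

    Plain layer normalization without a learned affine transform.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / (var + eps) ** 0.5 for v in x]


def sample_infeasible_action(dim, low=-1.0, high=1.0, gap=1.0, width=1.0, rng=random):
    """Sample an action whose every coordinate lies outside [low, high].

    Assumption: each coordinate falls in [high + gap, high + gap + width)
    or in the mirrored band below `low`, chosen with equal probability.
    """
    action = []
    for _ in range(dim):
        offset = gap + width * rng.random()
        if rng.random() < 0.5:
            action.append(high + offset)
        else:
            action.append(low - offset)
    return action


def pa_loss(q_fn, states, q_min, dim, rng=random):
    """Mean squared error pulling Q(s, a_infeasible) toward q_min.

    Assumption: q_fn(state, action) returns a scalar Q-value and q_min is a
    fixed low target chosen by the user.
    """
    total = 0.0
    for s in states:
        a = sample_infeasible_action(dim, rng=rng)
        total += (q_fn(s, a) - q_min) ** 2
    return total / len(states)
```

In a full training loop, a term like `pa_loss` would presumably be added to the usual temporal-difference loss computed on scaled rewards, with layer normalization applied inside the critic network. How the terms are weighted is left open here.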
