Analyzing and Bridging the Gap between Maximizing Total Reward and Discounted Reward in Deep Reinforcement Learning
Shuyu Yin, Fei Wen, Peilin Liu, Tao Luo

TL;DR
This paper analyzes the performance gap between maximizing total reward and discounted reward in deep reinforcement learning, proposing methods to align these objectives for improved policy optimization.
Contribution
It provides a theoretical analysis of the reward gap and introduces two novel approaches to align total and discounted rewards in deep RL.
Findings
Increasing the discount factor may not eliminate the reward gap in cyclic environments.
Modifying terminal state values can help align total and discounted rewards.
Calibrating reward data improves robustness and performance in off-policy deep RL.
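To make the objective mismatch concrete, here is a minimal toy sketch in Python. It is not taken from the paper: the two reward sequences, the helper names, and the `terminal_value` parameter are all hypothetical choices for illustration. It compares an episode that ends immediately with a small reward against a longer episode with a larger delayed reward, and shows that the two objectives can rank them differently.

```python
# Toy illustration only; the episodes and parameter values are assumptions,
# not the paper's environments or algorithm.

def total_return(rewards):
    """Undiscounted sum of rewards over one episode."""
    return sum(rewards)

def discounted_return(rewards, gamma, terminal_value=0.0):
    """Discounted sum of rewards, plus a discounted value assigned to the
    terminal state (conventionally 0; treated here as a free parameter)."""
    g = sum((gamma ** t) * r for t, r in enumerate(rewards))
    return g + (gamma ** len(rewards)) * terminal_value

short_path = [1.0]                # terminate at once with reward 1.0
long_path = [0.0] * 9 + [1.5]     # wait nine steps, then collect 1.5

# Total return prefers the long path.
print(total_return(short_path), total_return(long_path))

# With gamma = 0.9 the discounted objective prefers the short path instead.
print(discounted_return(short_path, 0.9), discounted_return(long_path, 0.9))

# Changing the terminal value (here to -1.0) restores the total-return ordering.
print(discounted_return(short_path, 0.9, terminal_value=-1.0),
      discounted_return(long_path, 0.9, terminal_value=-1.0))
```

In this acyclic toy case, raising the discount factor to 0.99 also restores the ordering; the first finding above concerns cyclic environments, where that remedy may not work. The terminal-value line is only one loose reading of the second finding, not the paper's stated procedure.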
Abstract
The optimal objective is a fundamental aspect of reinforcement learning (RL), as it determines how policies are evaluated and optimized. While total return maximization is the ideal objective in RL, discounted return maximization is the practical objective due to its stability. This can lead to a misalignment of objectives. To better understand the problem, we theoretically analyze the performance gap between the policy that maximizes the total return and the policy that maximizes the discounted return. Our analysis reveals that increasing the discount factor can be ineffective at eliminating this gap when the environment contains cyclic states, a frequent scenario. To address this issue, we propose two alternative approaches to align the objectives. The first approach achieves alignment by modifying the terminal state value, treating it as a tunable hyper-parameter with its suitable range defined…
Taxonomy
Topics: EEG and Brain-Computer Interfaces
Methods: ALIGN
