Reward Centering
Abhishek Naik, Yi Wan, Manan Tomar, Richard S. Sutton

TL;DR
This paper demonstrates that centering rewards by subtracting their empirical average significantly improves the performance of discounted reinforcement learning methods, especially at high discount factors, and provides methods for estimating this average in different settings.
Contribution
The paper introduces reward centering as a general technique to enhance reinforcement learning algorithms and proposes practical methods for estimating the average reward in on-policy and off-policy scenarios.
Findings
Reward centering improves performance at high discount factors.
Methods with reward centering are unaffected by reward shifts.
Reward centering benefits nearly all reinforcement learning algorithms.
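The shift sensitivity in the second finding follows from a standard identity for discounted returns, sketched here with notation assumed for illustration rather than taken from the paper: $v_\pi$ is the discounted value function under policy $\pi$, $\gamma \in [0, 1)$ is the discount factor, and $c$ is the constant added to every reward.

$$
v_\pi^{c}(s) \;=\; \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^{k}\,\bigl(R_{t+k+1} + c\bigr) \,\middle|\, S_t = s\right] \;=\; v_\pi(s) + \frac{c}{1-\gamma}
$$

The offset $c/(1-\gamma)$ is the same in every state and grows without bound as $\gamma \to 1$, so an uncentered learner must represent a large constant that carries no information for comparing states or actions. Subtracting an estimate of the average reward removes the dependence on $c$.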
Abstract
We show that discounted methods for solving continuing reinforcement learning problems can perform significantly better if they center their rewards by subtracting out the rewards' empirical average. The improvement is substantial at commonly used discount factors and increases further as the discount factor approaches one. In addition, we show that if a problem's rewards are shifted by a constant, then standard methods perform much worse, whereas methods with reward centering are unaffected. Estimating the average reward is straightforward in the on-policy setting; we propose a slightly more sophisticated method for the off-policy setting. Reward centering is a general idea, so we expect almost every reinforcement-learning algorithm to benefit by the addition of reward centering.
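As a rough illustration of the on-policy case described in the abstract, the following sketch applies tabular TD(0) to centered rewards, with the average reward tracked by a simple running mean. The toy environment (a random walk on a ring), the function name `centered_td`, and all step-size values are assumptions made for this example and are not taken from the paper.

```python
import random


def centered_td(rewards, gamma=0.9, alpha=0.1, beta=0.01,
                steps=20000, shift=0.0, seed=0):
    """Tabular TD(0) on centered rewards (illustrative sketch).

    Hypothetical toy problem: a symmetric random walk on a ring of
    len(rewards) states; entering state i yields rewards[i] + shift.
    """
    rng = random.Random(seed)
    n = len(rewards)
    v = [0.0] * n      # value estimates for the centered problem
    r_bar = 0.0        # running estimate of the average reward
    s = 0
    for _ in range(steps):
        s_next = (s + rng.choice((-1, 1))) % n
        r = rewards[s_next] + shift
        # TD error computed on the centered reward (r - r_bar).
        delta = (r - r_bar) + gamma * v[s_next] - v[s]
        v[s] += alpha * delta
        # Simple running-average update of the reward estimate.
        r_bar += beta * (r - r_bar)
        s = s_next
    return v, r_bar


if __name__ == "__main__":
    base_v, base_r = centered_td([0.0, 0.0, 0.0, 0.0, 1.0])
    shifted_v, shifted_r = centered_td([0.0, 0.0, 0.0, 0.0, 1.0], shift=10.0)
    print("average-reward estimates:", base_r, shifted_r)
    print("max value difference:",
          max(abs(a - b) for a, b in zip(base_v, shifted_v)))
```

Running both configurations with the same seed shows the intended effect in this toy setting: the average-reward estimate absorbs the constant shift, while the learned centered values end up essentially the same. This covers only the simple on-policy estimate; the off-policy method mentioned in the abstract is not reproduced here.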
Peer Reviews
Decision: Submitted to ICLR 2024
- It addresses a practical and essential problem of learning long-term optimal policies in continuing problems with discounting.
- It proposes a simple and effective technique that can be easily applied to existing algorithms without changing their core structure or adding much computational overhead.
- It provides a clear and rigorous theoretical analysis of the convergence and variance properties of Centered Q-learning in the tabular case.
- It presents comprehensive and convincing empirical
- It does not extend the theoretical analysis to the function approximation case, which is more challenging and relevant for real-world applications.
- It does not compare reward centering with similar techniques that improve discounting, such as reward scaling [1], GAE [2], etc.
- The paper abuses notation: the iteration number is $t$, and the timestep is also $t$.
- It lacks cross-comparison across RL algorithms and hyper-parameter settings (e.g., $\epsilon$ for Q-learning).
- The motivati
* The paper provides a clear motivation for the reward centering technique by analyzing the issues with standard RL methods when the discount factor is close to one.
* The authors present a comprehensive theoretical analysis of the convergence properties of the proposed Centered Q-learning algorithm.
* The empirical results demonstrate the benefits of reward centering across different domains and function approximation techniques, including tabular, linear, and non-linear methods.
* The paper di
* The paper focuses primarily on the tabular case for the theoretical analysis, and the convergence results may not directly apply to the function approximation case.
* The paper does not provide a detailed comparison of the proposed method with other state-of-the-art RL algorithms, which would help to better understand its relative performance.
- The authors develop good intuitions about the practical value of the common practice of reward centering. These intuitions are supported by some simple theoretical analysis and experiments in several domains. Since reward centering is of practical value, I believe this work to be interesting for the community.
- The paper is well-written (with some unclear points in the theoretical part; see weakness 2 below).
- The problem is relatively novel, although some works on the topic exist.
1. **Empirical results** are on simple domains. I would ask the authors whether they expect their empirical results to hold in more intricate benchmarks (e.g., some Atari domain). Have the authors experimented with these more complex domains (or other domains of similar complexity)? Indeed, I consider the paper to be mainly of an empirical nature. As a consequence, I think this point could add substantial value to the submission. Currently, I believe this to be a major weakness of t
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Behavioral and Psychological Studies
