Reward Centering

Abhishek Naik; Yi Wan; Manan Tomar; Richard S. Sutton

arXiv:2405.09999·cs.LG·October 31, 2024·3 cites

TL;DR

This paper shows that centering rewards by subtracting their empirical average can significantly improve discounted reinforcement learning methods on continuing problems, especially at high discount factors, and it provides methods for estimating that average in the on-policy and off-policy settings.

Contribution

The paper introduces reward centering as a general technique to enhance reinforcement learning algorithms and proposes practical methods for estimating the average reward in on-policy and off-policy scenarios.

Findings

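01. Reward centering improves performance at commonly used discount factors, and the improvement grows as the discount factor approaches one.

02. Methods with reward centering are unaffected when a problem's rewards are shifted by a constant, whereas standard methods perform much worse.

03. Reward centering is a general idea that is expected to benefit almost every reinforcement learning algorithm.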
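
The shift sensitivity in finding 02 can be motivated with a short calculation. This is a sketch under standard definitions rather than the paper's own derivation; the symbols $c$ (the constant shift) and $\bar r$ (the average reward) are notation introduced here for illustration.

```latex
% Discounted value of state s under policy \pi, with rewards R_{t+1}:
v_\pi^\gamma(s) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^{t} R_{t+1} \,\middle|\, S_0 = s\right]

% Shift every reward by a constant c:
\tilde v_\pi^\gamma(s)
  = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^{t} \bigl(R_{t+1} + c\bigr) \,\middle|\, S_0 = s\right]
  = v_\pi^\gamma(s) + \frac{c}{1-\gamma}

% The shifted problem has average reward \bar r + c, so the centered reward is unchanged:
\bigl(R_{t+1} + c\bigr) - \bigl(\bar r + c\bigr) = R_{t+1} - \bar r
```

For example, with $\gamma = 0.99$ and $c = 10$, every uncentered value moves by $10 / 0.01 = 1000$, an offset that says nothing about which states or actions are better, while the centered rewards are the same as before the shift.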

Abstract

We show that discounted methods for solving continuing reinforcement learning problems can perform significantly better if they center their rewards by subtracting out the rewards' empirical average. The improvement is substantial at commonly used discount factors and increases further as the discount factor approaches one. In addition, we show that if a problem's rewards are shifted by a constant, then standard methods perform much worse, whereas methods with reward centering are unaffected. Estimating the average reward is straightforward in the on-policy setting; we propose a slightly more sophisticated method for the off-policy setting. Reward centering is a general idea, so we expect almost every reinforcement-learning algorithm to benefit by the addition of reward centering.
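
To make the idea concrete, here is a minimal sketch of tabular TD(0) with a running estimate of the average reward subtracted from each reward. It is an illustration under stated assumptions, not the authors' implementation: the function name `td0_on_cycle`, the two-state deterministic cycle, the constant step sizes `alpha` and `beta`, and the order of the two updates are all choices made here for the example. It covers only the simple on-policy case, not the more sophisticated off-policy estimator the abstract mentions.

```python
def td0_on_cycle(rewards, gamma=0.9, alpha=0.1, beta=0.01,
                 steps=20000, shift=0.0, center=True):
    """Tabular TD(0) on a deterministic cycle 0 -> 1 -> ... -> 0.

    rewards[s] is the reward for leaving state s (hypothetical toy problem).
    If center is True, a running average of the rewards is subtracted
    from each reward before it enters the TD error.
    """
    n = len(rewards)
    v = [0.0] * n        # value estimates, one per state
    r_bar = 0.0          # running estimate of the average reward
    s = 0
    for _ in range(steps):
        s_next = (s + 1) % n
        r = rewards[s] + shift
        baseline = r_bar if center else 0.0
        # TD error computed on the (optionally) centered reward
        delta = (r - baseline) + gamma * v[s_next] - v[s]
        v[s] += alpha * delta
        # Exponential moving average of the observed rewards
        r_bar += beta * (r - r_bar)
        s = s_next
    return v, r_bar


if __name__ == "__main__":
    for center in (True, False):
        base, _ = td0_on_cycle([1.0, 3.0], center=center)
        shifted, _ = td0_on_cycle([1.0, 3.0], shift=10.0, center=center)
        gaps = [b - a for a, b in zip(base, shifted)]
        print("center =", center, "value change under shift:", gaps)
```

In this toy problem, shifting the rewards by 10 leaves the centered value estimates essentially unchanged after the initial transient, while the uncentered estimates move by about 10 / (1 - 0.9) = 100 in every state.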

Peer Reviews

Decision·Submitted to ICLR 2024

Reviewer 01 · Rating 3 (reject, not good enough) · Confidence 4

Strengths

- It addresses a practical and essential problem of learning long-term optimal policies in continuing problems with discounting.
- It proposes a simple and effective technique that can be easily applied to existing algorithms without changing their core structure or adding much computational overhead.
- It provides a clear and rigorous theoretical analysis of the convergence and variance properties of Centered Q-learning in the tabular case.
- It presents comprehensive and convincing empirical

Weaknesses

- It does not extend the theoretical analysis to the function approximation case, which is more challenging and relevant for real-world applications.
- It does not compare reward centering with similar techniques that improve discounting, such as reward scaling [1], GAE [2], etc.
- The paper abuses notation: the iteration number is $t$, and the timestep is also $t$.
- It lacks cross-comparison from the lens of RL algorithms and hyper-parameter settings (e.g., $\epsilon$ for Q-learning).
- The motivati

Reviewer 02 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

* The paper provides a clear motivation for the reward centering technique by analyzing the issues with standard RL methods when the discount factor is close to one.
* The authors present a comprehensive theoretical analysis of the convergence properties of the proposed Centered Q-learning algorithm.
* The empirical results demonstrate the benefits of reward centering across different domains and function approximation techniques, including tabular, linear, and non-linear methods.
* The paper di

Weaknesses

* The paper focuses primarily on the tabular case for the theoretical analysis, and the convergence results may not directly apply to the function approximation case.
* The paper does not provide a detailed comparison of the proposed method with other state-of-the-art RL algorithms, which would help to better understand its relative performance.

Reviewer 03 · Rating 5 (marginally below the acceptance threshold) · Confidence 3

Strengths

- The authors develop good intuitions about the practical value of the common practice of reward centering. These intuitions are supported by some simple theoretical analysis and experiments in several domains. Since reward centering is of practical value, I believe this work to be interesting for the community.
- The paper is well-written (with some unclear points in the theoretical part; see weakness 2 below).
- The problem is relatively novel, although some works on the topic exist.

Weaknesses

1. **Empirical results** are on simple domains. I would ask the authors whether they expect their empirical results to hold in more intricate benchmarks (e.g., some Atari domains). Have the authors experimented with these more complex domains (or other domains of similar complexity)? Indeed, I consider the paper to be mainly of an empirical nature. As a consequence, I think this point could add substantial value to the submission. Currently, I believe this to be a major weakness of t

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Behavioral and Psychological Studies