Improving On-policy Learning with Statistical Reward Accumulation
Yubin Deng, Ke Yu, Dahua Lin, Xiaoou Tang, Chen Change Loy

TL;DR
This paper augments on-policy reinforcement learning with statistics of past rewards and a new exploration mechanism, and reports improved performance on Atari and MuJoCo benchmarks, including tasks where reward signals are sparse.
Contribution
The paper combines a characterization of past reward statistics with value functions learned under multi-critic supervision, and adds an exploration mechanism called "hot-wiring" to improve on-policy learning.
Findings
Improved performance in Atari and MuJoCo benchmarks.
Effective handling of sparse reward signals.
Enhanced value function approximation with multi-critics.
Abstract
Deep reinforcement learning has obtained significant breakthroughs in recent years. Most methods in deep-RL achieve good results via the maximization of the reward signal provided by the environment, typically in the form of discounted cumulative returns. Such reward signals represent the immediate feedback of a particular action performed by an agent. However, tasks with sparse reward signals are still challenging to on-policy methods. In this paper, we introduce an effective characterization of past reward statistics (which can be seen as long-term feedback signals) to supplement this immediate reward feedback. In particular, value functions are learned with multi-critics supervision, enabling complex value functions to be more easily approximated in on-policy learning, even when the reward signals are sparse. We also introduce a novel exploration mechanism called "hot-wiring" that…
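To make the two kinds of signal in the abstract concrete, the following is a minimal sketch, not the paper's method. It computes the standard discounted cumulative return and, alongside it, a simple rolling summary of past rewards. The function names, the window size, and the choice of mean and standard deviation as the "statistics" are assumptions made for illustration; the abstract does not specify which statistics the authors use.

```python
from collections import deque
from statistics import mean, pstdev


def discounted_returns(rewards, gamma=0.99):
    """Standard discounted return G_t = r_t + gamma * G_{t+1} for each step."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of the following step.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def windowed_reward_stats(rewards, window=4):
    """Hypothetical long-term signal: mean and population std. dev. of the
    last `window` rewards seen up to and including each step."""
    history = deque(maxlen=window)  # old rewards fall out automatically
    stats = []
    for r in rewards:
        history.append(r)
        stats.append((mean(history), pstdev(history)))
    return stats


if __name__ == "__main__":
    # A sparse-reward episode: feedback arrives only at the last step.
    episode = [0.0, 0.0, 0.0, 0.0, 1.0]
    print(discounted_returns(episode, gamma=0.9))
    print(windowed_reward_stats(episode, window=3))
```

In a sparse episode like the one above, the rolling statistics stay flat for most steps, which illustrates why a summary over reward history is treated as a separate, longer-term signal rather than a replacement for the per-step return.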
Taxonomy
Topics: Reinforcement Learning in Robotics · Adaptive Dynamic Programming Control · Artificial Intelligence in Games
