Taming the Noise in Reinforcement Learning via Soft Updates
Roy Fox, Ari Pakman, Naftali Tishby

TL;DR
G-learning is an off-policy reinforcement learning algorithm that reduces the bias of value estimates and speeds up convergence in noisy environments by penalizing deterministic policies early in learning. It can also incorporate prior domain knowledge when available.
Contribution
The paper introduces G-learning, a new off-policy algorithm that penalizes deterministic policies early on, reducing bias and improving learning speed in noisy settings.
Findings
G-learning converges to the optimal value and the optimal policy faster than Q-learning in noisy environments.
It incorporates prior domain knowledge naturally, when such knowledge is available.
Its stochastic nature lets it avoid some exploration costs.
Abstract
Model-free reinforcement learning algorithms, such as Q-learning, perform poorly in the early stages of learning in noisy environments, because much effort is spent unlearning biased estimates of the state-action value function. The bias results from selecting, among several noisy estimates, the apparent optimum, which may actually be suboptimal. We propose G-learning, a new off-policy learning algorithm that regularizes the value estimates by penalizing deterministic policies in the beginning of the learning process. We show that this method reduces the bias of the value-function estimation, leading to faster convergence to the optimal value and the optimal policy. Moreover, G-learning enables the natural incorporation of prior domain knowledge, when available. The stochastic nature of G-learning also makes it avoid some exploration costs, a property usually attributed only to…
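The selection bias described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' implementation. It assumes a cost-minimization convention, a uniform prior over actions, and true costs that are all zero, so any nonzero average is pure estimation bias. The names `hard_min`, `soft_min`, `average_bias`, and the parameter `beta` are illustrative choices, and the log-sum-exp operator is one common way to soften a minimum, used here only in the spirit of penalizing deterministic choices.

```python
import math
import random


def hard_min(estimates):
    """Pick the apparent optimum among noisy cost estimates."""
    return min(estimates)


def soft_min(estimates, beta, prior=None):
    """Prior-weighted soft minimum: -(1/beta) * log(sum_a prior[a] * exp(-beta * x[a])).

    Small beta gives roughly the prior-weighted mean; large beta approaches
    the hard minimum. (Illustrative operator, assumed for this sketch.)
    """
    n = len(estimates)
    if prior is None:
        prior = [1.0 / n] * n  # assumption: uniform prior over actions
    m = min(estimates)  # shift by the minimum for numerical stability
    total = sum(p * math.exp(-beta * (x - m)) for p, x in zip(prior, estimates))
    return m - math.log(total) / beta


def average_bias(operator, n_actions=10, noise_std=1.0, trials=5000, seed=0):
    """Average operator output when every true cost is 0, so the true optimum is 0."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        noisy = [rng.gauss(0.0, noise_std) for _ in range(n_actions)]
        total += operator(noisy)
    return total / trials


if __name__ == "__main__":
    print("hard-min bias:", average_bias(hard_min))
    print("soft-min bias:", average_bias(lambda xs: soft_min(xs, beta=0.5)))
```

In this toy setting the hard minimum is systematically too optimistic, while the soft minimum with a small `beta` stays much closer to the true value of zero. Increasing `beta` over the course of learning would move the soft operator toward the hard one, which matches the abstract's description of penalizing deterministic policies only in the beginning.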
Taxonomy
Topics: Reinforcement Learning in Robotics · Machine Learning and Algorithms · Advanced Bandit Algorithms Research
