Non-Deterministic Policy Improvement Stabilizes Approximated Reinforcement Learning
Wendelin Böhmer, Rong Guo, Klaus Obermayer

TL;DR
This paper shows empirically that non-deterministic policy improvement stabilizes approximated reinforcement learning methods such as LSPI by controlling the stochasticity of the improvement step. A suitable value function representation also stabilizes the solution to some degree, and the authors expect the approach to transfer to deep RL.
Contribution
It introduces non-deterministic policy improvement as a simple technique to stabilize approximate RL algorithms, including LSPI and potentially deep RL.
Findings
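Non-deterministic policy improvement empirically stabilizes methods like LSPI by controlling the stochasticity of the improvement.
A suitable representation of the value function also stabilizes the solution to some degree.
The approach is simple and should be easily transferable to more sophisticated algorithms such as deep RL.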
Abstract
This paper investigates a type of instability that is linked to the greedy policy improvement in approximated reinforcement learning. We show empirically that non-deterministic policy improvement can stabilize methods like LSPI by controlling the improvements' stochasticity. Additionally, we show that a suitable representation of the value function also stabilizes the solution to some degree. The presented approach is simple and should also be easily transferable to more sophisticated algorithms like deep reinforcement learning.
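To make the contrast in the abstract concrete, the sketch below compares a greedy improvement step with one possible non-deterministic step. The abstract does not specify the improvement operator the authors use, so the softmax (Boltzmann) rule, the `temperature` parameter, and both function names are assumptions chosen for illustration only, not the paper's method.

```python
import numpy as np

def greedy_improvement(q_values):
    """Deterministic improvement: all probability mass on the argmax action."""
    q = np.asarray(q_values, dtype=float)
    policy = np.zeros_like(q)
    policy[np.arange(q.shape[0]), q.argmax(axis=1)] = 1.0
    return policy

def softmax_improvement(q_values, temperature=1.0):
    """Hypothetical non-deterministic improvement: softmax over Q-values.

    `temperature` (an assumed parameter) controls the stochasticity:
    small values approach the greedy policy, large values approach uniform.
    """
    q = np.asarray(q_values, dtype=float)
    # Subtract the row-wise maximum for numerical stability.
    z = (q - q.max(axis=1, keepdims=True)) / temperature
    weights = np.exp(z)
    return weights / weights.sum(axis=1, keepdims=True)

# Two Q-value estimates for one state that differ only by a small
# approximation error, which swaps the order of the two actions.
q_a = np.array([[1.00, 1.01]])
q_b = np.array([[1.01, 1.00]])

# L1 distance between the improved policies under each rule.
greedy_change = np.abs(greedy_improvement(q_a) - greedy_improvement(q_b)).sum()
soft_change = np.abs(
    softmax_improvement(q_a, temperature=0.5)
    - softmax_improvement(q_b, temperature=0.5)
).sum()
print(greedy_change, soft_change)
```

In this toy case the greedy policy switches actions entirely under a tiny change in the estimate, while the softmax policy barely moves. That sensitivity is one plausible reading of the instability the abstract links to greedy improvement.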
Taxonomy
Topics: Reinforcement Learning in Robotics · Evolutionary Algorithms and Applications · Advanced Multi-Objective Optimization Algorithms
