Non-Deterministic Policy Improvement Stabilizes Approximated Reinforcement Learning

Wendelin Böhmer; Rong Guo; Klaus Obermayer

arXiv:1612.07548 · cs.AI · December 23, 2016 · 1 cites

Open Access

TL;DR

This paper shows empirically that non-deterministic policy improvement stabilizes approximated reinforcement learning methods such as LSPI by controlling the stochasticity of the policy improvement step, and that a suitable value function representation also helps to some degree. The authors expect the approach to transfer to deep RL.

Contribution

It investigates an instability linked to greedy policy improvement in approximated RL and presents non-deterministic policy improvement as a simple way to stabilize methods like LSPI, with expected applicability to deep RL.

Findings

01

Non-deterministic policy improvement stabilizes LSPI.

02

A suitable value function representation also stabilizes the solution to some degree.

03

The approach is simple and should be easily transferable to more sophisticated algorithms such as deep RL.

Abstract

This paper investigates a type of instability that is linked to the greedy policy improvement in approximated reinforcement learning. We show empirically that non-deterministic policy improvement can stabilize methods like LSPI by controlling the improvements' stochasticity. Additionally we show that a suitable representation of the value function also stabilizes the solution to some degree. The presented approach is simple and should also be easily transferable to more sophisticated algorithms like deep reinforcement learning.
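
To illustrate the kind of instability the abstract refers to, the following is a minimal sketch contrasting a greedy (deterministic) improvement step with a non-deterministic one. The abstract does not specify the exact improvement scheme, so the softmax (Boltzmann) policy, the `temperature` parameter, the function names, and the toy Q-values are all assumptions made for illustration, not the authors' method.

```python
import numpy as np

def greedy_improvement(q_values):
    """Deterministic improvement: all probability on the argmax action per state."""
    q_values = np.asarray(q_values, dtype=float)
    policy = np.zeros_like(q_values)
    policy[np.arange(q_values.shape[0]), q_values.argmax(axis=1)] = 1.0
    return policy

def softmax_improvement(q_values, temperature=1.0):
    """Non-deterministic improvement (assumed form): Boltzmann distribution over actions."""
    q_values = np.asarray(q_values, dtype=float)
    # Subtract the row-wise maximum for numerical stability.
    logits = (q_values - q_values.max(axis=1, keepdims=True)) / temperature
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

# Hypothetical Q-value estimates for one state and two actions,
# differing only by a small approximation error.
q_before = np.array([[1.00, 1.01]])
q_after = np.array([[1.01, 1.00]])

# Total absolute change in action probabilities between the two estimates.
greedy_change = np.abs(greedy_improvement(q_before) - greedy_improvement(q_after)).sum()
soft_change = np.abs(softmax_improvement(q_before) - softmax_improvement(q_after)).sum()

print("greedy policy change: ", greedy_change)
print("softmax policy change:", soft_change)
```

In this toy case a tiny change in the estimated values flips the greedy policy completely, while the softmax policy barely moves. Lowering `temperature` makes the softmax policy approach the greedy one, which is one way the stochasticity of the improvement could be controlled.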

Taxonomy

Topics: Reinforcement Learning in Robotics · Evolutionary Algorithms and Applications · Advanced Multi-Objective Optimization Algorithms