The Divergence of Reinforcement Learning Algorithms with Value-Iteration and Function Approximation
Michael Fairbank, Eduardo Alonso

TL;DR
This paper shows that several major reinforcement learning algorithms can diverge when a function approximator represents the value function and the policy is greedy, i.e. in a value-iteration setting. The affected algorithms include TD(1) and Sarsa(1), a result the authors describe as perhaps surprising.
Contribution
The paper provides new divergence examples for value-iteration-based RL algorithms with function approximation, including TD(1) and Sarsa(1), in a greedy policy scenario.
Findings
Divergence occurs for TD(1) and Sarsa(1) with greedy policies.
The Adaptive Dynamic Programming algorithms HDP, DHP, and GDHP can also diverge under value iteration.
Unlike earlier divergence examples in the literature, these examples apply under a greedy policy, i.e. in a value-iteration scenario.
Abstract
This paper gives specific divergence examples of value-iteration for several major Reinforcement Learning and Adaptive Dynamic Programming algorithms, when using a function approximator for the value function. These divergence examples differ from previous divergence examples in the literature, in that they are applicable for a greedy policy, i.e. in a "value iteration" scenario. Perhaps surprisingly, with a greedy policy, it is also possible to get divergence for the algorithms TD(1) and Sarsa(1). In addition to these divergences, we also achieve divergence for the Adaptive Dynamic Programming algorithms HDP, DHP and GDHP.
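For readers new to the topic, the following minimal sketch shows what divergence under function approximation can look like. It is not one of the paper's examples: it is the well-known two-state illustration in which a single weight w gives the first state the value w and the second state the value 2w, and TD(0) is applied repeatedly to the transition from the first state to the second with reward 0. The function name `td0_two_state` and its parameters are hypothetical, chosen only for this illustration. Unlike the paper's examples, it uses a fixed transition and no greedy policy.

```python
def td0_two_state(w0=1.0, alpha=0.1, gamma=0.9, steps=50):
    """Repeated TD(0) updates on one transition with a linear approximator.

    Assumed setup (illustrative, not from the paper):
      value(state A) = 1 * w, value(state B) = 2 * w, reward = 0.
    Each update is w <- w + alpha * (gamma * 2w - w) * 1,
    which multiplies w by (1 + alpha * (2 * gamma - 1)).
    """
    w = w0
    history = [w]
    for _ in range(steps):
        # TD error for the transition A -> B with zero reward
        td_error = 0.0 + gamma * (2.0 * w) - (1.0 * w)
        # Gradient of value(A) with respect to w is the feature value 1
        w += alpha * td_error * 1.0
        history.append(w)
    return history


if __name__ == "__main__":
    grows = td0_two_state(gamma=0.9)    # 2 * gamma > 1: |w| grows each step
    shrinks = td0_two_state(gamma=0.4)  # 2 * gamma < 1: |w| shrinks each step
    print(grows[-1], shrinks[-1])
```

In this sketch the weight grows without bound whenever gamma exceeds 0.5, because each update scales w by a factor greater than one. The paper's contribution is that comparable instability can arise even with a greedy policy, and for TD(1) and Sarsa(1).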
Taxonomy
Topics: Adaptive Dynamic Programming Control · Reinforcement Learning in Robotics · Smart Grid Energy Management
