Hierarchical Average Reward Policy Gradient Algorithms
Akshay Dharmavaram, Matthew Riemer, Shalabh Bhatnagar

TL;DR
This paper extends hierarchical option-critic policy gradient algorithms to the average reward setting, addressing long-term credit assignment issues in reinforcement learning and providing convergence guarantees.
Contribution
The paper introduces an average-reward hierarchical option-critic framework, proves its convergence, and demonstrates its effectiveness in sparse-reward environments.
Findings
Convergence, with probability one, of the parameters of the intra-option policies, termination functions, and value functions to their corresponding optimal values.
Enhanced performance in grid-world environments with sparse rewards.
Theoretical extension of option-critic algorithms to average reward criteria.
Abstract
Option-critic learning is a general-purpose reinforcement learning (RL) framework that aims to address the issue of long-term credit assignment by leveraging temporal abstractions. However, when dealing with extended timescales, discounting future rewards can lead to incorrect credit assignments. In this work, we address this issue by extending the hierarchical option-critic policy gradient theorem to the average reward criterion. Our proposed framework aims to maximize the long-term reward obtained in the steady state of the Markov chain defined by the agent's policy. Furthermore, we use an ordinary differential equation based approach for our convergence analysis and prove that the parameters of the intra-option policies, termination functions, and value functions converge to their corresponding optimal values with probability one. Finally, we illustrate the competitive advantage…
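To make the average reward criterion in the abstract concrete, the sketch below computes the steady-state reward of a small Markov chain and the standard textbook differential TD error. It is an illustration only, not the paper's algorithm: the 3-state transition matrix `P`, the reward vector `r`, and all function names are hypothetical, and the TD error shown is the usual average-reward form rather than the hierarchical option-critic update the authors derive.

```python
def stationary_distribution(P, iters=10_000, tol=1e-12):
    """Power iteration for the distribution d satisfying d = d P.

    Assumes the chain is irreducible and aperiodic, so d is unique.
    """
    n = len(P)
    d = [1.0] + [0.0] * (n - 1)  # start with all mass on state 0
    for _ in range(iters):
        nxt = [sum(d[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, d)) < tol:
            return nxt
        d = nxt
    return d


def average_reward(P, r):
    """Long-run reward per step: rho = sum_s d(s) * r(s)."""
    d = stationary_distribution(P)
    return sum(ds * rs for ds, rs in zip(d, r))


def differential_td_error(reward, rho_hat, v_s, v_next):
    """Average-reward TD error: the running estimate rho_hat of the
    average reward is subtracted instead of discounting v_next."""
    return reward - rho_hat + v_next - v_s


# Hypothetical chain induced by some fixed policy: the agent drifts slowly
# around a 3-state cycle and is rewarded only in state 2 (sparse reward).
P = [
    [0.9, 0.1, 0.0],
    [0.0, 0.9, 0.1],
    [0.1, 0.0, 0.9],
]
r = [0.0, 0.0, 1.0]

rho = average_reward(P, r)
delta = differential_td_error(reward=1.0, rho_hat=rho, v_s=0.0, v_next=0.0)
```

Because every column of this particular `P` also sums to one, its stationary distribution is uniform, so `rho` comes out close to 1/3 regardless of how many steps separate the agent from the rewarding state. A discounted return would instead weight that reward by a power of the discount factor that depends on the delay.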
Taxonomy
Topics: Reinforcement Learning in Robotics · Smart Grid Energy Management · Traffic control and management
