Potential-Based Advice for Stochastic Policy Learning
Baicen Xiao, Bhaskar Ramasubramanian, Andrew Clark, Hannaneh Hajishirzi, Linda Bushnell, Radha Poovendran

TL;DR
This paper introduces a potential-based reward shaping method for stochastic policy learning in reinforcement learning, which preserves the optimality of stochastic policies and accelerates learning in a puddle-jump grid world and the continuous mountain car environment.
Contribution
It presents a novel potential-based advice scheme compatible with stochastic policies and policy gradient methods, with convergence guarantees and empirical validation.
Findings
Faster learning of stochastic optimal policies.
Higher average rewards in tested environments.
Preservation of policy optimality with reward shaping.
Abstract
This paper augments the reward received by a reinforcement learning agent with potential functions in order to help the agent learn (possibly stochastic) optimal policies. We show that a potential-based reward shaping scheme is able to preserve optimality of stochastic policies, and demonstrate that the ability of an agent to learn an optimal policy is not affected when this scheme is augmented to soft Q-learning. We propose a method to impart potential-based advice schemes to policy gradient algorithms. An algorithm that considers an advantage actor-critic architecture augmented with this scheme is proposed, and we give guarantees on its convergence. Finally, we evaluate our approach on a puddle-jump grid world with indistinguishable states, and the continuous state and action mountain car environment from classical control. Our results indicate that these schemes allow the agent to…
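To make the idea of augmenting rewards with a potential function concrete, here is a minimal sketch of the classical state-only form of potential-based shaping, F(s, s') = γΦ(s') − Φ(s). It is not the paper's advice scheme or algorithm: the potential `potential`, the discount `GAMMA`, the toy trajectory, and the reward values are all assumptions made up for illustration. The sketch checks that the shaping terms telescope, so the shaped return differs from the original return only by a quantity fixed by the start and end states.

```python
import math

GAMMA = 0.9  # discount factor (assumed value, for illustration only)


def potential(state: int) -> float:
    """Hypothetical potential: negative distance to a goal at state 4."""
    return -abs(4 - state)


def shaping_term(state: int, next_state: int, gamma: float = GAMMA) -> float:
    """Classical state-only shaping term F(s, s') = gamma * Phi(s') - Phi(s)."""
    return gamma * potential(next_state) - potential(state)


def discounted_return(rewards, gamma: float = GAMMA) -> float:
    """Sum of gamma^t * r_t over a finite trajectory."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


# Toy trajectory on a 1-D chain (assumed data): visited states and the
# environment reward received on each transition.
trajectory = [0, 1, 2, 1, 2, 3, 4]
env_rewards = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]

# Add the shaping term to the environment reward on every transition.
shaped_rewards = [
    r + shaping_term(s, s_next)
    for r, s, s_next in zip(env_rewards, trajectory, trajectory[1:])
]

plain = discounted_return(env_rewards)
shaped = discounted_return(shaped_rewards)

# The shaping terms telescope to gamma^T * Phi(s_T) - Phi(s_0).
offset = GAMMA ** len(env_rewards) * potential(trajectory[-1]) - potential(trajectory[0])
assert math.isclose(shaped, plain + offset)
```

Because the offset does not depend on which actions were taken along the way, shaping of this form changes how dense the learning signal is without changing which trajectories are preferred between the same endpoints. The paper's contribution concerns extending this kind of guarantee to stochastic policies and policy gradient methods, which this sketch does not attempt.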
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Transportation and Mobility Innovations
