Efficient Entropy for Policy Gradient with Multidimensional Action Space
Yiming Zhang, Quan Ho Vuong, Kenny Song, Xiao-Yue Gong, Keith W. Ross

TL;DR
This paper introduces novel unbiased estimators for the entropy bonus and its gradient in high-dimensional discrete action spaces, with the aim of improving exploration in policy gradient reinforcement learning at little extra computational cost.
Contribution
The paper develops new entropy estimators suited to high-dimensional discrete action spaces and applies them to several parameterized policy models, including Independent Sampling, CommNet, and Autoregressive with Modified MDP, to make exploration more efficient.
Findings
Entropy estimators significantly boost policy performance.
The methods are computationally efficient, adding only marginal overhead.
The estimators are effective in multi-agent and complex environments.
Abstract
In recent years, deep reinforcement learning has been shown to be adept at solving sequential decision processes with high-dimensional state spaces, such as the Atari games. Many reinforcement learning problems, however, involve high-dimensional discrete action spaces as well as high-dimensional state spaces. This paper considers the entropy bonus, which is used to encourage exploration in policy gradient methods. In the case of high-dimensional action spaces, calculating the entropy and its gradient requires enumerating all the actions in the action space and running forward propagation and backpropagation for each action, which may be computationally infeasible. We develop several novel unbiased estimators for the entropy bonus and its gradient. We apply these estimators to several models for the parameterized policies, including Independent Sampling, CommNet, Autoregressive with Modified MDP, and…
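To make the cost argument in the abstract concrete, the sketch below compares three ways of obtaining the entropy of a toy policy over a multidimensional discrete action space. It relies only on the standard identity that entropy is the expectation of -log π(a) under a ~ π, so averaging -log π(a) over sampled actions is unbiased. It is an illustration under stated assumptions, not a reproduction of the paper's estimators: the policy is assumed to factor into independent per-dimension distributions, and all function names and the toy probabilities are hypothetical.

```python
import itertools
import math
import random


def exact_entropy_by_enumeration(marginals):
    """Entropy computed by visiting every joint action (cost grows as K^d)."""
    total = 0.0
    for joint in itertools.product(*[range(len(p)) for p in marginals]):
        prob = math.prod(p[a] for p, a in zip(marginals, joint))
        if prob > 0.0:
            total -= prob * math.log(prob)
    return total


def entropy_sum_of_marginals(marginals):
    """For independent dimensions, joint entropy is the sum of per-dimension entropies."""
    return sum(-sum(q * math.log(q) for q in p if q > 0.0) for p in marginals)


def sampled_entropy_estimate(marginals, num_samples, rng):
    """Monte Carlo estimate: the average of -log pi(a) over sampled joint actions."""
    total = 0.0
    for _ in range(num_samples):
        log_prob = 0.0
        for p in marginals:
            # Sample one sub-action per dimension from its own distribution.
            a = rng.choices(range(len(p)), weights=p)[0]
            log_prob += math.log(p[a])
        total -= log_prob
    return total / num_samples


# Hypothetical toy policy: three action dimensions with 3, 2, and 4 choices.
marginals = [[0.7, 0.2, 0.1], [0.5, 0.5], [0.25, 0.25, 0.25, 0.25]]
rng = random.Random(0)

exact = exact_entropy_by_enumeration(marginals)
factored = entropy_sum_of_marginals(marginals)
estimate = sampled_entropy_estimate(marginals, 20000, rng)

print(f"enumeration: {exact:.4f}")
print(f"sum of marginals: {factored:.4f}")
print(f"sampled estimate: {estimate:.4f}")
```

Enumeration visits 3 × 2 × 4 = 24 joint actions here, but that count grows exponentially with the number of dimensions, whereas the sampled estimate only needs the log-probability of the actions actually drawn. The sum-of-marginals shortcut holds only under the independence assumption made in this sketch.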
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Data Stream Mining Techniques
Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory
