TL;DR
This paper proposes a reinforcement learning framework in which the agent's action is the parameter vector of an action distribution. The resulting action space is continuous whether the original actions are discrete, continuous, or hybrid, and the paper builds new algorithms on it with promising empirical results.
Contribution
It reparameterizes the action space so that distribution parameters serve as actions, and pairs this with a generalized deterministic policy gradient estimator (DA-PG) and a critic learning strategy (Interpolated Critic Learning), covering discrete, continuous, and hybrid action types.
Findings
DA-AC achieves competitive performance across different control settings.
The new gradient estimator (DA-PG) has lower variance than the gradient in the original action space.
Interpolated Critic Learning improves stability in training.
Abstract
We introduce a novel reinforcement learning (RL) framework that treats parameterized action distributions as actions, redefining the boundary between agent and environment. This reparameterization makes the new action space continuous, regardless of the original action type (discrete, continuous, hybrid, etc.). Under this new parameterization, we develop a generalized deterministic policy gradient estimator, Distributions-as-Actions Policy Gradient (DA-PG), which has lower variance than the gradient in the original action space. Although learning the critic over distribution parameters poses new challenges, we introduce Interpolated Critic Learning (ICL), a simple yet effective strategy to enhance learning, supported by insights from bandit settings. Building on TD3, a strong baseline for continuous control, we propose a practical actor-critic algorithm, Distributions-as-Actions…
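The shift in the agent-environment boundary described above can be illustrated with a minimal sketch for the discrete case. This is not the paper's implementation: the names `DistributionActionWrapper`, `toy_bandit_step`, and `expected_reward` are hypothetical, and the three-armed bandit with fixed rewards is an assumed toy problem. The agent emits a probability vector (a point in a continuous space), and sampling the discrete action happens on the environment side.

```python
import random

# Assumed toy problem: a 3-armed bandit with fixed, deterministic rewards.
ARM_REWARDS = [0.0, 1.0, 0.5]


def toy_bandit_step(action):
    """Hypothetical environment step: discrete action -> (reward, done)."""
    return ARM_REWARDS[action], True


class DistributionActionWrapper:
    """Hypothetical wrapper: the agent's action is a probability vector over
    the original discrete actions; sampling is part of the environment."""

    def __init__(self, env_step, n_actions, seed=0):
        self.env_step = env_step
        self.n_actions = n_actions
        self.rng = random.Random(seed)

    def step(self, probs):
        # Validate that probs is a point on the probability simplex.
        if len(probs) != self.n_actions:
            raise ValueError("probs must have one entry per discrete action")
        if any(p < 0 for p in probs) or abs(sum(probs) - 1.0) > 1e-6:
            raise ValueError("probs must be non-negative and sum to 1")
        # The discrete action is drawn here, outside the agent.
        action = self.rng.choices(range(self.n_actions), weights=probs, k=1)[0]
        return self.env_step(action)


def expected_reward(probs):
    """In this toy bandit, the value of a distribution-action is the
    expected reward under that distribution (linear in probs)."""
    return sum(p * r for p, r in zip(probs, ARM_REWARDS))


env = DistributionActionWrapper(toy_bandit_step, n_actions=3)
reward, done = env.step([0.0, 1.0, 0.0])  # a one-hot vector always picks arm 1
```

In this toy setting a critic over distribution parameters would be approximating `expected_reward`, a smooth function of `probs`, which is what makes a deterministic-policy-gradient-style update applicable even though the underlying actions are discrete.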