TL;DR
This paper introduces Hybrid Policy Optimization (HPO), a reinforcement learning method for hybrid discrete-continuous action spaces. HPO uses an unbiased mixed gradient estimator that combines pathwise and score-function gradients, and it outperforms PPO on inventory control and switched LQ problems.
Contribution
HPO combines pathwise gradients, backpropagated through the simulator wherever it is smooth, with score-function gradients. This addresses the credit-assignment problems of purely score-function methods and enables scalable, unbiased policy optimization in hybrid action spaces.
Findings
HPO outperforms PPO on inventory control and switched LQ problems.
The performance gap widens as the dimension of the continuous action grows.
The mixed gradient's cross term diminishes near a discrete best response, enabling decentralized updates.
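One common intuition for why the dimension of the continuous action matters is that score-function estimators get much noisier as that dimension grows, while pathwise estimators degrade far more slowly. The sketch below illustrates this on a toy problem and is not an experiment from the paper: the quadratic reward, the Gaussian policy, and the function name `gradient_variances` are all assumptions made for illustration.

```python
import numpy as np

def gradient_variances(dim, sigma=0.5, n_samples=20000, seed=0):
    """Total per-sample variance of two gradient estimators w.r.t. the mean mu.

    Toy setup (assumed, not from the paper): a = mu + sigma * eps with
    eps ~ N(0, I_dim), reward r(a) = -||a - target||^2, mu = 0, target = 1.
    """
    rng = np.random.default_rng(seed)
    mu = np.zeros(dim)
    target = np.ones(dim)
    eps = rng.standard_normal((n_samples, dim))
    a = mu + sigma * eps
    r = -np.sum((a - target) ** 2, axis=1)

    # Score-function estimator: r(a) * grad_mu log N(a; mu, sigma^2 I) = r * eps / sigma
    sf = r[:, None] * eps / sigma
    # Pathwise estimator: (dr/da) * (da/dmu) = -2 * (a - target)
    pw = -2.0 * (a - target)

    # Sum of per-coordinate variances (trace of the covariance matrix).
    return sf.var(axis=0).sum(), pw.var(axis=0).sum()

for d in (1, 4, 16, 64):
    sf_var, pw_var = gradient_variances(d)
    print(f"dim={d:3d}  SF var={sf_var:14.1f}  pathwise var={pw_var:8.2f}  "
          f"ratio={sf_var / pw_var:10.1f}")
```

Both estimators target the same gradient in this toy setting; only their variance differs, and the ratio between the two grows with `dim`.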
Abstract
We study reinforcement learning in hybrid discrete-continuous action spaces, such as settings where the discrete component selects a regime (or index) and the continuous component optimizes within it -- a structure common in robotics, control, and operations problems. Standard model-free policy gradient methods rely on score-function (SF) estimators and suffer from severe credit-assignment issues in high-dimensional settings, leading to poor gradient quality. On the other hand, differentiable simulation largely sidesteps these issues by backpropagating through a simulator, but the presence of discrete actions or non-smooth dynamics yields biased or uninformative gradients. To address this, we propose Hybrid Policy Optimization (HPO), which backpropagates through the simulator wherever smoothness permits, using a mixed gradient estimator that combines pathwise and SF gradients while…
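To make the idea of a mixed estimator concrete, here is a minimal one-step sketch, assuming a toy problem that is not taken from the paper: a discrete action k is drawn from a softmax over `logits`, a continuous action is sampled as a = mu[k] + sigma * eps, and the reward is the smooth function -(a - targets[k])^2. The discrete parameters receive a score-function gradient and the continuous parameters receive a pathwise gradient. All names (`mixed_gradient`, `exact_gradient`, `targets`) are hypothetical, and the paper's actual estimator for multi-step simulators may differ.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def mixed_gradient(logits, mu, targets, sigma=0.1, n_samples=20000, seed=0):
    """Monte Carlo mixed gradient of E[r] for a toy one-step hybrid problem."""
    rng = np.random.default_rng(seed)
    p = softmax(logits)
    n_modes = len(logits)
    k = rng.choice(n_modes, size=n_samples, p=p)   # discrete action
    eps = rng.standard_normal(n_samples)
    a = mu[k] + sigma * eps                        # reparameterized continuous action
    r = -(a - targets[k]) ** 2                     # smooth reward (assumed)

    onehot = np.eye(n_modes)[k]
    # Score-function part: r * grad_logits log pi(k) = r * (onehot(k) - p)
    g_logits = (r[:, None] * (onehot - p)).mean(axis=0)
    # Pathwise part: dr/da = -2 (a - target_k), and da/dmu_j = 1 only for j = k
    dr_da = -2.0 * (a - targets[k])
    g_mu = (dr_da[:, None] * onehot).mean(axis=0)
    return g_logits, g_mu

def exact_gradient(logits, mu, targets, sigma=0.1):
    """Closed-form gradient of J = sum_k p_k * (-(mu_k - t_k)^2 - sigma^2)."""
    p = softmax(logits)
    q = -(mu - targets) ** 2 - sigma ** 2          # expected reward per mode
    return p * (q - p @ q), -2.0 * p * (mu - targets)

logits = np.array([0.0, 0.5, -0.5])
mu = np.array([0.0, 1.0, -1.0])
targets = np.array([1.0, 0.5, 0.0])
print("estimate:", mixed_gradient(logits, mu, targets, n_samples=200000))
print("exact:   ", exact_gradient(logits, mu, targets))
```

In this toy case the Monte Carlo estimate should agree with the closed-form gradient up to sampling noise, which is a quick sanity check that neither part of the estimator introduces bias here.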
