Model-based Reinforcement Learning for Parameterized Action Spaces
Renhao Zhang, Haotian Fu, Yilin Miao, George Konidaris

TL;DR
This paper introduces DLPA, a model-based reinforcement learning algorithm for parameterized action spaces that learns dynamics models and plans with predictive control, demonstrating improved sample efficiency and performance.
Contribution
The paper presents a novel algorithm combining dynamics learning and predictive control specifically designed for parameterized action spaces in reinforcement learning.
Findings
Achieves superior sample efficiency on benchmark tasks.
Outperforms existing PAMDP methods in asymptotic performance.
Provides theoretical analysis of trajectory optimality differences.
Abstract
We propose a novel model-based reinforcement learning algorithm -- Dynamics Learning and predictive control with Parameterized Actions (DLPA) -- for Parameterized Action Markov Decision Processes (PAMDPs). The agent learns a parameterized-action-conditioned dynamics model and plans with a modified Model Predictive Path Integral control. We theoretically quantify the difference between the generated trajectory and the optimal trajectory during planning in terms of the value they achieve, through the lens of Lipschitz continuity. Our empirical results on several standard benchmarks show that our algorithm achieves better sample efficiency and asymptotic performance than state-of-the-art PAMDP methods.
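The planning loop described in the abstract can be illustrated with a minimal sketch, assuming a generic sampling-based planner in the spirit of MPPI rather than the paper's exact modification. Each parameterized action is a pair (k, z): a discrete action type k and its continuous parameters z. The function `plan_parameterized_action`, the categorical-plus-Gaussian sampling distribution, the clipping of parameters to [-1, 1], and the toy `toy_dynamics` and `toy_reward` functions are all assumptions made for illustration; in DLPA the dynamics and reward would come from learned models.

```python
import numpy as np


def plan_parameterized_action(state, dynamics, reward, n_discrete, param_dim,
                              horizon=5, n_samples=256, n_iters=4,
                              temperature=1.0, rng=None):
    """Return (k, z) for the first step of an MPPI-style plan (illustrative only)."""
    rng = np.random.default_rng() if rng is None else rng
    # Assumed sampling distribution: a categorical over action types per step,
    # and a diagonal Gaussian over parameters per (step, action type).
    probs = np.full((horizon, n_discrete), 1.0 / n_discrete)
    mean = np.zeros((horizon, n_discrete, param_dim))
    std = np.ones((horizon, n_discrete, param_dim))
    t_idx = np.arange(horizon)[None, :]

    for _ in range(n_iters):
        # Sample action types, shape (n_samples, horizon).
        ks = np.stack([rng.choice(n_discrete, size=n_samples, p=probs[t])
                       for t in range(horizon)], axis=1)
        # Sample parameters for the chosen types, shape (n_samples, horizon, param_dim).
        noise = rng.standard_normal((n_samples, horizon, param_dim))
        zs = np.clip(mean[t_idx, ks] + std[t_idx, ks] * noise, -1.0, 1.0)

        # Roll each sampled action sequence through the model and sum rewards.
        returns = np.zeros(n_samples)
        for i in range(n_samples):
            s = state
            for t in range(horizon):
                returns[i] += reward(s, ks[i, t], zs[i, t])
                s = dynamics(s, ks[i, t], zs[i, t])

        # Exponentially weight trajectories by return (softmax over samples).
        w = np.exp((returns - returns.max()) / temperature)
        w /= w.sum()

        # Refit the sampling distribution to the weighted samples.
        for t in range(horizon):
            for k in range(n_discrete):
                mask = ks[:, t] == k
                wk = w[mask].sum()
                probs[t, k] = wk
                if wk > 1e-8:
                    wn = w[mask] / wk
                    mean[t, k] = (wn[:, None] * zs[mask, t]).sum(axis=0)
                    var = (wn[:, None] * (zs[mask, t] - mean[t, k]) ** 2).sum(axis=0)
                    std[t, k] = np.sqrt(var) + 1e-3  # floor keeps exploration alive
            probs[t] = probs[t] + 1e-3  # smoothing so no type gets zero mass
            probs[t] /= probs[t].sum()

    k0 = int(np.argmax(probs[0]))
    return k0, mean[0, k0].copy()


# Hypothetical 1-D toy task: type 0 = "stay", type 1 = "move by z[0]".
def toy_dynamics(s, k, z):
    return s + z[0] if k == 1 else s


def toy_reward(s, k, z):
    # Negative squared distance of the next state from the target position 1.0.
    return -(toy_dynamics(s, k, z) - 1.0) ** 2
```

Calling `plan_parameterized_action(0.0, toy_dynamics, toy_reward, n_discrete=2, param_dim=1)` returns a discrete action index and a parameter vector for the first step only; in a receding-horizon setup the agent would execute that action, observe the next state, and replan.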
Peer Reviews
Decision: ICML 2024 Poster
+ The proposed framework is technically sound and the key idea of DLPA is clear.
+ Extensive experiments are given against multiple baseline algorithms.
- The novelty seems rather limited, as the main framework of DLPA follows the standard data-driven MPPI, with the contextualization from MDPs to PAMDPs. For example, an information-theoretic MPC framework was proposed in [G. Williams et al., ICRA 2017] to incorporate model learning into the MPPI-based planning procedure.
- Some key information is not provided, e.g. which neural network algorithms are used to optimize Eq. 1?
- No theoretical analysis is given to justify the performance applied
* The paper is generally well written. I was able to easily identify the research questions and understand the main contributions.
* The empirical study is excellent. The experiments are constructed around the standard methodology which is aimed at supporting its main claim for SOTA performance in PAMDP benchmarks. The benchmarks and baselines seem reasonable and fairly chosen. The results provide positive evidence for its SOTA claim and demonstrate a significant performance margin between the p
* Many of my comments on this paper are minor.
* The proposed algorithm seems limited to small-horizon problems, as its backward-pass computation and the number of parameters both seem to scale linearly in the horizon length.
* The proposed algorithm has a lot of hyperparameters, which could make the algorithm difficult to tune.
The key difference between previous work and DLPA is that model learning is performed by relying just on the initial state and the action trajectory instead of all intermediate transitions. This provides better inductive bias for longer time horizon tasks and is shown to be a critical component by the ablation study in Section 5.3. The reported results show that DLPA learns much faster (in terms of number of samples) compared to previous benchmarks which again can be the result of learning lon
The paper brings together PAMDPs and model predictive control in a very conventional way, so I do not find DLPA novel enough from a technical perspective. In terms of results, despite the reported sample efficiency, the testing environments, although used in previous work, still seem relatively simplistic, even compared to other game-based benchmarks such as Atari.
Taxonomy
Topics: Reinforcement Learning in Robotics
