On the Effect of Regularization in Policy Mirror Descent
Jan Felix Kleuker, Aske Plaat, Thomas Moerland

TL;DR
This paper empirically investigates how the two regularization components of Policy Mirror Descent, the trust-region distance term and the reward (MDP) regularizer, influence stability and robustness in reinforcement learning, and finds that their precise combination is critical for robust performance.
Contribution
It provides a large-scale empirical analysis, covering over 500k training seeds on small RL environments, of the interaction between the trust-region and reward regularizers in PMD, highlighting their combined effect on robustness.
Findings
The two regularizers can partially substitute for each other.
The precise combination of the two regularizers is critical for robust performance.
RL performance is sensitive to the choice of hyperparameters.
Abstract
Policy Mirror Descent (PMD) has emerged as a unifying framework in reinforcement learning (RL) by linking policy gradient methods with a first-order optimization method known as mirror descent. At its core, PMD incorporates two key regularization components: (i) a distance term that enforces a trust region for stable policy updates and (ii) an MDP regularizer that augments the reward function to promote structure and robustness. While PMD has been extensively studied in theory, empirical investigations remain scarce. This work provides a large-scale empirical analysis of the interplay between these two regularization techniques, running over 500k training seeds on small RL environments. Our results demonstrate that, although the two regularizers can partially substitute each other, their precise combination is critical for achieving robust performance. These findings highlight the…
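The two regularizers described in the abstract can be made concrete with a minimal tabular sketch. It assumes one common instantiation that the abstract does not specify: a KL divergence as the distance term and negative Shannon entropy as the MDP regularizer. Under that assumption the update has the closed form pi_new(a|s) ∝ pi(a|s)^(1/(1+eta*tau)) * exp(eta*Q(s,a)/(1+eta*tau)). The names pmd_step, eta, and tau are illustrative and not taken from the paper.

```python
import numpy as np


def pmd_step(pi, q, eta, tau):
    """One tabular PMD update (assumed: KL distance term, entropy regularizer).

    pi:  (S, A) current policy, strictly positive, rows sum to 1
    q:   (S, A) action-value estimates for the current policy
    eta: step size; a smaller eta means a tighter trust region
    tau: entropy temperature of the reward regularizer; tau = 0 disables it
    """
    # Closed-form maximizer of eta * (<q, p> + tau * H(p)) - KL(p || pi)
    # over the probability simplex, solved independently for each state.
    scale = 1.0 / (1.0 + eta * tau)
    logits = scale * (np.log(pi) + eta * q)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum(axis=1, keepdims=True)


# Hypothetical one-state, two-action example.
pi = np.array([[0.5, 0.5]])
q = np.array([[1.0, 0.0]])
trust_region_only = pmd_step(pi, q, eta=1.0, tau=0.0)
both_regularizers = pmd_step(pi, q, eta=1.0, tau=1.0)
```

In this sketch a small eta keeps the new policy close to the old one, while a larger tau pulls the update towards the uniform policy, so the two knobs overlap in effect without being interchangeable.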
