Simultaneous Training of First- and Second-Order Optimizers in Population-Based Reinforcement Learning
Felix Pfeiffer, Shahram Eivazi

TL;DR
This paper introduces a novel approach to population-based reinforcement learning by simultaneously training first- and second-order optimizers, leading to improved performance and stability across various environments.
Contribution
This is the first work to empirically demonstrate the benefits of integrating second-order optimizers such as K-FAC into population-based RL training.
Findings
Up to 10% performance improvement with combined optimizers.
Enhanced training stability in challenging environments.
Reliable learning outcomes with mixed optimizer populations.
Abstract
Hyperparameter tuning is critical in reinforcement learning (RL), because these parameters strongly affect an agent's performance and learning efficiency. Adjusting hyperparameters dynamically during training can improve both the performance and the stability of learning. Population-based training (PBT) provides a way to do this by continuously tuning hyperparameters throughout training. This ongoing adjustment lets models adapt to different learning stages, resulting in faster convergence and better overall performance. In this paper, we propose an enhancement to PBT that uses both first- and second-order optimizers simultaneously within a single population. We conducted a series of experiments using the TD3 algorithm across various MuJoCo environments. Our results, for the first time, empirically demonstrate the potential of…
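To make the idea of a mixed-optimizer population concrete, here is a minimal sketch of a PBT loop in which half the members use a first-order update and half use a second-order one. It is an illustration under stated assumptions, not the authors' implementation. The TD3 agent and MuJoCo environments are replaced by a toy quadratic loss, K-FAC is replaced by exact diagonal curvature preconditioning, and all names (`Member`, `train_step`, `exploit_and_explore`, `run_pbt`) are hypothetical. Keeping each member's optimizer type fixed during the exploit step is also an assumption of this sketch.

```python
import random
from dataclasses import dataclass

# Toy stand-in for an RL objective: a badly conditioned quadratic (assumption).
CURVATURE = [1.0, 10.0]

@dataclass
class Member:
    optimizer: str              # "first_order" or "second_order" (hypothetical labels)
    lr: float                   # the hyperparameter tuned by PBT
    theta: list                 # model parameters
    score: float = float("-inf")

def loss(theta):
    return 0.5 * sum(c * t * t for c, t in zip(CURVATURE, theta))

def train_step(member):
    grad = [c * t for c, t in zip(CURVATURE, member.theta)]
    if member.optimizer == "second_order":
        # Precondition with the exact inverse curvature. This is a simple
        # stand-in for a K-FAC style approximation, not K-FAC itself.
        grad = [g / c for g, c in zip(grad, CURVATURE)]
    member.theta = [t - member.lr * g for t, g in zip(member.theta, grad)]
    member.score = -loss(member.theta)  # higher is better

def exploit_and_explore(population, rng, frac=0.25):
    ranked = sorted(population, key=lambda m: m.score, reverse=True)
    k = max(1, int(len(ranked) * frac))
    top, bottom = ranked[:k], ranked[-k:]
    for loser in bottom:
        winner = rng.choice(top)
        # Exploit: copy parameters and score from a top performer.
        loser.theta = list(winner.theta)
        loser.score = winner.score
        # Explore: perturb the copied learning rate.
        loser.lr = winner.lr * rng.choice([0.8, 1.2])
        # Assumption: the loser keeps its own optimizer type.

def run_pbt(steps=50, interval=10, seed=0):
    rng = random.Random(seed)
    kinds = ["first_order"] * 4 + ["second_order"] * 4
    population = [Member(kind, rng.uniform(0.01, 0.05), [1.0, 1.0]) for kind in kinds]
    for step in range(1, steps + 1):
        for member in population:
            train_step(member)
        if step % interval == 0:
            exploit_and_explore(population, rng)
    return population

if __name__ == "__main__":
    for m in sorted(run_pbt(), key=lambda m: m.score, reverse=True):
        print(f"{m.optimizer:>12}  lr={m.lr:.4f}  loss={-m.score:.6f}")
```

Because parameters can flow between members of different optimizer types at each exploit step, a first-order member can continue training from weights found by a second-order member and vice versa. This is one plausible way to read "simultaneously utilizing both first- and second-order optimizers within a single population"; the paper's actual exploit and explore rules may differ.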
Taxonomy
Topics: Evolutionary Algorithms and Applications · Metaheuristic Optimization Algorithms Research · Reinforcement Learning in Robotics
Methods: Dense Connections · Target Policy Smoothing · Clipped Double Q-learning · Experience Replay · Adam · Twin Delayed Deep Deterministic
