Multiple-Frequencies Population-Based Training

Waël Doulazmi; Auguste Lehuger; Marin Toromanoff; Valentin Charraut; Thibault Buhet; Fabien Moutarde

arXiv:2506.03225·cs.LG·July 18, 2025

Open Access·3 Reviews

TL;DR

This paper introduces MF-PBT, a novel hyperparameter optimization method that uses multiple evolution frequencies and migration between sub-populations to improve reinforcement learning performance and avoid local optima.

Contribution

MF-PBT is the first approach to employ multiple evolution frequencies and migration in population-based training for better long-term optimization.

Findings

01 MF-PBT outperforms standard PBT and random search in sample efficiency.

02 MF-PBT achieves better long-term performance on Brax benchmarks.

03 Using multiple frequencies helps balance short-term gains and long-term progress.
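To make the idea of multiple evolution frequencies concrete, here is a minimal Python sketch of a schedule in which each sub-population runs its selection step at its own interval. The function name `due_subpopulations` and the interval values are hypothetical, chosen only for illustration; they are not taken from the paper, and migration between sub-populations is not modelled.

```python
def due_subpopulations(step, intervals):
    """Return the indices of the sub-populations whose selection step is due.

    `intervals[i]` is the number of training steps between two selection
    steps of sub-population i (an illustrative assumption, not the paper's
    configuration).
    """
    if step <= 0:
        return []  # nothing to select before any training has happened
    return [i for i, every in enumerate(intervals) if step % every == 0]


# Hypothetical intervals: one fast, one medium, one slow sub-population.
intervals = [1_000, 10_000, 100_000]

for step in (1_000, 10_000, 100_000):
    print(step, due_subpopulations(step, intervals))
```

Under this schedule the fast sub-population is selected often and chases short-term gains, while the slow one lets agents train for much longer between selections.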

Abstract

Reinforcement Learning's high sensitivity to hyperparameters is a source of instability and inefficiency, creating significant challenges for practitioners. Hyperparameter Optimization (HPO) algorithms have been developed to address this issue; among them, Population-Based Training (PBT) stands out for its ability to generate hyperparameter schedules instead of fixed configurations. PBT trains a population of agents, each with its own hyperparameters, frequently ranking them and replacing the worst performers with mutations of the best agents. These intermediate selection steps can cause PBT to focus on short-term improvements, leading it to get stuck in local optima and eventually fall behind vanilla Random Search over longer timescales. This paper studies how this greediness issue is connected to the choice of evolution frequency, the rate at which selection is performed. We propose…
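The selection step of standard PBT described in the abstract (rank the agents, then replace the worst performers with mutations of the best) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the `Agent` container, the 25% truncation fraction, and the multiplicative perturbation factors are all assumptions made for the example.

```python
import random
from dataclasses import dataclass


@dataclass
class Agent:
    hyperparams: dict   # e.g. {"lr": 3e-4}
    weights: list       # stand-in for the network parameters
    score: float = 0.0  # most recent evaluation result


def pbt_evolve(population, frac=0.25, factors=(0.8, 1.25), rng=random):
    """One PBT selection step (illustrative; frac and factors are assumed).

    The worst `frac` of the agents each copy the weights of a randomly
    chosen top agent (exploit) and receive a perturbed copy of its
    hyperparameters (explore). Scores are left untouched until the next
    evaluation.
    """
    ranked = sorted(population, key=lambda a: a.score, reverse=True)
    n = max(1, int(len(ranked) * frac))
    top, bottom = ranked[:n], ranked[-n:]
    for loser in bottom:
        winner = rng.choice(top)
        loser.weights = list(winner.weights)
        loser.hyperparams = {
            name: value * rng.choice(factors)
            for name, value in winner.hyperparams.items()
        }
    return population


# Usage: four agents whose score equals their index.
pop = [Agent({"lr": 1.0}, [float(i)], score=float(i)) for i in range(4)]
pbt_evolve(pop, rng=random.Random(0))
```

How often `pbt_evolve` is called relative to training is the evolution frequency that the abstract identifies as the source of greediness.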

Peer Reviews

Decision·ICLR 2025 Conference Withdrawn Submission

Reviewer 01·Rating 3·Confidence 4

Strengths

1. The idea of using multiple frequencies to mitigate PBT’s greediness is innovative and addresses a significant limitation in existing methods.
2. The asymmetric migration process is a practical solution that could be applied to other population-based methods, such as PB2, to improve their performance.

Weaknesses

1. All experiments were conducted on a handful of locomotion tasks in the reinforcement learning setting, which I believe is unacceptable. HPO has many application scenarios, and I strongly recommend increasing the variety, following the experiments in PB2 and PB2-Mix.
2. Comparison with other HPO methods (such as Bayesian optimization, e.g., [1-2]) is also recommended. The experiments from those works should also be run. [1] Deep power laws for hyperparameter optimization. [2] In-Contex

Reviewer 02·Rating 8·Confidence 4

Strengths

1. Novel and well-motivated idea of having multiple evolutionary frequencies within a population. The issue of hyperparameter sensitivity in Reinforcement Learning is well known, and the paper makes a good attempt at addressing it.
2. Strong empirical validation showing benefits over standard PBT, and a smart use of random search to make sure greediness is avoided.
3. Important insights about separating parameter and hyperparameter migration to prevent collapse are substantiated and that is a useful

Weaknesses

## Major Weaknesses
1. The evaluation and definition of "anytime performance" would benefit from formalization to make the notion more rigorous.
2. Current results don't fully demonstrate whether better early performance is due to high-frequency evolution or to other factors, such as good performance at the beginning of training.
3. Limited exploration of different frequency spreads, which are not given enough weight in the main results (only two are tested, in a supplementary section, although they are important to explore).
4

Reviewer 03·Rating 6·Confidence 5

Strengths

The work addresses an inherent issue of population based training style hyperparameter optimization. It clearly communicates the cause of these shortcomings and uses them to motivate the proposed changes to the PBT regime. The work aims to provide an extensive empirical analysis based on the speedups gained by the Brax framework.

Weaknesses

The work does not do a good job of showing that the proposed sub-population scheme actually improves the PBT training regime. In particular, the work identifies the frequency with which PBT calls the exploit-explore/mutation step as the root cause of PBT's inherent greediness. However, I believe this ignores interaction effects of multiple hyperparameters in the PBT regime. First and foremost, the population size and the selection criteria play a crucial role. With a much larger populatio

Taxonomy

Topics: Advanced Thermodynamic Systems and Engines