Winner Takes It All: Training Performant RL Populations for Combinatorial Optimization
Nathan Grinsztajn, Daniel Furelos-Blanco, Shikha Surana, Clément Bonnet, Thomas D. Barrett

TL;DR
This paper introduces Poppy, a training method for populations of reinforcement learning policies that develop complementary specializations, achieving state-of-the-art RL results on multiple NP-hard combinatorial optimization problems.
Contribution
Poppy is a simple, unsupervised training procedure that induces diverse, complementary policies without relying on a predefined or hand-crafted notion of diversity; specialization is driven solely by maximizing the performance of the population.
Findings
Poppy produces complementary policy sets.
Poppy achieves state-of-the-art RL results on four NP-hard problems.
On these problems, it outperforms existing RL methods for combinatorial optimization.
Abstract
Applying reinforcement learning (RL) to combinatorial optimization problems is attractive as it removes the need for expert knowledge or pre-solved instances. However, it is unrealistic to expect an agent to solve these (often NP-)hard problems in a single shot at inference due to their inherent complexity. Thus, leading approaches often implement additional search strategies, from stochastic sampling and beam search to explicit fine-tuning. In this paper, we argue for the benefits of learning a population of complementary policies, which can be simultaneously rolled out at inference. To this end, we introduce Poppy, a simple training procedure for populations. Instead of relying on a predefined or hand-crafted notion of diversity, Poppy induces an unsupervised specialization targeted solely at maximizing the performance of the population. We show that Poppy produces a set of…
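To illustrate rolling out a population at inference and keeping the best solution, the following Python sketch uses a toy Euclidean TSP, with hand-written heuristics standing in for learned policies. All names (`population_rollout`, `winner_takes_all_signal`, `nearest_neighbour_from`) are hypothetical. The winner-only training signal, scaled by the margin over the runner-up, is one plausible reading of the "winner takes it all" idea in the title; it is an assumption here, not a specification of Poppy's actual update rule.

```python
import random
from typing import Callable, List, Sequence, Tuple

Instance = Sequence[Tuple[float, float]]   # toy TSP instance: 2-D city coordinates
Policy = Callable[[Instance], List[int]]   # maps an instance to a tour (city order)


def tour_length(instance: Instance, tour: List[int]) -> float:
    """Length of the closed tour that visits cities in the given order."""
    total = 0.0
    for i in range(len(tour)):
        x1, y1 = instance[tour[i]]
        x2, y2 = instance[tour[(i + 1) % len(tour)]]
        total += ((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5
    return total


def nearest_neighbour_from(start: int) -> Policy:
    """Stand-in for a learned policy: greedy nearest-neighbour from a fixed start."""
    def policy(instance: Instance) -> List[int]:
        tour = [start % len(instance)]
        unvisited = set(range(len(instance))) - {tour[0]}
        while unvisited:
            cx, cy = instance[tour[-1]]
            nxt = min(unvisited,
                      key=lambda j: (instance[j][0] - cx) ** 2 + (instance[j][1] - cy) ** 2)
            tour.append(nxt)
            unvisited.remove(nxt)
        return tour
    return policy


def population_rollout(policies: Sequence[Policy], instance: Instance):
    """Roll out every policy once; reward is the negative tour length."""
    rewards = [-tour_length(instance, p(instance)) for p in policies]
    best = max(range(len(rewards)), key=rewards.__getitem__)
    return best, rewards


def winner_takes_all_signal(rewards: Sequence[float]) -> List[float]:
    """Assumed credit assignment: only the best policy on this instance gets a
    nonzero signal, equal to its margin over the runner-up."""
    order = sorted(range(len(rewards)), key=rewards.__getitem__, reverse=True)
    signal = [0.0] * len(rewards)
    if len(order) > 1:
        signal[order[0]] = rewards[order[0]] - rewards[order[1]]
    return signal


rng = random.Random(0)
instance = [(rng.random(), rng.random()) for _ in range(12)]
policies = [nearest_neighbour_from(s) for s in range(4)]  # population of 4

best, rewards = population_rollout(policies, instance)
print("best policy:", best, "tour length:", -rewards[best])
print("training signal:", winner_takes_all_signal(rewards))
```

Under this assumed signal, a policy is only reinforced on instances where it already beats the rest of the population, which is one way a population could specialize without any explicit diversity term.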
Taxonomy
Topics: Vehicle Routing Optimization Methods · Metaheuristic Optimization Algorithms Research · Auction Theory and Applications
