Maximum Entropy Population-Based Training for Zero-Shot Human-AI Coordination
Rui Zhao, Jinming Song, Yufeng Yuan, Hu Haifeng, Yang Gao, Yi Wu, Zhongqian Sun, Yang Wei

TL;DR
This paper introduces Maximum Entropy Population-based training (MEP), a method to train RL agents that collaborate effectively with humans without human data, by promoting diversity and mitigating distributional shift.
Contribution
The paper proposes MEP, a novel training approach that enhances human-AI collaboration by maintaining diversity in agent populations and dynamically prioritizing training partners.
Findings
MEP outperforms existing methods like SP, PBT, TrajeDi, and FCP in Overcooked.
Agents trained with MEP show improved robustness with human partners.
Diversity promotion reduces distributional shift in human-AI collaboration.
Abstract
We study the problem of training a Reinforcement Learning (RL) agent that is collaborative with humans without using any human data. Although such agents can be obtained through self-play training, they can suffer significantly from distributional shift when paired with unencountered partners, such as humans. To mitigate this distributional shift, we propose Maximum Entropy Population-based training (MEP). In MEP, agents in the population are trained with our derived Population Entropy bonus to promote both pairwise diversity between agents and individual diversity of agents themselves, and a common best agent is trained by pairing with agents in this diversified population via prioritized sampling. The prioritization is dynamically adjusted based on the training progress. We demonstrate the effectiveness of our method MEP, with comparison to Self-Play PPO (SP), Population-Based Training…
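The two mechanisms named in the abstract can be illustrated with a small sketch. It is not the authors' implementation. It assumes that the Population Entropy at a state is the entropy of the population's mean action distribution, and that prioritized sampling is rank-based, giving partners with lower current return a higher sampling probability. The function names `population_entropy` and `rank_priorities` and the exponent `beta` are hypothetical, chosen for illustration.

```python
import math


def population_entropy(action_probs):
    """Entropy of the mean action distribution of a population at one state.

    action_probs: list of per-agent probability lists over the same actions.
    Assumption: the bonus is computed on the averaged policy.
    """
    n_agents = len(action_probs)
    n_actions = len(action_probs[0])
    mean_policy = [
        sum(p[a] for p in action_probs) / n_agents for a in range(n_actions)
    ]
    # Skip zero-probability actions, since 0 * log(0) is taken as 0.
    return -sum(q * math.log(q) for q in mean_policy if q > 0.0)


def rank_priorities(partner_returns, beta=1.0):
    """Rank-based sampling probabilities over training partners.

    Assumption: the partner with the lowest return gets the highest rank,
    so harder partners are sampled more often. beta=0 gives uniform sampling.
    """
    n = len(partner_returns)
    # Sort partner indices from highest return to lowest.
    order = sorted(range(n), key=lambda i: partner_returns[i], reverse=True)
    ranks = [0] * n
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank  # best partner gets rank 1, worst gets rank n
    weights = [r ** beta for r in ranks]
    total = sum(weights)
    return [w / total for w in weights]
```

Under these assumptions, two deterministic agents that always pick different actions yield a higher population entropy than two identical agents, and recomputing `rank_priorities` as partner returns change is one way the prioritization could be adjusted during training.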
Taxonomy
Topics: Reinforcement Learning in Robotics
Methods: Entropy Regularization · Proximal Policy Optimization
