PopuLoRA: Co-Evolving LLM Populations for Reasoning Self-Play
Roger Creus Castanyer, Geoffrey Bradway, Lorenz Wolf, Maxwill Lin, Augustine N. Mavor-Parker, Matthew James Sargent

TL;DR
PopuLoRA is a population-based self-play framework for reinforcement learning with LLMs. Teacher and student adapters co-evolve, so problem difficulty and solving ability rise together, and the authors report improved performance on reasoning and coding benchmarks.
Contribution
The paper introduces a population-based asymmetric self-play method that uses LoRA weight-space evolution operators to train LLMs in a co-evolutionary setting.
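To make the idea of a weight-space evolution operator concrete, here is a minimal sketch in plain Python. It is not the paper's operator family, which this summary does not spell out. It assumes a LoRA adapter is a pair of low-rank factors `A` (rank x d_in) and `B` (d_out x rank), and it shows two generic choices: Gaussian mutation and linear crossover. The names `mutate`, `crossover`, `random_adapter`, `sigma`, and `alpha` are all hypothetical.

```python
import random

Matrix = list[list[float]]
Adapter = dict[str, Matrix]  # assumed layout: {"A": rank x d_in, "B": d_out x rank}


def random_adapter(d_out: int, d_in: int, rank: int, rng: random.Random) -> Adapter:
    """Build a toy adapter with small random factors (illustration only)."""
    return {
        "A": [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(rank)],
        "B": [[rng.gauss(0.0, 0.02) for _ in range(rank)] for _ in range(d_out)],
    }


def mutate(adapter: Adapter, sigma: float, rng: random.Random) -> Adapter:
    """Gaussian mutation: add N(0, sigma^2) noise to every entry of A and B."""
    return {
        name: [[w + rng.gauss(0.0, sigma) for w in row] for row in mat]
        for name, mat in adapter.items()
    }


def crossover(parent_a: Adapter, parent_b: Adapter, alpha: float) -> Adapter:
    """Linear crossover: interpolate each factor of two same-shape adapters."""
    return {
        name: [
            [alpha * x + (1.0 - alpha) * y for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(parent_a[name], parent_b[name])
        ]
        for name in parent_a
    }


def shape(mat: Matrix) -> tuple[int, int]:
    return (len(mat), len(mat[0]))


rng = random.Random(0)
p1 = random_adapter(d_out=8, d_in=6, rank=2, rng=rng)
p2 = random_adapter(d_out=8, d_in=6, rank=2, rng=rng)

# The child keeps the parents' rank because both operators act entrywise.
child = mutate(crossover(p1, p2, alpha=0.5), sigma=0.01, rng=rng)
print(shape(child["A"]), shape(child["B"]))
```

Both operators touch only the small adapter factors and leave the frozen base untouched, so a new member costs little to produce. Note that interpolating `A` and `B` separately is not the same as interpolating the product `B A`; which of these the paper uses is not stated in this summary.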
Findings
The population outperforms the single-agent baseline on multiple benchmarks.
Co-evolution produces increasingly complex problems and diverse solutions.
Even the weakest population member surpasses the baseline in aggregate.
Abstract
We introduce PopuLoRA, a population-based asymmetric self-play framework for reinforcement learning with verifiable rewards (RLVR) post-training of LLMs. Teachers and students are specialised LoRA adapters on a shared frozen base: teachers propose problems, matched students solve them under a programmatic verifier, and cross-evaluation between sub-populations replaces the self-calibration that limits single-agent self-play. A family of LoRA weight-space evolution operators (mutations and crossovers that produce same-rank population members in seconds) serves as the replacement step of a population-based training loop at 7B scale. We instantiate PopuLoRA on top of Absolute Zero Reasoner and compare it against a per-adapter compute-matched single-agent baseline. Where the single agent self-calibrates to generating easy problems it can reliably solve, the population enters a…
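The teacher/student loop described in the abstract can be illustrated with a toy cross-evaluation round. This is only a sketch under stated assumptions: the teachers and students are plain Python functions standing in for LoRA adapters, the task is integer addition, every teacher is scored against every student (the abstract mentions matching but does not say how it is done), and `teacher_reward` is a hypothetical learnability-style reward rather than the paper's confirmed formula.

```python
def verifier(problem, answer):
    """Programmatic check: the toy task is integer addition."""
    a, b = problem
    return answer == a + b


def teacher_reward(success_rate):
    """Hypothetical learnability-style reward: zero when every student or no
    student solves the problem, otherwise higher for harder problems."""
    if success_rate in (0.0, 1.0):
        return 0.0
    return 1.0 - success_rate


def cross_evaluate(teachers, students):
    """Score each teacher's problem against every student in the population."""
    teacher_scores = {}
    student_scores = {name: 0 for name in students}
    for t_name, propose in teachers.items():
        problem = propose()
        solved = 0
        for s_name, solve in students.items():
            if verifier(problem, solve(problem)):
                solved += 1
                student_scores[s_name] += 1
        teacher_scores[t_name] = teacher_reward(solved / len(students))
    return teacher_scores, student_scores


# Toy stand-ins for teacher adapters: each proposes one fixed problem.
teachers = {
    "easy": lambda: (1, 2),
    "medium": lambda: (50, 70),
    "hard": lambda: (5000, 7000),
}

# Toy stand-ins for student adapters: each solves only small enough inputs.
students = {
    "weak": lambda p: p[0] + p[1] if max(p) < 10 else -1,
    "strong": lambda p: p[0] + p[1] if max(p) < 100 else -1,
}

teacher_scores, student_scores = cross_evaluate(teachers, students)
print(teacher_scores)  # {'easy': 0.0, 'medium': 0.5, 'hard': 0.0}
print(student_scores)  # {'weak': 1, 'strong': 2}
```

Because a teacher is rewarded by how other models fare on its problems rather than by its own success rate, proposing problems that everyone solves earns nothing. This is one way to read the abstract's contrast with a single agent that self-calibrates toward easy problems.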
