Poly-EPO: Training Exploratory Reasoning Models
Ifdita Hasan Orney, Jubayer Ibn Hamid, Shreya S Ramanujam, Shirley Wu, Hengyuan Hu, Noah Goodman, Dorsa Sadigh, Chelsea Finn

TL;DR
Poly-EPO is a post-training framework for language models that explicitly encourages exploratory reasoning, with the aim of better generalization and more diverse responses on complex reasoning tasks.
Contribution
The paper develops a general recipe for optimizing language models with set reinforcement learning (set RL) and introduces Poly-EPO, a new objective that promotes a synergy between exploration and exploitation.
Findings
Poly-EPO improves reasoning benchmark performance.
Models trained with Poly-EPO generate more diverse responses.
Poly-EPO scales effectively with increased test-time compute.
Abstract
Exploration is a cornerstone of learning from experience: it enables agents to find solutions to complex problems, generalize to novel ones, and scale performance with test-time compute. In this paper, we present a framework for post-training language models (LMs) that explicitly encourages optimistic exploration and promotes a synergy between exploration and exploitation. The central idea is to train the LM to generate sets of responses that are collectively accurate under the reward function and exploratory in their reasoning strategies. We first develop a general recipe for optimizing LMs with set reinforcement learning (set RL) under arbitrary objective functions, showing how standard RL algorithms can be adapted to this setting through a modification to the advantage computation. We then propose Polychromic Exploratory Policy Optimization (Poly-EPO), which instantiates this…
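The set RL idea in the abstract, scoring sets of responses and adapting standard RL through the advantage computation, can be illustrated with a small sketch. This is not the paper's algorithm: the set objective (`set_score`), the discrete strategy labels, and the choice to credit each response with the mean score of the sets containing it minus the mean score over all sets are hypothetical, chosen only to show what a set-level advantage could look like.

```python
from itertools import combinations
from statistics import mean


def set_score(rewards, strategies):
    """Hypothetical set objective: mean reward scaled by the fraction of distinct strategies."""
    return mean(rewards) * (len(set(strategies)) / len(strategies))


def set_advantages(rewards, strategies, k):
    """Assumed per-response advantage: mean score of the size-k sets containing
    the response, minus the mean score over all size-k sets (the baseline)."""
    n = len(rewards)
    subsets = list(combinations(range(n), k))
    scores = [
        set_score([rewards[i] for i in s], [strategies[i] for i in s])
        for s in subsets
    ]
    baseline = mean(scores)
    advantages = []
    for i in range(n):
        containing = [sc for s, sc in zip(subsets, scores) if i in s]
        advantages.append(mean(containing) - baseline)
    return advantages


# Four sampled responses to one prompt: three correct, two sharing strategy "a".
adv = set_advantages([1.0, 1.0, 0.0, 1.0], ["a", "a", "b", "c"], k=2)
```

In this toy setup, the correct response with a unique strategy (index 3) gets the largest advantage, and the two correct responses that duplicate each other's strategy get less. That matches the intent of rewarding sets that are collectively accurate and exploratory.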
