POETS: Uncertainty-Aware LLM Optimization via Compute-Efficient Policy Ensembles
Nicolas Menet, Andreas Krause, Abbas Rahimi

TL;DR
POETS introduces an efficient, uncertainty-aware policy ensemble framework for large language model optimization, achieving state-of-the-art results in scientific discovery and reinforcement learning tasks.
Contribution
POETS presents a compute-efficient policy ensemble architecture that captures epistemic uncertainty directly, without first training a separate uncertainty-aware reward model, and comes with theoretical regret guarantees.
Findings
Achieves state-of-the-art sample efficiency in scientific discovery domains.
Improves reinforcement learning optimization trajectories, especially off-policy.
Uses a shared backbone with independent LoRA branches to keep ensemble members diverse.
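The shared-backbone idea in the last finding can be illustrated with a toy sketch. This is not the authors' implementation: the single linear layer standing in for the backbone, the sizes `D_IN`, `D_OUT`, `RANK`, `K`, and the helpers `member_forward`, `all_members_forward`, and `thompson_pick` are all hypothetical names chosen for illustration. It only shows how several low-rank branches can share one frozen weight, and how picking one member at random gives an ensemble-style approximation to Thompson sampling.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; K is the number of ensemble members.
D_IN, D_OUT, RANK, K = 16, 8, 2, 4

# Shared, frozen weight standing in for one layer of a pretrained backbone.
W_shared = rng.normal(size=(D_OUT, D_IN))

# One independent low-rank adapter (A_k, B_k) per member.
# Assumption: both factors are randomly initialised so members differ from
# the start; standard LoRA would initialise B to zero.
A = rng.normal(scale=0.1, size=(K, RANK, D_IN))
B = rng.normal(scale=0.1, size=(K, D_OUT, RANK))


def member_forward(x, k):
    """Output of member k: shared weight plus that member's low-rank update."""
    return W_shared @ x + B[k] @ (A[k] @ x)


def all_members_forward(x):
    """Evaluate all members, computing the shared term only once."""
    shared = W_shared @ x                            # shape (D_OUT,)
    low_rank = np.einsum("kor,kri,i->ko", B, A, x)   # shape (K, D_OUT)
    return shared[None, :] + low_rank


def thompson_pick(rng, num_members=K):
    """Draw one member uniformly at random to act with for this round."""
    return int(rng.integers(num_members))


x = rng.normal(size=D_IN)
outputs = all_members_forward(x)
chosen = thompson_pick(rng)
print("chosen member:", chosen)
print("per-output disagreement (std across members):", outputs.std(axis=0))
```

The spread across members is the quantity an ensemble uses as a proxy for epistemic uncertainty; the extra memory per member is only the two small factors, not a full copy of the backbone.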
Abstract
Balancing exploration and exploitation is a core challenge in sequential decision-making and black-box optimization. We introduce POETS (Policy Ensembles for Thompson Sampling), a novel framework that bridges uncertainty quantification and policy optimization. Our approach is grounded in the insight that policies trained with Kullback-Leibler (KL) regularization implicitly encode an underlying reward function. Building on this, POETS bypasses the complex, nested process of training an uncertainty-aware reward model and separately fitting a policy to this model. Instead, we directly train a policy ensemble to capture epistemic uncertainty by matching implicitly encoded reward functions to online, bootstrapped data. To overcome the prohibitive compute and memory constraints of ensembling Large Language Models (LLMs), POETS utilizes an efficient…
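The abstract does not spell out how a KL-regularized policy encodes a reward, so the standard identity is sketched here for orientation. The symbols are assumptions, not the paper's notation: $\beta$ is the regularization strength, $\pi_{\mathrm{ref}}$ the reference policy, and $Z(x)$ the normalizing constant.

```latex
% KL-regularized objective for a prompt x and response y
\max_{\pi}\; \mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
  \;-\; \beta\, \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)

% Its maximizer is a reweighting of the reference policy
\pi^{*}(y \mid x) \;=\; \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\,
  \exp\!\Big( \tfrac{1}{\beta}\, r(x, y) \Big),
\qquad
Z(x) \;=\; \sum_{y} \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\Big( \tfrac{1}{\beta}\, r(x, y) \Big)

% Solving for the reward gives the implicit reward encoded by the policy
r(x, y) \;=\; \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} \;+\; \beta \log Z(x)
```

Under this reading, each ensemble member's log-ratio against the reference policy acts as its own reward estimate, up to a term that depends only on the prompt. How POETS fits these implicit rewards to bootstrapped data is not given in the truncated abstract.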
