OptPO: Optimal Rollout Allocation for Test-time Policy Optimization

Youkang Wang; Jian Wang; Rubing Chen; Tianyi Zeng; Xiao-Yong Wei; Qing Li

arXiv:2512.02882 · cs.LG · December 3, 2025

TL;DR

OptPO is an adaptive Bayesian method for test-time policy optimization of large language models that reduces rollout overhead while maintaining or improving accuracy.

Contribution

The paper presents a Bayesian sequential testing framework for adaptive rollout allocation that integrates with existing policy optimization algorithms such as PPO or GRPO and requires no ground-truth labels.

Findings

01 Significantly reduces rollout overhead across diverse reasoning benchmarks

02 Maintains or improves accuracy compared to fixed-sample methods

03 Provides a unified, statistically principled stopping rule for test-time learning

Abstract

Test-time policy optimization enables large language models (LLMs) to adapt to distribution shifts by leveraging feedback from self-generated rollouts. However, existing methods rely on fixed-budget majority voting to estimate rewards, incurring substantial computational redundancy. We propose Optimal Rollout Allocation for Test-time Policy Optimization (OptPO), a principled framework that adaptively allocates inference budgets. By formulating the voting process as a Bayesian sequential probability ratio test, OptPO dynamically halts sampling once the posterior confidence in a consensus answer exceeds a specified threshold. Crucially, it utilizes the retained rollouts for on-policy updates, seamlessly integrating with algorithms like PPO or GRPO without requiring ground-truth labels. Across diverse reasoning benchmarks, OptPO significantly reduces rollout overhead compared to…
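The stopping rule described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's exact test: it swaps in a simple Beta-Binomial posterior, with a uniform prior, on the leading answer's share of votes against the runner-up, and halts sampling once the posterior probability that the leader holds the majority reaches a threshold. The names `lead_posterior`, `adaptive_vote`, and `sample_answer`, and the parameters `threshold`, `min_rollouts`, and `max_rollouts`, are all hypothetical.

```python
from collections import Counter
from math import comb
from typing import Callable, Hashable, List, Tuple


def lead_posterior(lead: int, runner_up: int) -> float:
    """P(theta > 0.5 | votes), where theta is the leader's share of votes
    among the top two answers and the prior on theta is uniform Beta(1, 1).

    The posterior is Beta(lead + 1, runner_up + 1). For integer parameters
    its tail above 0.5 equals a binomial sum, so no SciPy is needed.
    """
    n = lead + runner_up + 1
    return sum(comb(n, j) for j in range(lead + 1)) / 2 ** n


def adaptive_vote(
    sample_answer: Callable[[], Hashable],
    threshold: float = 0.95,
    min_rollouts: int = 2,
    max_rollouts: int = 32,
) -> Tuple[Hashable, List[Hashable]]:
    """Draw rollouts one at a time and stop once the posterior confidence in
    the leading answer reaches `threshold` (an illustrative stand-in for the
    paper's sequential test, not its actual formulation)."""
    if max_rollouts < 1:
        raise ValueError("max_rollouts must be at least 1")

    answers: List[Hashable] = []
    counts: Counter = Counter()

    while len(answers) < max_rollouts:
        answer = sample_answer()  # one rollout from the current policy
        answers.append(answer)
        counts[answer] += 1

        if len(answers) < min_rollouts:
            continue

        top = counts.most_common(2)
        lead = top[0][1]
        runner_up = top[1][1] if len(top) > 1 else 0
        if lead_posterior(lead, runner_up) >= threshold:
            break  # confident enough: stop spending rollouts

    consensus = counts.most_common(1)[0][0]
    return consensus, answers
```

With a sampler that always returns the same answer and `threshold=0.95`, this sketch stops after four rollouts, because the posterior after k unanimous votes is 1 - 2^-(k+1) and 1 - 2^-5 ≈ 0.969. The retained answers could then be scored against the consensus, for example `[float(a == consensus) for a in answers]`, to obtain label-free rewards for a PPO or GRPO update; that scoring scheme is an assumption here, not a detail taken from the abstract.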

Taxonomy

Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Machine Learning and Algorithms