Vector Policy Optimization: Training for Diversity Improves Test-Time Search
Ryan Bahlous-Boldi, Isha Puri, Idan Shenfeld, Akarsh Kumar, Mehul Damani, Sebastian Risi, Omar Khattab, Zhang-Wei Hong, Pulkit Agrawal

TL;DR
Vector Policy Optimization (VPO) is a novel RL algorithm that trains language models to produce diverse solutions by anticipating multiple reward functions, enhancing test-time search performance.
Contribution
VPO introduces a way to explicitly train models for diversity by leveraging vector-valued rewards, matching or outperforming scalar RL baselines on test-time search.
Findings
VPO matches or exceeds scalar RL baselines on test-time search.
VPO enables solving problems that scalar RL models cannot.
Diversity optimization may become standard in post-training objectives.
Abstract
Language models must now generalize out of the box to novel environments and work inside inference-scaling search procedures, such as AlphaEvolve, that select rollouts with a variety of task-specific reward functions. Unfortunately, the standard paradigm of LLM post-training optimizes a pre-specified scalar reward, often leading current LLMs to produce low-entropy response distributions and thus to struggle to display the diversity that inference-time search requires. We propose Vector Policy Optimization (VPO), an RL algorithm that explicitly trains policies to anticipate diverse downstream reward functions and to produce diverse solutions. VPO exploits the fact that rewards are often vector-valued in practice, for example per-test-case correctness in code generation, or scores from multiple user personas or reward models. VPO is essentially a drop-in replacement for the GRPO advantage…
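To make the scalar baseline concrete, here is a minimal sketch of the standard GRPO group-normalized advantage applied to per-test-case rewards that have been collapsed to a single pass rate. This is not VPO's estimator: the abstract above is truncated before it describes the replacement, so only the scalar side is shown. The function names, the example group, the use of the population standard deviation, and the `eps` value are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def scalarize(reward_vectors):
    """Collapse each per-test-case reward vector to one scalar (the pass rate)."""
    return [mean(v) for v in reward_vectors]


def grpo_advantages(scalar_rewards, eps=1e-8):
    """Group-normalized advantage: (r_i - group mean) / (group std + eps).

    Assumption: population standard deviation; implementations differ here.
    """
    mu = mean(scalar_rewards)
    sigma = pstdev(scalar_rewards)
    return [(r - mu) / (sigma + eps) for r in scalar_rewards]


# Hypothetical group of 4 rollouts for one prompt, each scored on 3 test cases.
group = [
    [1, 1, 0],  # passes tests 1 and 2
    [0, 0, 1],  # the only rollout that passes test 3
    [1, 1, 0],  # duplicate behaviour of rollout 0
    [0, 0, 0],  # passes nothing
]

scalars = scalarize(group)
advantages = grpo_advantages(scalars)

for i, (s, a) in enumerate(zip(scalars, advantages)):
    print(f"rollout {i}: pass rate {s:.2f}, advantage {a:+.2f}")
```

In this toy group, the second rollout is the only one that passes the third test case, yet its pass rate is below the group mean, so the scalar advantage pushes the policy away from it. That is the kind of information a scalar reward discards and a vector-valued reward retains.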
