TL;DR
This paper introduces $V_0$, a versatile value model that predicts LLM performance on unseen prompts without retraining, improving efficiency in model training and deployment.
Contribution
It proposes a novel context-based value estimation approach at State Zero, eliminating the need for parameter updates and enabling effective LLM routing.
Findings
$V_0$ outperforms heuristic budget allocation methods.
It achieves a Pareto-optimal trade-off between performance and cost.
It enables efficient model routing without frequent retraining.
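To make the routing finding concrete, here is a minimal sketch of how a predicted success rate could drive a cost-aware routing decision. It is an illustration under stated assumptions, not the paper's procedure: the function `route`, the threshold `min_success`, the model names, and all numbers are hypothetical, and the predicted success rates stand in for whatever a value model such as $V_0$ would output.

```python
def route(predicted_success, cost, min_success=0.7):
    """Pick the cheapest model whose predicted success rate meets the threshold.

    predicted_success: dict of model name -> predicted success rate in [0, 1]
                       (hypothetical stand-in for a value model's output).
    cost:              dict of model name -> relative cost per query (hypothetical).
    min_success:       assumed acceptance threshold, not taken from the paper.
    """
    eligible = [m for m, p in predicted_success.items() if p >= min_success]
    if eligible:
        # Among models expected to succeed, prefer the cheapest one.
        return min(eligible, key=lambda m: cost[m])
    # No model clears the threshold: fall back to the most promising one.
    return max(predicted_success, key=predicted_success.get)


# Hypothetical values for one prompt.
predicted = {"small": 0.4, "medium": 0.8, "large": 0.95}
costs = {"small": 1, "medium": 4, "large": 20}
print(route(predicted, costs))  # medium
```

Raising `min_success` shifts traffic toward larger models, and lowering it shifts traffic toward cheaper ones, which is one simple way to move along a performance/cost trade-off curve.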
Abstract
Policy gradient methods rely on a baseline to measure the relative advantage of an action, ensuring the model reinforces behaviors that outperform its current average capability. In the training of Large Language Models (LLMs) using Actor-Critic methods (e.g., PPO), this baseline is typically estimated by a Value Model (Critic) often as large as the policy model itself. However, as the policy continuously evolves, the value model requires expensive, synchronous incremental training to accurately track the shifting capabilities of the policy. To avoid this overhead, Group Relative Policy Optimization (GRPO) eliminates the coupled value model by using the average reward of a group of rollouts as the baseline; yet, this approach necessitates extensive sampling to maintain estimation stability. In this paper, we propose $V_0$, a Generalist Value Model capable of estimating the expected…
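The two baseline choices contrasted in the abstract can be sketched in a few lines. This is a simplified illustration, not the paper's implementation: the function names and reward values are hypothetical, rewards are treated as one scalar per rollout, and the group baseline is the plain mean described above, with no further normalization.

```python
from statistics import mean


def group_relative_advantages(rewards):
    """GRPO-style baseline: the mean reward of a group of rollouts for one prompt."""
    if not rewards:
        raise ValueError("need at least one rollout reward")
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def value_baseline_advantages(rewards, predicted_value):
    """Critic-style baseline: a value estimate supplied by a separate model.

    predicted_value is a hypothetical stand-in for a value model's output.
    """
    return [r - predicted_value for r in rewards]


# Four rollouts for one prompt with binary rewards (hypothetical values).
rewards = [1.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(rewards))        # [0.5, -0.5, -0.5, 0.5]
print(value_baseline_advantages(rewards, 0.25))  # [0.75, -0.25, -0.25, 0.75]
```

The group mean is only as reliable as the number of rollouts behind it, which is why a small group gives a noisy baseline; a predicted value needs no extra rollouts but is only as good as the model that produces it.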
