Post-Training with Policy Gradients: Optimality and the Base Model Barrier
Alireza Mousavi-Hosseini, Murat A. Erdogdu

TL;DR
This paper analyzes post-training of autoregressive models with policy gradient methods. It shows when these methods are near-optimal, identifies a barrier to going beyond the base model's support, and proposes process rewards to overcome that barrier.
Contribution
It introduces a theoretical framework for post-training with policy gradients, reveals a barrier to going beyond the base model's support, and proposes process rewards to overcome this limitation.
Findings
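On test samples where the base model already achieves a non-trivial likelihood, a policy gradient variant can reach near-perfect likelihood with an essentially minimax optimal number of reward queries.
With outcome rewards, going beyond the base model's support hits a barrier: PG variants may require exponentially many reward queries.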
Using process rewards, PG variants can avoid the curse of dimensionality in sequence length.
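The gap between the last two findings can be illustrated with a toy counting argument. This is not the paper's construction or proof, only a minimal sketch under assumptions that are not in the source: a base model that is uniform over a vocabulary of `vocab_size` tokens, exactly one correct response of length `length`, and a process reward that confirms each token in turn. The function names are hypothetical.

```python
from fractions import Fraction


def outcome_queries_uniform(vocab_size: int, length: int) -> int:
    """Expected number of sampled responses until the first rewarded one.

    Assumption: all vocab_size**length sequences are equally likely under the
    base model and exactly one of them is correct.
    """
    likelihood = Fraction(1, vocab_size ** length)
    # The mean of a geometric distribution with success probability p is 1/p.
    return int(1 / likelihood)


def process_queries_uniform(vocab_size: int, length: int) -> int:
    """Worst-case number of token-level checks with a per-token reward.

    Assumption: the process reward confirms one position at a time, so at most
    vocab_size guesses are needed per position.
    """
    return vocab_size * length


if __name__ == "__main__":
    for n in (5, 10, 20):
        print(n, outcome_queries_uniform(4, n), process_queries_uniform(4, n))
```

In this toy setting the outcome-reward count grows exponentially with the sequence length while the process-reward count grows linearly, which is the kind of separation the findings describe.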
Abstract
We study post-training linear autoregressive models with outcome and process rewards. Given a context, the model must predict the response, a fixed-length sequence that satisfies a margin condition, an extension of the standard separability to sequences. We prove that on test samples where the base model achieves a non-trivial likelihood, a variant of policy gradient (PG) can achieve near-perfect likelihood with an essentially minimax optimal number of reward queries. However, a barrier arises for going beyond the support of the base model. We prove that the overall expected error after post-training with outcome rewards is governed by a property of the base model called the Likelihood Quantile (LQ), and that variants of PG, while minimax optimal, may require a number…
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Machine Learning and Algorithms · Stochastic Gradient Optimization Techniques
