Post-Training with Policy Gradients: Optimality and the Base Model Barrier

Alireza Mousavi-Hosseini; Murat A. Erdogdu

arXiv:2603.06957 · stat.ML · March 10, 2026

Open Access

TL;DR

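This paper analyzes policy gradient post-training for linear autoregressive models. It shows that a policy gradient variant is essentially minimax optimal on samples where the base model already has non-trivial likelihood, identifies a barrier to going beyond the base model's support with outcome rewards, and proposes process rewards to overcome that barrier.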

Contribution

It introduces a theoretical framework for post-training with policy gradients, shows that going beyond the base model's support faces a barrier, and proposes process rewards to overcome this limitation.

Findings

01

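On test samples where the base model already achieves non-trivial likelihood, a policy gradient variant can reach near-perfect likelihood with an essentially minimax optimal number of reward queries.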

02

Going beyond the base model's support faces a barrier and can require exponentially many reward queries.

03

Using process rewards, PG variants can avoid the curse of dimensionality in sequence length.
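
The contrast between findings 02 and 03 can be illustrated with a toy counting argument. This is only a sketch under strong assumptions and is not the paper's construction or proof: it assumes a base policy that is uniform over all sequences, an outcome reward that is 1 only when the whole sequence is correct, and a process reward that reports whether each token is correct. The function names are hypothetical.

```python
def expected_outcome_queries(vocab_size: int, seq_len: int) -> int:
    """Expected samples until a uniform policy hits the single correct sequence.

    Each sample is correct with probability vocab_size ** (-seq_len), so the
    number of samples is geometric with mean vocab_size ** seq_len.
    """
    return vocab_size ** seq_len


def worst_case_process_queries(vocab_size: int, seq_len: int) -> int:
    """Worst-case queries when every token gets its own correct/incorrect signal.

    Each position can be resolved by trying at most vocab_size tokens, so the
    total grows linearly, not exponentially, in seq_len.
    """
    return vocab_size * seq_len


if __name__ == "__main__":
    for n in (5, 10, 20):
        print(n, expected_outcome_queries(4, n), worst_case_process_queries(4, n))
```

In this toy setting the correct sequence has base likelihood equal to `vocab_size ** (-seq_len)`, so the outcome-reward count is exactly the inverse of that likelihood, while the token-level count stays linear in the sequence length.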

Abstract

We study post-training linear autoregressive models with outcome and process rewards. Given a context $x$, the model must predict the response $y \in Y^{N}$, a sequence of length $N$ that satisfies a $\gamma$ margin condition, an extension of the standard separability to sequences. We prove that on test samples where the base model achieves a non-trivial likelihood $\alpha$, a variant of policy gradient (PG) can achieve likelihood $1 - \epsilon$ with an essentially minimax optimal number of reward queries $\tilde{O}((\alpha^{-1} + \epsilon^{-1}) / \gamma^{2})$. However, a barrier arises for going beyond the support of the base model. We prove that the overall expected error after post-training with outcome rewards is governed by a property of the base model called the Likelihood Quantile (LQ), and that variants of PG, while minimax optimal, may require a number…
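
To get a feel for the query bound stated in the abstract, the following sketch evaluates its leading term, (1/α + 1/ε) / γ², for a few values. The function name `query_scaling` and the sample values are illustrative assumptions. The Õ notation hides constants and logarithmic factors, so the outputs indicate scaling only, not actual query counts.

```python
def query_scaling(alpha: float, eps: float, gamma: float) -> float:
    """Leading term (1/alpha + 1/eps) / gamma**2 of the stated bound.

    Constants and logarithmic factors hidden by the O-tilde are dropped,
    so this is a scaling indicator, not a query count.
    """
    if not (0.0 < alpha <= 1.0 and 0.0 < eps <= 1.0 and gamma > 0.0):
        raise ValueError("need 0 < alpha <= 1, 0 < eps <= 1, gamma > 0")
    return (1.0 / alpha + 1.0 / eps) / gamma ** 2


if __name__ == "__main__":
    # Lower base-model likelihood alpha drives the term up like 1/alpha.
    for alpha in (0.5, 0.1, 0.01):
        print(alpha, query_scaling(alpha, eps=0.05, gamma=0.5))
```

The term grows as the base likelihood α shrinks, which is consistent with the barrier described for samples the base model rarely or never generates.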

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Machine Learning and Algorithms · Stochastic Gradient Optimization Techniques