EVPO: Explained Variance Policy Optimization for Adaptive Critic Utilization in LLM Post-Training

Chengjun Pan; Shichun Liu; Jiahang Lin; Dingwei Zhu; Jiazheng Zhang; Shihan Dou; Songyang Gao; Zhenhua Han; Binghai Wang; Rui Zheng; Xuanjing Huang; Tao Gui; Yansong Feng

arXiv:2604.19485 · cs.LG · April 22, 2026


TL;DR

This paper introduces EVPO, an adaptive policy optimization method for LLM post-training. It switches dynamically between a critic-based baseline and a mean baseline for advantage estimation, using the critic's explained variance as the switching signal, with the aim of improving training stability and performance.

Contribution

The paper unifies critic-based and critic-free RL methods through a Kalman filtering perspective and proposes EVPO, which adaptively selects, during training, whichever baseline yields the lower advantage variance.
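One way to picture the "two extremes" framing is a scalar Kalman-style update that blends a mean baseline with the critic's estimate. The notation below is an assumption made for illustration and is not taken from the paper: $\bar{R}$ is the mean return, $V(s)$ is the critic's value estimate, and $K \in [0, 1]$ plays the role of the gain.

```latex
% Illustrative sketch only; symbols are assumed, not the paper's notation.
b_K(s) = \bar{R} + K \bigl( V(s) - \bar{R} \bigr), \qquad K \in [0, 1]

% K = 0: the critic is ignored and the baseline is the mean (critic-free, GRPO-style).
b_0(s) = \bar{R}

% K = 1: the critic is fully trusted and the baseline is its estimate (PPO-style).
b_1(s) = V(s)
```

Under this reading, choosing a baseline amounts to choosing how much weight the critic's estimate deserves relative to the mean.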

Findings

01. EVPO outperforms PPO and GRPO across multiple tasks.

02. Batch-level explained variance effectively guides baseline switching.

03. The zero EV threshold is empirically optimal for variance control.
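The role of the zero threshold follows from the standard definition of explained variance. The short derivation below is a sketch under assumed notation ($R$ for returns, $V$ for critic predictions, $\bar{R}$ for the batch-mean return); the paper's own formulation may differ in detail.

```latex
% Standard explained variance of critic predictions V against returns R.
\mathrm{EV} = 1 - \frac{\operatorname{Var}(R - V)}{\operatorname{Var}(R)}

% Advantage variance under each baseline (subtracting a constant leaves variance unchanged).
\operatorname{Var}(A_{\text{mean}}) = \operatorname{Var}(R - \bar{R}) = \operatorname{Var}(R),
\qquad
\operatorname{Var}(A_{\text{critic}}) = \operatorname{Var}(R - V)

% Hence the sign of EV decides which baseline gives the lower variance.
\mathrm{EV} > 0 \iff \operatorname{Var}(A_{\text{critic}}) < \operatorname{Var}(A_{\text{mean}})
```

At $\mathrm{EV} = 0$ the two variances coincide, and for $\mathrm{EV} < 0$ the critic baseline yields the higher variance, which matches the boundary described in the abstract.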

Abstract

Reinforcement learning (RL) for LLM post-training faces a fundamental design choice: whether to use a learned critic as a baseline for policy optimization. Classical theory favors critic-based methods such as PPO for variance reduction, yet critic-free alternatives like GRPO have gained widespread adoption due to their simplicity and competitive performance. We show that in sparse-reward settings, a learned critic can inject estimation noise that exceeds the state signal it captures, increasing rather than reducing advantage variance. By casting baseline selection as a Kalman filtering problem, we unify PPO and GRPO as two extremes of the Kalman gain and prove that explained variance (EV), computable from a single training batch, identifies the exact boundary: positive EV indicates the critic reduces variance, while zero or negative EV signals that it inflates variance. Building on this…
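The switching rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `explained_variance` and `select_advantages` are hypothetical, the sketch works on one scalar return per sample rather than token-level estimates, and treating a zero-variance batch as EV = 0 is an assumption.

```python
from statistics import fmean, pvariance


def explained_variance(returns, values):
    """Batch-level EV = 1 - Var(returns - values) / Var(returns)."""
    var_returns = pvariance(returns)
    if var_returns == 0.0:
        # Assumption: EV is undefined when all returns are equal,
        # so report 0.0, which falls back to the mean baseline.
        return 0.0
    residuals = [r - v for r, v in zip(returns, values)]
    return 1.0 - pvariance(residuals) / var_returns


def select_advantages(returns, values, threshold=0.0):
    """Pick the baseline by EV; returns (advantages, mode, ev)."""
    ev = explained_variance(returns, values)
    if ev > threshold:
        # Positive EV: the critic baseline lowers advantage variance.
        advantages = [r - v for r, v in zip(returns, values)]
        return advantages, "critic", ev
    # Zero or negative EV: use the batch-mean baseline instead.
    mean_return = fmean(returns)
    return [r - mean_return for r in returns], "mean", ev


if __name__ == "__main__":
    returns = [1.0, 0.0, 1.0, 0.0]        # sparse binary rewards
    good_critic = [0.9, 0.1, 0.8, 0.2]    # tracks the returns closely
    noisy_critic = [-1.0, 2.0, 0.0, 1.5]  # mostly estimation noise
    for name, values in [("good", good_critic), ("noisy", noisy_critic)]:
        _, mode, ev = select_advantages(returns, values)
        print(f"{name} critic: EV={ev:.3f}, baseline={mode}")
```

With the accurate critic the EV is positive and the critic baseline is used; with the noisy critic the EV is negative and the sketch falls back to the batch mean.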
