UCPO: Uncertainty-Aware Policy Optimization

Xianzhou Zeng; Jing Huang; Chunmei Xie; Gongrui Nan; Siye Chen; Mengyu Lu; Weiqi Xiong; Qixuan Zhou; Junhao Zhang; Qiang Zhu; Yadong Li; Xingzhong Xu

arXiv:2601.22648·cs.AI·February 2, 2026

Open Access

TL;DR

This paper introduces UCPO, a novel reinforcement learning framework that enhances the reliability and calibration of large language models by addressing advantage bias and dynamically adjusting uncertainty rewards.

Contribution

UCPO proposes Ternary Advantage Decoupling and Dynamic Uncertainty Reward Adjustment to improve uncertainty handling in RL for LLMs, reducing bias and overconfidence.

Findings

01 UCPO outperforms existing methods in mathematical reasoning tasks.

02 It significantly improves model calibration and reliability.

03 The framework effectively balances reward signals in uncertain environments.

Abstract

The key to building trustworthy Large Language Models (LLMs) lies in endowing them with inherent uncertainty expression capabilities to mitigate the hallucinations that restrict their high-stakes applications. However, existing RL paradigms such as GRPO often suffer from Advantage Bias due to binary decision spaces and static uncertainty rewards, inducing either excessive conservatism or overconfidence. To tackle this challenge, this paper unveils the root causes of reward hacking and overconfidence in current RL paradigms incorporating uncertainty-based rewards, based on which we propose the UnCertainty-Aware Policy Optimization (UCPO) framework. UCPO employs Ternary Advantage Decoupling to separate and independently normalize deterministic and uncertain rollouts, thereby eliminating advantage bias. Furthermore, a Dynamic Uncertainty Reward Adjustment mechanism is introduced to…
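The idea of separating deterministic and uncertain rollouts and normalizing each group independently can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the exact formulas, so the GRPO-style normalization `(r - mean) / (std + eps)`, the function names `group_normalize` and `decoupled_advantages`, the `is_uncertain` flag, and the reward values are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_normalize(rewards, eps=1e-6):
    """GRPO-style normalization over one group: (r - mean) / (std + eps).

    Assumed form; the abstract does not state the exact normalization.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation
    return [(r - mu) / (sigma + eps) for r in rewards]


def decoupled_advantages(rewards, is_uncertain, eps=1e-6):
    """Normalize deterministic and uncertain rollouts as two separate groups.

    `is_uncertain[i]` is a hypothetical flag marking rollouts in which the
    model expressed uncertainty instead of committing to an answer.
    """
    advantages = [0.0] * len(rewards)
    for flag in (False, True):
        idx = [i for i, u in enumerate(is_uncertain) if u == flag]
        normed = group_normalize([rewards[i] for i in idx], eps)
        for i, a in zip(idx, normed):
            advantages[i] = a
    return advantages


# Hypothetical rollout group: three committed answers (correct, wrong,
# correct) and two "uncertain" responses with a static reward of 0.3.
rewards = [1.0, 0.0, 1.0, 0.3, 0.3]
is_uncertain = [False, False, False, True, True]

joint = group_normalize(rewards)
decoupled = decoupled_advantages(rewards, is_uncertain)
```

In this toy group, joint normalization gives the uncertain rollouts a negative advantage simply because their static reward sits below the group mean, while the decoupled version gives them zero advantage and scores committed answers only against each other. Whether this matches the advantage bias the paper analyzes is an assumption of the sketch.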

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Big Data and Digital Economy