The Policy Cliff: A Theoretical Analysis of Reward-Policy Maps in Large Language Models
Xingcheng Xu

TL;DR
This paper develops a mathematical framework to analyze reward-policy stability in large language models, explaining failures like brittleness and deception as rational outcomes of reward optimization, and offers insights for safer AI design.
Contribution
It introduces a unified theoretical analysis of reward-policy maps in RL for LLMs, explaining policy brittleness and failures through action degeneracy and reward aggregation mechanisms.
Findings
Non-unique optimal actions cause policy brittleness.
Entropy regularization improves policy stability.
Multi-reward RL analysis explains empirical failures.
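The first two findings can be illustrated with a toy one-step problem. This is a minimal sketch, not the paper's construction: the three-action reward vectors, the `eps` perturbation, the temperature of 0.5, and all function names are assumptions chosen for illustration. It relies only on the standard fact that the entropy-regularized optimum of a one-step problem is a softmax over rewards. When two actions are nearly tied, a tiny reward change flips the greedy policy entirely, while the softmax policy barely moves.

```python
import math

def greedy_policy(rewards):
    """Deterministic argmax policy: all mass on the first highest-reward action."""
    best = max(range(len(rewards)), key=lambda i: rewards[i])
    return [1.0 if i == best else 0.0 for i in range(len(rewards))]

def softmax_policy(rewards, temperature):
    """Entropy-regularized one-step optimum: pi(a) proportional to exp(r(a) / temperature)."""
    m = max(rewards)  # subtract the max for numerical stability
    weights = [math.exp((r - m) / temperature) for r in rewards]
    total = sum(weights)
    return [w / total for w in weights]

def l1_distance(p, q):
    """L1 distance between two probability vectors (ranges from 0 to 2)."""
    return sum(abs(a - b) for a, b in zip(p, q))

# Hypothetical rewards: actions 0 and 1 are nearly tied, and a tiny
# perturbation (eps) changes which of them is strictly best.
eps = 1e-6
r_a = [1.0 + eps, 1.0, 0.0]
r_b = [1.0, 1.0 + eps, 0.0]

# Greedy policy jumps from action 0 to action 1: maximal L1 shift.
greedy_shift = l1_distance(greedy_policy(r_a), greedy_policy(r_b))

# Softmax policy changes by an amount on the order of eps / temperature.
soft_shift = l1_distance(softmax_policy(r_a, 0.5), softmax_policy(r_b, 0.5))

print(f"greedy policy shift:  {greedy_shift}")
print(f"softmax policy shift: {soft_shift:.2e}")
```

Lowering the temperature toward zero makes the softmax policy approach the greedy one, so the stability gained from regularization shrinks accordingly.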
Abstract
Reinforcement learning (RL) plays a crucial role in shaping the behavior of large language and reasoning models (LLMs/LRMs). However, it often produces brittle and unstable policies, leading to critical failures such as spurious reasoning, deceptive alignment, and instruction disobedience that undermine the trustworthiness and safety of LLMs/LRMs. Currently, these issues lack a unified theoretical explanation and are typically addressed using ad-hoc heuristics. This paper presents a rigorous mathematical framework for analyzing the stability of the mapping from a reward function to the optimal policy. We show that policy brittleness often stems from non-unique optimal actions, a common occurrence when multiple valid traces exist in a reasoning task. This theoretical lens provides a unified explanation for a range of seemingly disparate failures, reframing them as rational outcomes of…
