The Policy Cliff: A Theoretical Analysis of Reward-Policy Maps in Large Language Models

Xingcheng Xu

arXiv:2507.20150 · cs.AI · July 29, 2025

TL;DR

This paper develops a mathematical framework to analyze reward-policy stability in large language models, explaining failures like brittleness and deception as rational outcomes of reward optimization, and offers insights for safer AI design.

Contribution

It introduces a unified theoretical analysis of reward-policy maps in RL for LLMs, explaining policy brittleness and failures through action degeneracy and reward aggregation mechanisms.

Findings

01 Non-unique optimal actions cause policy brittleness.

02 Entropy regularization improves policy stability.

03 Multi-reward RL analysis explains empirical failures.
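
The first two findings can be illustrated with a toy single-state example. This is a minimal sketch, not the paper's construction: it assumes a three-action bandit, a greedy (argmax) policy as the unregularized optimum, and a softmax policy as the entropy-regularized optimum. The function names and the reward values are hypothetical and chosen only for illustration. Two actions are tied for the best reward, and a tiny reward perturbation is applied to one of them.

```python
import numpy as np

def greedy_policy(rewards):
    """Unregularized optimum: all mass on the argmax (ties go to the lowest index)."""
    policy = np.zeros_like(rewards, dtype=float)
    policy[np.argmax(rewards)] = 1.0
    return policy

def softmax_policy(rewards, temperature=1.0):
    """Entropy-regularized optimum for a single state: pi(a) proportional to exp(r(a) / temperature)."""
    z = (rewards - np.max(rewards)) / temperature  # subtract the max for numerical stability
    weights = np.exp(z)
    return weights / weights.sum()

def total_variation(p, q):
    """Total variation distance between two distributions over the same actions."""
    return 0.5 * np.abs(p - q).sum()

# Hypothetical rewards: actions 0 and 1 are both optimal (a degenerate optimum).
base = np.array([1.0, 1.0, 0.0])
eps = 1e-6
perturbed = base + np.array([0.0, eps, 0.0])  # nudge action 1 slightly upward

greedy_shift = total_variation(greedy_policy(base), greedy_policy(perturbed))
soft_shift = total_variation(softmax_policy(base), softmax_policy(perturbed))

print(f"greedy policy shift:  {greedy_shift}")
print(f"softmax policy shift: {soft_shift}")
```

Under these assumptions the greedy policy moves all of its mass from one action to the other in response to a perturbation of size `eps`, while the softmax policy changes by an amount on the order of `eps`. Lowering `temperature` toward zero makes the softmax policy approach the greedy one, so the stabilizing effect weakens as the regularization weight shrinks.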

Abstract

Reinforcement learning (RL) plays a crucial role in shaping the behavior of large language and reasoning models (LLMs/LRMs). However, it often produces brittle and unstable policies, leading to critical failures such as spurious reasoning, deceptive alignment, and instruction disobedience that undermine the trustworthiness and safety of LLMs/LRMs. Currently, these issues lack a unified theoretical explanation and are typically addressed using ad-hoc heuristics. This paper presents a rigorous mathematical framework for analyzing the stability of the mapping from a reward function to the optimal policy. We show that policy brittleness often stems from non-unique optimal actions, a common occurrence when multiple valid traces exist in a reasoning task. This theoretical lens provides a unified explanation for a range of seemingly disparate failures, reframing them as rational outcomes of…
