Policy Optimization Prefers The Path of Least Resistance

Debdeep Sanyal; Aakash Sen Sharma; Dhruv Kumar; Saurabh Deshpande; Murari Mandal

arXiv:2510.21853·cs.CL·October 28, 2025


4 Reviews

TL;DR

This paper reveals that policy optimization in language models tends to favor the simplest reward pathways, often neglecting explicit reasoning when given flexibility, which poses challenges for alignment and complex reasoning tasks.

Contribution

The study systematically demonstrates that policy optimization prefers the path of least resistance, leading to a collapse of reasoning formats and highlighting the importance of constrained optimization for alignment.

Findings

01 · PO favors simple reward components first

02 · Flexible policies tend to discard explicit reasoning

03 · High-reward shortcuts emerge even with increased reward weights

Abstract

Policy optimization (PO) algorithms are used to refine Large Language Models for complex, multi-step reasoning. Current state-of-the-art pipelines enforce a strict think-then-answer format to elicit chain-of-thought (CoT); however, the behavior of PO when these rigid constraints are relaxed into an open-ended CoT structure remains an under-studied question. We investigate this gap with an extensive suite of controlled experiments and identify a consistent principle: *policy optimization consistently follows the path of least resistance*. When afforded the flexibility to interleave reasoning and response, policy optimization consistently learns to discard explicit reasoning, causing the policy to degenerate to a direct <answer>-only format. This outcome holds true across various models and algorithms. We find that this collapse in format is persistent even when the…
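To make the contrast between the strict and the relaxed format concrete, here is a minimal sketch of two format rewards. The `<think>` tag name, the regular expressions, and the `format_reward` helper are illustrative assumptions, not the paper's definitions; only the `<answer>` tag comes from the abstract. The sketch shows why an answer-only response can be a valid shortcut once interleaving is allowed.

```python
import re

# Hypothetical tag body: any non-empty text containing no <think>/<answer> tags.
BODY = r"(?:(?!</?(?:think|answer)>).)+"
THINK = rf"<think>{BODY}</think>\s*"
ANSWER = rf"<answer>{BODY}</answer>\s*"

# Strict: exactly one think block followed by exactly one answer block.
STRICT = re.compile(rf"\s*{THINK}{ANSWER}", re.DOTALL)
# Flexible: any interleaving of blocks that contains at least one answer block.
FLEXIBLE = re.compile(rf"\s*(?:{THINK})*(?:{ANSWER}(?:{THINK})*)+", re.DOTALL)


def format_reward(text: str, pattern: re.Pattern) -> float:
    """Return 1.0 if the whole response matches the format, else 0.0."""
    return 1.0 if pattern.fullmatch(text) else 0.0


reasoned = "<think>2 + 2 = 4</think><answer>4</answer>"
answer_only = "<answer>4</answer>"

for name, text in [("reasoned", reasoned), ("answer_only", answer_only)]:
    print(name, format_reward(text, STRICT), format_reward(text, FLEXIBLE))
```

Under these assumed patterns the answer-only response earns the full flexible-format reward while skipping the reasoning block entirely, which is the kind of low-effort path the abstract describes.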

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 2 · Confidence 3

Strengths

- The explicit inclusion of a regex in section 3.1 clarifies the target behavior (though the first regex appears to be missing a leading "^").
- The framing of rewarding formatting to probe basic optimization behavior is clean and targets core questions about PO dynamics.
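The reviewer's remark about the missing leading "^" can be illustrated with a small sketch. The patterns below are hypothetical stand-ins, not the regex from the paper's section 3.1; they only show how an unanchored pattern used with `re.search` accepts arbitrary text before the expected tag.

```python
import re

# Hypothetical patterns for illustration only.
UNANCHORED = re.compile(r"<answer>.*?</answer>$", re.DOTALL)
ANCHORED = re.compile(r"^<answer>.*?</answer>$", re.DOTALL)


def matches(pattern: re.Pattern, text: str) -> bool:
    """True if the pattern is found anywhere the pattern itself allows."""
    return pattern.search(text) is not None


clean = "<answer>42</answer>"
padded = "Unstructured text first. <answer>42</answer>"

print(matches(UNANCHORED, clean), matches(ANCHORED, clean))    # True True
print(matches(UNANCHORED, padded), matches(ANCHORED, padded))  # True False
```

Whether this matters in practice depends on how the pattern is applied: with `re.fullmatch` or `re.match` the start is anchored implicitly, so the concern applies mainly to search-style matching.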

Weaknesses

**Scope/fit of Table 1.** Table 1 appears only loosely related to the claimed "easy-before-hard" phenomenon. It contrasts RL accuracies under simple vs. complex prompts, but the runs are presumably trained to convergence. Presenting R_strict and R_regex as "rewards" is confusing in this context: it is unclear whether these are multiplied by task correctness during training (and, from the wording, whether the model effectively chooses between them via a max over indicators).

**Underspecified "di

Reviewer 02 · Rating 2 · Confidence 4

Strengths

## Strength 1

The paper tests across five distinct model families (Gemma-3, Qwen-2.5, Llama-3.1, Ministral, Yi), spanning 4B to 24B parameters, and validates findings across three different policy optimization algorithms (DAPO, Dr. GRPO, REINFORCE++). This multi-axis validation significantly strengthens claims beyond a single-algorithm, single-model observation. Also, the paper uses established benchmarks (GSM8K, rStar-Coder, ReClor, planning mysteries) across three meaningful reasoning domains

Weaknesses

## Weakness 1

The paper only tests policy optimization algorithms for LLMs and it remains unclear whether this "preference for least resistance" generalizes to supervised fine-tuning, other RL algorithms, or other domains entirely. This severely limits the generality of the claimed "fundamental principle." In addition, the paper does not adequately address whether models collapse to simple answer formats because structured reasoning is simply harder to learn in sparse reward settings, rather th

Reviewer 03 · Rating 2 · Confidence 3

Strengths

1. The paper demonstrates that PO systematically favors simpler, easier-to-learn behaviors when multiple valid reward paths exist.
2. Results are consistent across five LLM families, three reasoning domains, and three PO algorithms.

Weaknesses

1. The method is conceptually simple but lacks technical innovation. The paper mostly reports empirical observations that may not generalize to all RL methods. For example, many works omit the format reward and retain only the accuracy reward as the reward function (e.g., Skywork OR1).
2. It would be interesting to see how the proposed method compares to LLM-as-a-judge.
3. There is no related work section in the main paper. Including it in the main text would provide essential context, clarify

Reviewer 04 · Rating 4 · Confidence 4

Strengths

The *Principle of Least Resistance* is convincingly demonstrated across a broad range of setups, including five model families (4B–24B), three reasoning domains (math, code, logic), and three PO algorithms (GRPO, DAPO, REINFORCE++). The analysis linking exploration freedom (high KL) to the discovery and exploitation of shortcuts offers an insightful perspective for diagnosing and mitigating reward hacking.

Weaknesses

While the paper clearly establishes that the principle exists and what conditions enable it (high KL), the underlying mechanism, “why simpler rewards produce stronger and more stable gradients”, is only asserted, not deeply examined. The claim that these rewards provide the “most powerful and stable gradient signal” would be much stronger with a formal analysis or empirical comparison of gradient magnitudes and variances. In Figure 2, the Reward Hierarchy experiments show that the complex rewar
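As a rough illustration of the kind of gradient-magnitude comparison the reviewer asks for, the sketch below uses a toy softmax bandit, not the paper's setup. All names (`expected_reward_grad`, `grad_norm`, the `easy` and `hard` reward vectors) are hypothetical. It computes the exact gradient of the expected reward with respect to the logits for a reward that many actions satisfy and for one that a single action satisfies. It covers magnitude only, not variance.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_reward_grad(logits, rewards):
    """Exact gradient of J = sum_a pi(a) * r(a) w.r.t. the softmax logits.

    For a softmax policy, dJ/dtheta_k = pi_k * (r_k - J).
    """
    pi = softmax(logits)
    j = sum(p * r for p, r in zip(pi, rewards))
    return [p * (r - j) for p, r in zip(pi, rewards)]


def grad_norm(logits, rewards):
    """Euclidean norm of the expected-reward gradient."""
    return math.sqrt(sum(g * g for g in expected_reward_grad(logits, rewards)))


K = 100
logits = [0.0] * K                      # uniform initial policy (assumption)
easy = [1.0] * 50 + [0.0] * 50          # toy "easy" reward: half the actions succeed
hard = [1.0] + [0.0] * (K - 1)          # toy "hard" reward: one action succeeds

print(grad_norm(logits, easy), grad_norm(logits, hard))
```

In this toy case a uniform policy with success probability p gives a gradient norm of sqrt(p(1 - p) / K), so the frequently satisfied reward produces the larger signal at initialization. Whether the same holds for sequence-level rewards in LLM training is exactly what the reviewer says remains to be shown.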
