Reshaping Reasoning in LLMs: A Theoretical Analysis of RL Training Dynamics through Pattern Selection

Xingwu Chen; Tianle Li; Difan Zou

arXiv:2506.04695·cs.LG·September 30, 2025

Open Access · 3 Reviews

TL;DR

This paper provides a theoretical and empirical analysis of reinforcement learning training dynamics in large language models, revealing how RL reshapes reasoning patterns and highlighting the importance of reward types and initial model quality.

Contribution

It offers a novel theoretical framework for understanding RL training in LLMs, focusing on pattern selection and the effects of different reward signals.

Findings

01 · RL primarily optimizes critical tokens, reshaping reasoning patterns.

02 · Model's initial reasoning quality influences convergence behavior.

03 · Internal rewards can improve or degrade performance over training.

Abstract

While reinforcement learning (RL) has demonstrated remarkable success in enhancing the reasoning capabilities of language models, the training dynamics of RL in LLMs remain unclear. In this work, we provide an explanation of the RL training process through empirical analysis and rigorous theoretical modeling. First, through systematic reasoning-pattern-level and token-level analysis across the RL training process, we show that while different reasoning patterns exhibit relatively stable success rates during training, RL primarily optimizes a sparse subset of critical tokens, thereby reshaping reasoning pattern distributions to affect model performance. Building on these empirical insights, we develop a theoretical framework to understand the training dynamics of RL with two typical rewards: verifiable reward (RLVR) and model's internal feedback (RLIF). For RLVR, we analyze the training…
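The pattern-selection idea in the abstract can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the paper's model: the three patterns, their fixed success rates, the softmax parameterization, and the names `softmax` and `train_pattern_policy` are all hypothetical. Each pattern keeps a constant success rate, and an exact expected policy-gradient step on a verifiable 0/1 reward only changes how often each pattern is chosen.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_pattern_policy(success_rates, logits, lr=0.4, steps=2000):
    """Gradient ascent on expected accuracy J = sum_k pi_k * s_k.

    For a softmax policy, dJ/dtheta_k = pi_k * (s_k - J), so patterns that
    succeed more often than the current average gain probability mass.
    The success rates s_k are held fixed (an assumption of this toy model).
    """
    logits = list(logits)
    history = []
    for _ in range(steps):
        probs = softmax(logits)
        accuracy = sum(p * s for p, s in zip(probs, success_rates))
        history.append(accuracy)
        logits = [
            z + lr * p * (s - accuracy)
            for z, p, s in zip(logits, probs, success_rates)
        ]
    return softmax(logits), history


# Hypothetical success rates for three reasoning patterns.
success_rates = [0.3, 0.5, 0.8]
final_probs, history = train_pattern_policy(success_rates, [0.0, 0.0, 0.0])
print("final pattern distribution:", [round(p, 4) for p in final_probs])
print("accuracy: start", round(history[0], 4), "end", round(history[-1], 4))
```

In this toy setting, accuracy rises only because probability mass shifts toward the pattern with the highest success rate, while the per-pattern success rates themselves never change.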

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 2

Strengths

- Provides an explanation for the stability of RLVR and the instability of RLIF
- Relevant topic: Understanding the behavior of RL training in the context of LLMs
- Multiple datasets and model scales are used, which strengthens the generality of the findings.
- Introduces an interesting and interpretable analysis framework at both the token and reasoning-pattern levels

Weaknesses

- The paper would benefit from a more comprehensive description of the experimental setup for the different training and evaluation runs. Important implementation details (e.g., temperature settings, sampling strategies, and other hyperparameters) are missing, making it difficult to fully reproduce or interpret the reported results.
- It remains unclear which exact implementation was used for RLVR training. For example, if vanilla GRPO was used, information about rollout numbers, optimization paramet

Reviewer 02 · Rating 6 · Confidence 3

Strengths

The paper effectively combines empirical findings with a clean theoretical analysis, offering useful insights for understanding LLM reasoning behavior.

Weaknesses

The experimental scope is narrow: all evaluations are conducted on math reasoning tasks and within the Qwen2.5 model family. Prior work has shown RLIF to produce gains on non-math general domains, so broader experiments across model families and tasks are necessary to substantiate the generality of the conclusions.

Reviewer 03 · Rating 2 · Confidence 3

Strengths

1. In Section 4.2, this work identifies patterns in LLM mathematical reasoning. This could inspire further exploration.
2. In Section 5, based on the assumptions, theoretical results are derived for RL training dynamics. Furthermore, these claims are supported by simulations in the experiment section.

Weaknesses

1. My biggest concern is with the assumptions. Specifically:
- Assumption 5.1 assumes that the success rate for each pattern remains the same. But this assumption is only demonstrated on one dataset, and there is no discussion of why this assumption is reasonable.
- Assumption 5.5 assumes that the correct answer $a^*$ has the highest probability across all possible answers, for all answers. I do not think this claim is well supported by Wang et al. (2022).
- In equation (5.3), the LLM is simplified as

Taxonomy

Topics: Natural Language Processing Techniques · Multi-Agent Systems and Negotiation · AI-based Problem Solving and Planning

Methods: Shrink and Fine-Tune · ADaptive gradient method with the OPTimal convergence rate