Reshaping Reasoning in LLMs: A Theoretical Analysis of RL Training Dynamics through Pattern Selection
Xingwu Chen, Tianle Li, Difan Zou

TL;DR
This paper provides a theoretical and empirical analysis of reinforcement learning training dynamics in large language models, revealing how RL reshapes reasoning patterns and highlighting the importance of reward types and initial model quality.
Contribution
It offers a novel theoretical framework for understanding RL training in LLMs, focusing on pattern selection and the effects of different reward signals.
Findings
RL primarily optimizes critical tokens, reshaping reasoning patterns.
The model's initial reasoning quality influences convergence behavior.
Internal rewards can improve or degrade performance over training.
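The first two findings can be illustrated with a toy sketch. It is not the paper's model: it assumes a bandit over three hypothetical reasoning patterns, each with a fixed success rate (the numbers are invented for illustration), a softmax policy over patterns, and the exact expected policy gradient under a binary verifiable reward. Under those assumptions, training can only change accuracy by shifting probability mass between patterns, and a poor starting distribution slows that shift considerably.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_accuracy(pattern_probs, success_rates):
    """Accuracy = sum over patterns of P(pattern) * fixed success rate."""
    return sum(p * s for p, s in zip(pattern_probs, success_rates))

def train_rlvr(init_logits, success_rates, lr=0.4, steps=200):
    """Gradient ascent on expected verifiable reward (toy setting).

    For a softmax policy over patterns, the exact gradient of the expected
    reward J with respect to logit k is pi_k * (s_k - J).
    Returns the final pattern distribution and the accuracy history.
    """
    logits = list(init_logits)
    history = []
    for _ in range(steps):
        probs = softmax(logits)
        acc = expected_accuracy(probs, success_rates)
        history.append(acc)
        logits = [t + lr * p * (s - acc)
                  for t, p, s in zip(logits, probs, success_rates)]
    final_probs = softmax(logits)
    history.append(expected_accuracy(final_probs, success_rates))
    return final_probs, history

# Hypothetical per-pattern success rates, held fixed during training.
SUCCESS_RATES = [0.2, 0.5, 0.8]

# "Good" start: uniform over patterns. "Poor" start: mass on the worst pattern.
probs_good, hist_good = train_rlvr([0.0, 0.0, 0.0], SUCCESS_RATES)
probs_poor, hist_poor = train_rlvr([3.0, 0.0, -3.0], SUCCESS_RATES)

print("good init: start %.3f, end %.3f" % (hist_good[0], hist_good[-1]))
print("poor init: start %.3f, end %.3f" % (hist_poor[0], hist_poor[-1]))
```

In this sketch the per-pattern success rates never change, so any accuracy gain comes entirely from reweighting patterns. The step size of 0.4 is a conservative choice for rewards in [0, 1]; larger values may still work but are not checked here.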
Abstract
While reinforcement learning (RL) has demonstrated remarkable success in enhancing the reasoning capabilities of language models, the training dynamics of RL in LLMs remain unclear. In this work, we provide an explanation of the RL training process through empirical analysis and rigorous theoretical modeling. First, through systematic reasoning-pattern-level and token-level analysis across the RL training process, we show that while different reasoning patterns exhibit relatively stable success rates during training, RL primarily optimizes a sparse subset of critical tokens, thereby reshaping reasoning pattern distributions to affect model performance. Building on these empirical insights, we develop a theoretical framework to understand the training dynamics of RL with two typical rewards: verifiable reward (RLVR) and model's internal feedback (RLIF). For RLVR, we analyze the training…
Peer Reviews
Decision: ICLR 2026 Poster
- Provides an explanation for the stability of RLVR and the instability of RLIF
- Relevant topic: Understanding the behavior of RL training in the context of LLMs
- Multiple datasets and model scales are used, which strengthens the generality of the findings.
- Introduces an interesting and interpretable analysis framework at both the token and reasoning-pattern levels
- The paper would benefit from a more comprehensive description of the experimental setup for the different training/evaluation runs. Important implementation details (e.g. temperature settings, sampling strategies, and other hyperparameters) are missing, making it difficult to fully reproduce or interpret the reported results.
- It remains unclear which exact implementation was used for RLVR training. For example, if vanilla GRPO was used, information about rollout numbers, optimization paramet
The paper effectively combines empirical findings with a clean theoretical analysis, offering useful insights for understanding LLM reasoning behavior.
The experimental scope is narrow: all evaluations are conducted on math reasoning tasks and within the Qwen2.5 model family. Prior work has shown RLIF to produce gains on non-math general domains, so broader experiments across model families and tasks are necessary to substantiate the generality of the conclusions.
1. In Section 4.2, this work identifies patterns in LLM mathematical reasoning, which could inspire further exploration.
2. In Section 5, theoretical results for RL training dynamics are derived from the stated assumptions. These claims are further supported by simulations in the experiment section.
1. My biggest concern is with the assumptions. Specifically:
- Assumption 5.1 assumes that the success rate for each pattern remains the same, but this assumption is only demonstrated on one dataset, and there is no discussion of why it is reasonable.
- Assumption 5.5 assumes that the correct answer $a^*$ has the highest probability across all possible answers, for all answers. I do not think this claim is well supported by Wang et al. (2022).
- In equation (5.3), the LLM is simplified as
Taxonomy
Topics: Natural Language Processing Techniques · Multi-Agent Systems and Negotiation · AI-based Problem Solving and Planning
Methods: Shrink and Fine-Tune · ADaptive gradient method with the OPTimal convergence rate
