RL in Name Only? Analyzing the Structural Assumptions in RL post-training for LLMs
Soumya Rani Samineni, Durgesh Kalwar, Karthik Valmeekam, Kaya Stechly, Subbarao Kambhampati

TL;DR
This paper critically examines the structural assumptions underlying RL post-training of large language models, shows that under these assumptions RL post-training reduces to a form of supervised fine-tuning, and questions claims of improved reasoning abilities.
Contribution
It identifies key assumptions in modeling LLM training as an MDP that lead to a degenerate formulation, showing RL post-training effectively becomes supervised fine-tuning.
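To make the claimed reduction concrete, here is a hedged sketch using the standard REINFORCE gradient with a binary outcome reward. The notation ($q$, $o$, $\mathcal{D}$, $r$) is assumed for illustration and is not taken from the paper, and the sketch omits baselines, clipping, and KL terms that practical algorithms such as GRPO include.

```latex
% Assumed notation (not from the paper): q is a prompt drawn from a dataset D,
% o = (o_1, ..., o_T) is a response sampled from the policy, and
% r(q, o) in {0, 1} is a binary outcome reward given once per response.
\begin{align}
\nabla_\theta J(\theta)
  &= \mathbb{E}_{q \sim \mathcal{D},\; o \sim \pi_\theta(\cdot \mid q)}
     \Big[ r(q, o) \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(o_t \mid q, o_{<t}) \Big] \\
  &= \mathbb{E}_{q \sim \mathcal{D},\; o \sim \pi_\theta(\cdot \mid q)}
     \Big[ \mathbb{1}[r(q, o) = 1] \, \nabla_\theta \log \pi_\theta(o \mid q) \Big]
\end{align}
```

The second line has the form of a log-likelihood gradient evaluated only on rewarded responses. The samples are drawn from the current policy rather than from a static dataset, so the resemblance is to an on-policy, filtered variant of supervised fine-tuning.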
Findings
Filtered Iterative SFT achieves comparable performance to RL-based methods.
Structural assumptions simplify RL to outcome-driven supervised learning.
RL incentivizes longer token sequences, affecting model behavior.
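The first finding can be illustrated with a toy loop: sample from the current policy, keep only the samples that earn a binary reward, and take a maximum-likelihood step on what was kept. This is a minimal sketch under strong assumptions, not the authors' F-ISFT implementation: the "policy" is a softmax over a small fixed set of candidate answers rather than a language model, and the names `filtered_iterative_sft` and `is_correct` are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def filtered_iterative_sft(logits, is_correct, iters=50, samples=16, lr=0.5, seed=0):
    """Toy filtered iterative SFT on a categorical policy (illustrative only).

    Each iteration samples answers from the current policy, filters them with
    a binary outcome reward, and takes one gradient-ascent step on the mean
    log-likelihood of the kept samples.
    """
    rng = random.Random(seed)
    logits = list(logits)
    n = len(logits)
    for _ in range(iters):
        probs = softmax(logits)
        # On-policy sampling: candidates come from the current policy.
        sampled = rng.choices(range(n), weights=probs, k=samples)
        # Filtering: keep only samples whose outcome reward is 1.
        kept = [a for a in sampled if is_correct[a]]
        if not kept:
            continue  # nothing to imitate in this round
        # For a softmax policy, d/d logit_j of log p(a) is 1[j == a] - p(j).
        grad = [0.0] * n
        for a in kept:
            for j in range(n):
                grad[j] += ((1.0 if j == a else 0.0) - probs[j]) / len(kept)
        logits = [l + lr * g for l, g in zip(logits, grad)]
    return softmax(logits)
```

Calling `filtered_iterative_sft([0.0, 0.0, 0.0, 0.0], [False, False, True, False])` starts from a uniform policy over four answers and shifts probability mass toward the single rewarded answer, with no explicit RL machinery beyond sampling and filtering.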
Abstract
Reinforcement-learning-based post-training of large language models (LLMs) has recently gained attention, particularly following the release of DeepSeek R1, which applied GRPO for fine-tuning. Amid the growing claims of improved reasoning abilities attributed to RL post-training, we critically examine the formulation and assumptions underlying these methods. We start by highlighting popular structural assumptions made in modeling LLM training as an MDP and show how they lead to a degenerate MDP that characterizes the problem as a contextual bandit, where RL updates naturally collapse into an on-policy variant of outcome-driven supervised learning. The two critical structural assumptions are (1) making the MDP states just a concatenation of the actions, with states becoming the context window and actions becoming the tokens in LLMs, and (2) splitting the reward of a…
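Assumption (1) can be sketched in a few lines of code. The example below is illustrative only: `transition`, `rollout`, and the integer token encoding are hypothetical names and choices, and only the first assumption plus a single outcome reward at the end of the response are modeled, since the statement of assumption (2) is cut off above.

```python
from typing import Callable, Tuple

Token = int
State = Tuple[Token, ...]  # the state is the context window: prompt plus tokens so far


def transition(state: State, action: Token) -> State:
    """Deterministic transition: the next state is the old state with the token appended."""
    return state + (action,)


def rollout(
    prompt: State,
    policy: Callable[[State], Token],
    reward_fn: Callable[[State, State], float],
    eos: Token,
    max_len: int,
) -> Tuple[State, float]:
    """Generate one response and score it once, at the end (outcome reward)."""
    state = tuple(prompt)
    for _ in range(max_len):
        action = policy(state)             # the action is the next token
        state = transition(state, action)  # the environment adds no randomness
        if action == eos:
            break
    response = state[len(prompt):]
    # The final state already contains every action taken, so the whole
    # trajectory can be read back from it.
    return response, reward_fn(tuple(prompt), response)
```

Because the transition only appends the chosen token, the environment contributes nothing beyond the prompt and the reward function. One way to read this is that each full response behaves like a single arm pulled for a given prompt, which is the contextual-bandit view the abstract describes.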
Peer Reviews
Decision · Submitted to ICLR 2026
- The analysis and breakdown of the MDP assumptions and terminal reward assignment in RL LLM post-training was meticulous and led to very interesting conclusions.
- The explanation for increasing response lengths in GRPO is compelling and opens up a new avenue for exploration to fix the issue with more sophisticated solutions than naive length penalties.
- Quantitative results on Qwen and Llama models and GSM8K, Countdown and MATH datasets validate the theoretical claims in the paper and providi

- Empirical results are only shown on smaller (0.5B to 3B) models. Adding additional results with larger models would add further support to claims of the paper.
- A discussion on the implementation complexity and training dynamics of F-ISFT+- would improve the paper.
- The removal of the KL penalty could be an over-simplification. Similarly, the assumption of binary rewards also might not hold true for all tasks, and F-ISFT is not necessarily comparable to other RL algorithms like PPO.
- Whil
1. This paper is clearly written and easy to follow.
1. First of all, the authors claim that "Our comprehensive analysis demonstrates that, due to these simplifying assumptions, the standard approach is effectively equivalent to outcome-driven supervised learning". However, the validity of this claim requires further consideration. Although the derived Equation (8) looks like SFT, the expectation is taken over the distribution of the current policy, whereas SFT is done over a static dataset. RL learns over its own rollouts!
2. Limited technical no
The strengths of the paper are as follows.
- First, the paper addresses a timely concern about what RL accomplishes.
- Second, the mathematical derivation effectively demonstrates how structural assumptions lead to equivalence with F-ISFT. The step-by-step simplification is easy to follow.
- Third, there is a comprehensive experimental setup, across several model families and sizes.
- Finally, the paper tackles the root cause, rather than proposing another patch like length penalties, the paper
The weaknesses of the paper are as follows.
- First, the largest model size explored was 3B, and findings may not hold for larger model sizes.
- Second, there is no comparison against RL methods that use proper credit assignment, e.g. MCTS.
