RL in Name Only? Analyzing the Structural Assumptions in RL post-training for LLMs

Soumya Rani Samineni; Durgesh Kalwar; Karthik Valmeekam; Kaya Stechly; Subbarao Kambhampati

arXiv:2505.13697·cs.LG·February 5, 2026

Open Access · 3 Reviews

TL;DR

This paper critically examines the structural assumptions underlying RL post-training of large language models. It argues that, under these assumptions, RL post-training reduces to a form of outcome-driven supervised fine-tuning, and it questions claims of improved reasoning abilities.

Contribution

It identifies key assumptions in modeling LLM training as an MDP that lead to a degenerate formulation, showing RL post-training effectively becomes supervised fine-tuning.

Findings

01

Filtered Iterative SFT achieves comparable performance to RL-based methods.

02

Structural assumptions simplify RL to outcome-driven supervised learning.

03

RL post-training, as commonly formulated, incentivizes longer token sequences, which helps explain the increasing response lengths observed with GRPO.
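To make the first finding concrete, here is a minimal toy sketch of a filtered iterative SFT loop: sample from the current policy, keep only the samples whose outcome is correct, and take a log-likelihood step on them. Everything here is an assumption for illustration (a softmax policy over a handful of candidate answers, the function name `filtered_iterative_sft`, the learning rate, and the positive-samples-only filter); it is not the authors' implementation.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def filtered_iterative_sft(logits, correct, rounds=20, samples=16, lr=0.5, seed=0):
    """Toy loop: sample on-policy, filter by outcome, fit the kept samples.

    `logits` parameterizes a categorical policy over candidate answers and
    `correct` is the index of the answer with reward 1 (hypothetical setup).
    """
    rng = random.Random(seed)
    logits = list(logits)
    for _ in range(rounds):
        probs = softmax(logits)
        # Sample responses from the *current* policy (on-policy data).
        drawn = rng.choices(range(len(logits)), weights=probs, k=samples)
        # Filter: keep only samples whose outcome is correct.
        kept = [a for a in drawn if a == correct]
        # Supervised step: gradient of log p(a) w.r.t. logits is onehot(a) - probs.
        for a in kept:
            for i in range(len(logits)):
                grad = (1.0 if i == a else 0.0) - probs[i]
                logits[i] += lr * grad / samples
    return logits


before = softmax([0.0, 0.0, 0.0, 0.0])[0]
after = softmax(filtered_iterative_sft([0.0, 0.0, 0.0, 0.0], correct=0))[0]
print(f"p(correct) before: {before:.3f}, after: {after:.3f}")
```

The training data is regenerated from the updated policy every round, which is what distinguishes this loop from SFT on a fixed dataset.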

Abstract

Reinforcement-learning-based post-training of large language models (LLMs) has recently gained attention, particularly following the release of DeepSeek R1, which applied GRPO for fine-tuning. Amid the growing claims around improved reasoning abilities attributed to RL post-training, we critically examine the formulation and assumptions underlying these methods. We start by highlighting popular structural assumptions made in modeling LLM training as an MDP, and show how they lead to a degenerate MDP that characterizes the problem as a contextual bandit, where RL updates naturally collapse into an on-policy variant of outcome-driven supervised learning. The two critical structural assumptions include (1) making the MDP states be just a concatenation of the actions, with states becoming the context window and the actions becoming the tokens in LLMs, and (2) splitting the reward of a…
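The first structural assumption can be illustrated with a small sketch. It assumes a toy setting in which tokens are strings and the reward is a binary check on the final token; the names `step`, `rollout`, and `outcome_reward` are hypothetical and do not come from the paper. The point it shows is that when the state is just the prompt plus the tokens emitted so far, the transition is deterministic and the final state is fully determined by the prompt and the response, so the whole response behaves like a single action scored by its outcome.

```python
from typing import Sequence, Tuple

# The state is the context window: prompt tokens followed by generated tokens.
State = Tuple[str, ...]


def step(state: State, action: str) -> State:
    """Deterministic transition: append the chosen token to the context."""
    return state + (action,)


def rollout(prompt: Sequence[str], actions: Sequence[str]) -> State:
    """Apply a sequence of token actions starting from the prompt."""
    state: State = tuple(prompt)
    for action in actions:
        state = step(state, action)
    return state


def outcome_reward(final_state: State, answer: str) -> float:
    """Toy binary reward, given only at the end of the response."""
    return 1.0 if final_state[-1] == answer else 0.0


prompt = ["2", "+", "2", "="]
response = ["4"]
final = rollout(prompt, response)

# The final state is nothing more than prompt + response.
assert final == tuple(prompt) + tuple(response)
print(final, outcome_reward(final, answer="4"))
```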

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- The analysis and breakdown of the MDP assumptions and terminal reward assignment in RL LLM post-training was meticulous and led to very interesting conclusions
- The explanation for increasing response lengths in GRPO is compelling and opens up a new avenue for exploration to fix the issue with more sophisticated solutions than naive length penalties.
- Quantitative results on Qwen and Llama models and GSM8K, Countdown and MATH datasets validate the theoretical claims in the paper and providi

Weaknesses

- Empirical results are only shown on smaller (0.5B to 3B) models. Adding additional results with larger models would add further support to claims of the paper.
- A discussion on the implementation complexity and training dynamics of F-ISFT+- would improve the paper.
- The removal of the KL penalty could be an over-simplification. Similarly, the assumption of binary rewards also might not hold true for all tasks, and F-ISFT is not necessarily comparable to other RL algorithms like PPO.
- Whil

Reviewer 02 · Rating 2 · Confidence 5

Strengths

1. This paper is clearly written and easy to follow.

Weaknesses

1. First of all, the authors claim that "Our comprehensive analysis demonstrates that, due to these simplifying assumptions, the standard approach is effectively equivalent to outcome-driven supervised learning". However, the validity of this claim requires further consideration. Although the derived Equation (8) looks like SFT, the expectation is taken over the distribution of the current policy, whereas SFT is done over a static dataset. RL learns over its own rollouts!
2. Limited technical no
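The distinction raised in the first point can be written out with the generic policy-gradient form. This is a sketch under stated assumptions, not a reproduction of the paper's Equation (8): it assumes a binary outcome reward $R(x, y) \in \{0, 1\}$, no baseline, no KL term, and no group normalization, and the symbols $\pi_\theta$, $\mathcal{D}$, and $R$ are introduced here for illustration.

```latex
% Policy gradient with a binary outcome reward: the expectation is over
% responses sampled from the current policy \pi_\theta.
\nabla_\theta J(\theta)
  = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
    \big[ R(x, y)\, \nabla_\theta \log \pi_\theta(y \mid x) \big]
  = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
    \big[ \mathbf{1}\{y \text{ is correct}\}\, \nabla_\theta \log \pi_\theta(y \mid x) \big]

% Standard SFT: the expectation is over a fixed dataset \mathcal{D}
% that does not change as \theta is updated.
\nabla_\theta \mathcal{L}_{\mathrm{SFT}}(\theta)
  = -\, \mathbb{E}_{(x, y) \sim \mathcal{D}}
    \big[ \nabla_\theta \log \pi_\theta(y \mid x) \big]
```

Both expressions weight a log-likelihood gradient, but the first resamples its data from $\pi_\theta$ at every update while the second does not, which is the gap the reviewer points to.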

Reviewer 03 · Rating 6 · Confidence 3

Strengths

The strengths of the paper are as follows.
- First, the paper addresses a timely concern about what RL accomplishes.
- Second, the mathematical derivation effectively demonstrates how structural assumptions lead to equivalence with F-ISFT. The step-by-step simplification is easy to follow.
- Third, there is a comprehensive experimental setup, across several model families and sizes.
- Finally, the paper tackles the root cause, rather than proposing another patch like length penalties, the paper

Weaknesses

The weaknesses of the paper are as follows.
- First, the largest model size explored was 3B, and findings may not hold for larger model sizes.
- Second, there is no comparison with RL methods that use proper credit assignment, e.g., MCTS.

Taxonomy

Topics: Higher Education Learning Practices · Interpreting and Communication in Healthcare · Artificial Intelligence in Law

Methods: Attention Is All You Need · RAdam · Softmax · Balanced Selection · Graph Self-Attention · Hyperboloid Embeddings