Why Does Reinforcement Learning Generalize? A Feature-Level Mechanistic Study of Post-Training in Large Language Models

Dan Shi; Zhuowen Han; Simon Ostermann; Renren Jin; Josef van Genabith; Deyi Xiong

arXiv:2604.25011·cs.CL·April 29, 2026

TL;DR

This study examines how reinforcement learning (RL) improves the reasoning abilities of large language models by analyzing changes in their internal features. It finds that RL largely preserves the base model's features, and it identifies specific features that mediate generalization.

Contribution

The paper introduces a feature-level interpretability framework to compare RL and supervised fine-tuning, uncovering mechanisms behind RL's superior generalization in large language models.

Findings

01 RL induces restrained, continually evolving feature changes that largely preserve base representations.

02 SFT rapidly introduces many highly specialized features that stabilize early in training, which may contribute to forgetting.

03 Disabling the identified features reduces the RL models' generalization, confirming their causal role.
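
The third finding rests on a feature ablation: switch off selected features and observe what the model loses. The following is a minimal sketch of that kind of intervention, not the authors' implementation. It assumes a sparse-autoencoder-style dictionary with a ReLU encoder and a linear decoder; the paper's actual feature space, model hooks, and feature-selection procedure are not described on this page, so every function name and weight below is hypothetical.

```python
from typing import List, Set

Vector = List[float]
Matrix = List[List[float]]


def encode(x: Vector, w_enc: Matrix, bias: Vector) -> Vector:
    """Feature activations: ReLU(w_enc @ x + bias), one value per feature."""
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_enc, bias)
    ]


def decode(features: Vector, w_dec: Matrix) -> Vector:
    """Map features back to activation space: sum_j features[j] * w_dec[j]."""
    dim = len(w_dec[0])
    return [sum(f * row[i] for f, row in zip(features, w_dec)) for i in range(dim)]


def ablate(
    x: Vector, w_enc: Matrix, bias: Vector, w_dec: Matrix, disabled: Set[int]
) -> Vector:
    """Remove only the decoded contribution of the disabled features from x.

    Subtracting the contribution (rather than replacing x with a full
    reconstruction) keeps whatever the dictionary fails to reconstruct.
    """
    feats = encode(x, w_enc, bias)
    # Keep only the features that are being switched off, zero the rest.
    removed = [f if j in disabled else 0.0 for j, f in enumerate(feats)]
    delta = decode(removed, w_dec)
    return [xi - di for xi, di in zip(x, delta)]
```

In an actual experiment, the patched activation would be written back into the model's forward pass and out-of-domain accuracy measured before and after. That step depends on the model and hook mechanism used, so it is left out here.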

Abstract

Reinforcement learning (RL)-based post-training often improves the reasoning performance of large language models (LLMs) beyond the training domain, while supervised fine-tuning (SFT) frequently leads to forgetting of general capabilities. However, the mechanisms underlying this contrast remain unclear. To bridge this gap, we present a feature-level mechanistic analysis methodology to probe RL generalization using a controlled experimental setup, where RL- and SFT-tuned models are trained from the same base model on identical data. Leveraging our interpretability framework, we align internal activations across models within a shared feature space and analyze how features evolve during post-training. We find that SFT rapidly introduces many highly specialized features that stabilize early in training, whereas RL induces more restrained and continually evolving feature changes that largely…

Code & Models

Repositories

danshi777/RL-generalization (GitHub)
