TL;DR
This study analyzes how internal features change during post-training to explain why reinforcement learning (RL) improves large language models' reasoning beyond the training domain. It finds that RL preserves the base model's core features, and it identifies key features that mediate generalization.
Contribution
The paper introduces a feature-level interpretability framework to compare RL and supervised fine-tuning, uncovering mechanisms behind RL's superior generalization in large language models.
Findings
RL induces restrained, evolving feature changes that preserve base representations.
Supervised fine-tuning rapidly introduces many highly specialized features that stabilize early in training, which may contribute to forgetting.
Disabling identified features reduces RL models' generalization, confirming their causal role.
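The ablation idea behind the last finding can be illustrated with a toy sketch. This summary does not describe the paper's actual procedure, so everything here is an assumption: features are treated as directions in a small hand-written dictionary (`decoder`), a hidden state is reconstructed as the activation-weighted sum of those directions, and "disabling" a feature means setting its activation to zero. The function names `ablate_features`, `decode`, and `ablation_effect` are hypothetical and are not taken from the paper's code.

```python
# Toy illustration only: a hand-written feature dictionary, not the paper's framework.

def ablate_features(activations, feature_ids):
    """Return a copy of the activation vector with the chosen features set to zero."""
    ablated = list(activations)
    for i in feature_ids:
        ablated[i] = 0.0
    return ablated


def decode(activations, decoder):
    """Reconstruct a hidden state as the activation-weighted sum of feature directions."""
    dim = len(decoder[0])
    return [
        sum(a * decoder[f][d] for f, a in enumerate(activations))
        for d in range(dim)
    ]


def ablation_effect(activations, decoder, feature_ids):
    """L2 distance between the original and the ablated reconstruction."""
    original = decode(activations, decoder)
    ablated = decode(ablate_features(activations, feature_ids), decoder)
    return sum((o - a) ** 2 for o, a in zip(original, ablated)) ** 0.5


# Three assumed features living in a 2-dimensional hidden space.
decoder = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
activations = [0.5, 0.0, 2.0]

# Ablating the strongly active feature changes the reconstruction;
# ablating an inactive feature leaves it untouched.
effect_active = ablation_effect(activations, decoder, [2])
effect_inactive = ablation_effect(activations, decoder, [1])
```

In a real experiment the quantity of interest would be the change in downstream task performance after ablation, not the reconstruction distance used in this toy example.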
Abstract
Reinforcement learning (RL)-based post-training often improves the reasoning performance of large language models (LLMs) beyond the training domain, while supervised fine-tuning (SFT) frequently leads to forgetting of general capabilities. However, the mechanisms underlying this contrast remain unclear. To bridge this gap, we present a feature-level mechanistic analysis methodology to probe RL generalization using a controlled experimental setup, where RL- and SFT-tuned models are trained from the same base model on identical data. Leveraging our interpretability framework, we align internal activations across models within a shared feature space and analyze how features evolve during post-training. We find that SFT rapidly introduces many highly specialized features that stabilize early in training, whereas RL induces more restrained and continually evolving feature changes that largely…
