RL Is Neither a Panacea Nor a Mirage: Understanding Supervised vs. Reinforcement Learning Fine-Tuning for LLMs
Hangzhan Jin, Sicheng Lv, Sifan Wu, Mohammad Hamdaqa

TL;DR
This paper investigates how supervised and reinforcement learning fine-tuning affect large language models' out-of-distribution performance, revealing that RL mainly counteracts SFT-induced directional drift and identifying efficient recovery methods.
Contribution
It provides a spectrum-based analysis of fine-tuning effects, showing RL's role in restoring OOD performance and introducing practical recovery techniques.
Findings
RL can significantly recover OOD performance lost after SFT.
Directional shifts in singular vectors are key to understanding model changes.
Low-rank and shallow recovery methods restore most OOD performance.
Abstract
Training large language models (LLMs) from scratch is increasingly impractical, making post-training methods such as supervised fine-tuning (SFT) and reinforcement-learning fine-tuning (RL-FT, e.g., PPO) central to modern practice. Using an out-of-distribution (OOD) variant of the 24-point card game and new spectrum-based diagnostics, we revisit how these two stages reshape model representation and OOD performance. Our key findings are- (1) RL-FT can restore much of the OOD performance loss from SFT (e.g., Llama-11B 8.97% to 15.38%, Qwen-7B 17.09% to 19.66%). But when SFT induces severe overfitting and a clear distribution shift, RL-FT cannot fully recover OOD performance. (2) Direction shifts of singular vectors matter more than singular value magnitudes. These shifts concentrate on directions linked to the largest and smallest singular values, leaving the bulk spectrum intact. (3)…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
