Gradient-Based LoRA Rank Allocation Under GRPO: An Empirical Study
Yash Ganpat Sawant

TL;DR
This study examines whether adaptive rank allocation techniques effective in supervised fine-tuning also improve reinforcement learning, finding that naive transfer can be detrimental due to different gradient landscape properties.
Contribution
The paper provides empirical evidence that gradient-based rank allocation strategies do not transfer well from supervised fine-tuning to reinforcement learning, highlighting fundamental differences between the gradient landscapes of the two settings.
Findings
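Proportional rank allocation degrades GRPO accuracy by 4.5 points relative to uniform allocation (70.0% vs. 74.5%) at an identical parameter budget.
The gradient landscape under GRPO is flatter than under SFT: the max-to-min layer importance ratio is only 2.17x, compared to >10x reported in SFT literature.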
Non-uniform allocation amplifies importance spread, creating a feedback loop.
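To make the findings above concrete, here is a minimal sketch of what proportional rank allocation and the importance-spread measurement could look like. It is not the paper's implementation. The function names, the `min_rank` floor, the largest-remainder rounding, and the use of total rank as a stand-in for the parameter budget (which holds only when all adapted layers have the same shape) are assumptions made for illustration.

```python
def importance_spread(scores):
    """Return the max-to-min ratio of per-layer importance scores."""
    return max(scores) / min(scores)


def proportional_ranks(scores, total_rank, min_rank=1):
    """Split a fixed rank budget across layers in proportion to their scores.

    Hypothetical helper: every layer first receives `min_rank`, the rest of
    the budget is shared proportionally, and fractional shares are resolved
    with largest-remainder rounding so the ranks sum exactly to `total_rank`.
    """
    n = len(scores)
    if total_rank < n * min_rank:
        raise ValueError("budget too small to give every layer min_rank")

    spare = total_rank - n * min_rank
    total_score = sum(scores)
    shares = [spare * s / total_score for s in scores]

    # Floor each share, then hand leftover units to the largest fractions.
    ranks = [min_rank + int(share) for share in shares]
    leftover = total_rank - sum(ranks)
    by_fraction = sorted(
        range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True
    )
    for i in by_fraction[:leftover]:
        ranks[i] += 1
    return ranks


# Illustrative scores only, not measurements from the paper.
scores = [1.0, 2.0, 1.0]
ranks = proportional_ranks(scores, total_rank=24)
uniform = [24 // len(scores)] * len(scores)
print(ranks, uniform, importance_spread(scores))
```

With a flat importance profile the proportional allocation stays close to uniform, so any accuracy difference would have to come from the small rank shifts themselves, which is consistent with the feedback-loop concern in the last finding.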
Abstract
Adaptive rank allocation for LoRA, which assigns more parameters to important layers and fewer to unimportant ones, consistently improves efficiency under supervised fine-tuning (SFT). We investigate whether this success transfers to reinforcement learning, specifically Group Relative Policy Optimization (GRPO). Using gradient-magnitude profiling on Qwen 2.5 1.5B with GSM8K, we find that it does not: proportional rank allocation degrades accuracy by 4.5 points compared to uniform allocation (70.0% vs. 74.5%), despite using identical parameter budgets. We identify two mechanisms behind this failure. First, the gradient landscape under GRPO is fundamentally flatter than under SFT: the max-to-min layer importance ratio is only 2.17x, compared to >10x reported in SFT literature. All layers carry meaningful gradient signal; none are truly idle. Second, we discover a gradient amplification…
