Adaptive Layerwise Perturbation: Unifying Off-Policy Corrections for LLM RL
Chenlu Ye, Xuanchang Zhang, Yifan Hao, Zhou Yu, Ziji Zhang, Abhinav Gullapalli, Hao Chen, Jing Huang, Tong Zhang

TL;DR
This paper introduces Adaptive Layerwise Perturbation (ALP), a method that injects small learnable perturbations into the input hidden states of each layer during policy updates. The goal is to stabilize off-policy learning in large language models by reducing heavy tails in the importance ratio, which in turn improves final performance on math and reasoning tasks.
Contribution
ALP is a novel approach that applies learnable perturbations at each layer to unify off-policy corrections, enhancing training stability and exploration in LLM RL.
Findings
ALP reduces importance ratio tail heaviness and KL spikes during training.
ALP improves final performance on math and reasoning tasks.
Representation-level perturbations across all layers are most effective.
Abstract
Off-policy problems such as policy staleness and training-inference mismatch have become a major bottleneck for training stability and further exploration in LLM RL. The distribution gap between the inference and updated policies grows because of techniques that improve inference efficiency, leading to heavy-tailed importance ratios. Heavy-tailed ratios arise when the policy is locally sharp, which further inflates gradients and can push updates outside the trust region. To address this, we propose Adaptive Layerwise Perturbation (ALP), which injects small learnable perturbations into the input hidden states of each layer during updates and uses the resulting perturbed policy as the numerator of the importance ratio against the unchanged inference policy in the objective. Intuitively, by adding controlled noise to intermediate representations, ALP prevents the updated policy from…
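The mechanism described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the small tanh network stands in for an LLM, the perturbation is assumed to be Gaussian noise with one scale per layer (`sigmas`, a hypothetical name), and the scales are held fixed here rather than learned. The function names are also illustrative. The sketch only shows how a perturbed forward pass supplies the numerator of the importance ratio while the inference policy's log-probabilities stay unchanged in the denominator.

```python
import numpy as np

def log_softmax(z):
    # Numerically stable log-softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def forward_logprobs(x, weights, sigmas=None, rng=None):
    # Toy layer stack. If `sigmas` is given, Gaussian noise scaled by
    # sigmas[l] is added to the input hidden state of layer l
    # (assumed form of the perturbation, not taken from the paper).
    h = x
    for l, W in enumerate(weights):
        if sigmas is not None:
            h = h + sigmas[l] * rng.standard_normal(h.shape)
        h = h @ W
        if l < len(weights) - 1:
            h = np.tanh(h)
    return log_softmax(h)

def perturbed_importance_ratio(x, actions, weights, sigmas, logp_infer, rng):
    # Numerator: perturbed updated policy. Denominator: unchanged
    # inference policy, passed in as precomputed log-probabilities.
    logp = forward_logprobs(x, weights, sigmas, rng)
    logp_taken = logp[np.arange(len(actions)), actions]
    return np.exp(logp_taken - logp_infer)

rng = np.random.default_rng(0)
weights = [0.5 * rng.standard_normal((8, 8)),
           0.5 * rng.standard_normal((8, 8)),
           0.5 * rng.standard_normal((8, 5))]
x = rng.standard_normal((4, 8))          # 4 toy "states"
actions = np.array([0, 1, 2, 3])         # sampled "tokens"

# Inference-policy log-probs, computed without any perturbation.
logp_infer = forward_logprobs(x, weights)[np.arange(4), actions]

sigmas = np.array([0.05, 0.05, 0.05])    # one noise scale per layer
ratio = perturbed_importance_ratio(x, actions, weights, sigmas, logp_infer, rng)
```

With all scales set to zero and identical weights, the ratio is exactly 1 for every sample. In a real training loop the ratio would enter a policy-gradient objective and the per-layer scales would be trainable parameters, which requires an autodiff framework rather than plain NumPy.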
