GRADE: Replacing Policy Gradients with Backpropagation for LLM Alignment
Lukas Abrie Nel

TL;DR
GRADE introduces a differentiable relaxation technique for LLM alignment, replacing high-variance policy gradients with backpropagation, leading to more stable training and improved reward performance.
Contribution
The paper presents GRADE, a novel method that replaces policy gradient estimation with backpropagation using Gumbel-Softmax, enabling more stable and efficient LLM alignment.
Findings
GRADE-STE achieves higher test rewards than PPO and REINFORCE.
GRADE-STE exhibits over 14 times lower gradient variance.
The method generalizes well to held-out data.
Abstract
Reinforcement learning from human feedback (RLHF) has become the dominant paradigm for aligning large language models with human preferences. However, policy gradient methods such as PPO suffer from high-variance gradient estimates, requiring careful hyperparameter tuning and extensive computational resources. We introduce GRADE (Gumbel-softmax Relaxation for Alignment via Differentiable Estimation), a method that replaces high-variance policy gradient estimation with direct backpropagation through a differentiable relaxation of the discrete token sampling process. Using the Gumbel-Softmax reparameterization with straight-through estimation (GRADE-STE), we enable end-to-end gradient flow from reward signals through generated tokens to model parameters. On sentiment-controlled text generation using the IMDB dataset, GRADE-STE achieves a test reward of 0.763 ± 0.344 compared to PPO's…
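The abstract names the Gumbel-Softmax reparameterization with straight-through estimation but the listing does not show how it works. The following is a minimal single-token sketch of that general technique, not the paper's implementation: the forward pass emits a hard one-hot token, while the backward pass uses the gradient of the soft relaxation. The function names, the temperature `tau`, the example logits, and the per-token reward vector are all hypothetical choices made for illustration.

```python
import math
import random


def gumbel_softmax_ste(logits, tau, rng):
    """Sample one token; return (y_hard, y_soft).

    y_soft is the Gumbel-Softmax relaxation, y_hard its one-hot argmax.
    """
    # Gumbel(0, 1) noise: g = -log(-log(u)) with u ~ Uniform(0, 1).
    gumbels = [
        -math.log(-math.log(rng.uniform(1e-10, 1.0 - 1e-10)))
        for _ in logits
    ]
    scores = [(l + g) / tau for l, g in zip(logits, gumbels)]
    # Numerically stable softmax over the perturbed, tempered logits.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    y_soft = [e / z for e in exps]
    k = max(range(len(y_soft)), key=y_soft.__getitem__)
    y_hard = [1.0 if i == k else 0.0 for i in range(len(y_soft))]
    return y_hard, y_soft


def ste_logit_grad(y_soft, upstream, tau):
    """Straight-through backward pass.

    Pretends d(y_hard)/d(logits) equals d(y_soft)/d(logits), i.e. applies
    the softmax Jacobian to the upstream gradient d(reward)/d(y).
    """
    dot = sum(y * u for y, u in zip(y_soft, upstream))
    return [y * (u - dot) / tau for y, u in zip(y_soft, upstream)]


# Toy example (assumed values): three-token vocabulary, linear reward.
rng = random.Random(0)
logits = [2.0, 0.5, -1.0]
token_rewards = [0.1, 0.9, 0.3]  # hypothetical reward per token
tau = 1.0

y_hard, y_soft = gumbel_softmax_ste(logits, tau, rng)
reward = sum(y * r for y, r in zip(y_hard, token_rewards))
# For a linear reward, d(reward)/d(y) is just token_rewards.
grad = ste_logit_grad(y_soft, token_rewards, tau)
```

In an autodiff framework the same effect is usually obtained by returning the hard sample in the forward pass while routing gradients through the soft one. The manual Jacobian here only makes that gradient path explicit.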
Taxonomy
Topics: Topic Modeling · Sentiment Analysis and Opinion Mining · Natural Language Processing Techniques
