GRADE: Replacing Policy Gradients with Backpropagation for LLM Alignment

Lukas Abrie Nel

arXiv:2601.11574·cs.LG·January 21, 2026

Open Access

TL;DR

GRADE replaces high-variance policy gradient estimation with backpropagation through a differentiable Gumbel-Softmax relaxation of token sampling, giving more stable training and higher test reward than PPO and REINFORCE.

Contribution

The paper presents GRADE, a method that replaces policy gradient estimation with backpropagation through a Gumbel-Softmax relaxation of discrete token sampling, enabling more stable and efficient LLM alignment.

Findings

01 GRADE-STE achieves higher test rewards than PPO and REINFORCE.

02 GRADE-STE exhibits over 14 times lower gradient variance.

03 The method generalizes well to held-out data.

Abstract

Reinforcement learning from human feedback (RLHF) has become the dominant paradigm for aligning large language models with human preferences. However, policy gradient methods such as PPO suffer from high-variance gradient estimates, requiring careful hyperparameter tuning and extensive computational resources. We introduce GRADE (Gumbel-softmax Relaxation for Alignment via Differentiable Estimation), a method that replaces high-variance policy gradient estimation with direct backpropagation through a differentiable relaxation of the discrete token sampling process. Using the Gumbel-Softmax reparameterization with straight-through estimation (GRADE-STE), we enable end-to-end gradient flow from reward signals through generated tokens to model parameters. On sentiment-controlled text generation using the IMDB dataset, GRADE-STE achieves a test reward of 0.763 ± 0.344 compared to PPO's…
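The straight-through Gumbel-Softmax step named in the abstract can be illustrated with a small, self-contained sketch. This is not the paper's implementation: it handles a single token position, uses only the Python standard library, and assumes a toy reward that is linear in the sampled token (the names `gumbel_softmax_ste`, `ste_reward_grad`, and `token_rewards` are hypothetical). The forward pass uses the hard one-hot sample, while the gradient is taken through the soft relaxed sample.

```python
import math
import random


def gumbel_softmax_ste(logits, tau, rng):
    """Return (hard one-hot sample, soft relaxed sample) for one token position."""
    # Gumbel(0, 1) noise: g = -log(-log(u)) with u ~ Uniform(0, 1).
    gumbels = [
        -math.log(-math.log(rng.uniform(1e-12, 1.0 - 1e-12))) for _ in logits
    ]
    scores = [(l + g) / tau for l, g in zip(logits, gumbels)]
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    soft = [e / z for e in exps]
    # Straight-through: the forward pass uses the argmax as a one-hot vector.
    k = max(range(len(soft)), key=soft.__getitem__)
    hard = [1.0 if i == k else 0.0 for i in range(len(soft))]
    return hard, soft


def ste_reward_grad(soft, token_rewards, tau):
    """Gradient of the toy reward w.r.t. the logits via the soft sample's Jacobian.

    For r = sum_i w_i * y_i and y = softmax((logits + g) / tau):
        dr/dlogit_j = y_j * (w_j - sum_i w_i * y_i) / tau
    """
    expected = sum(w * p for w, p in zip(token_rewards, soft))
    return [p * (w - expected) / tau for p, w in zip(soft, token_rewards)]


rng = random.Random(0)
logits = [0.5, 1.5, -0.3]
token_rewards = [0.1, 0.9, 0.2]  # hypothetical per-token reward, for illustration

hard, soft = gumbel_softmax_ste(logits, tau=1.0, rng=rng)
reward = sum(w * h for w, h in zip(token_rewards, hard))  # forward: hard sample
grad = ste_reward_grad(soft, token_rewards, tau=1.0)      # backward: soft sample
```

In an autograd framework the same effect is usually obtained by letting the hard sample carry the soft sample's gradient, so no Jacobian needs to be written by hand. Lower values of `tau` make the soft sample closer to one-hot, at the cost of sharper gradients.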

Taxonomy

Topics: Topic Modeling · Sentiment Analysis and Opinion Mining · Natural Language Processing Techniques