Text2Grad: Reinforcement Learning from Natural Language Feedback

Hanyang Wang; Lu Wang; Chaoyun Zhang; Tianjun Mao; Si Qin; Qingwei Lin; Saravan Rajmohan; Dongmei Zhang

arXiv:2505.22338·cs.CL·January 28, 2026


Open Access · 3 Reviews

TL;DR

Text2Grad introduces a novel reinforcement learning approach that converts natural language feedback into span-level gradients, enabling fine-grained, interpretable, and effective model updates for tasks like summarization and code generation.

Contribution

It presents a new RL paradigm that uses span-level feedback from natural language critiques to directly refine language model policies, improving alignment and interpretability.

Findings

1. Outperforms scalar-reward RL and prompt-only baselines in multiple tasks.
2. Provides higher task metrics and richer interpretability.
3. Enables precise, feedback-conditioned model adjustments.

Abstract

Traditional RLHF optimizes language models with coarse, scalar rewards that mask the fine-grained reasons behind success or failure, leading to slow and opaque learning. Recent work augments RL with textual critiques through prompting or reflection, improving interpretability but leaving model parameters untouched. We introduce Text2Grad, a reinforcement-learning paradigm that turns free-form textual feedback into span-level gradients. Given human (or programmatic) critiques, Text2Grad aligns each feedback phrase with the relevant token spans, converts these alignments into differentiable reward signals, and performs gradient updates that directly refine the offending portions of the model's policy. This yields precise, feedback-conditioned adjustments instead of global nudges. Text2Grad is realized through three components: (1) a high-quality feedback-annotation pipeline that pairs…
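The span-to-reward step described in the abstract can be illustrated with a small sketch. It assumes the feedback has already been aligned to token spans and that each span carries a label of +1 or -1, which is how the reviews on this page characterize the pseudo-rewards. The function names `span_rewards` and `token_weighted_loss`, the example tokens, and the plain REINFORCE-style surrogate loss are all hypothetical. They are not the authors' implementation, and the paper's actual objective may differ.

```python
from typing import List, Tuple

# A labeled span: (start index, end index exclusive, label in {+1, -1}).
# This representation is an assumption made for illustration.
Span = Tuple[int, int, int]


def span_rewards(num_tokens: int, spans: List[Span]) -> List[float]:
    """Spread each span's label uniformly over the tokens it covers.

    Tokens outside every span keep a neutral reward of 0.0.
    """
    rewards = [0.0] * num_tokens
    for start, end, label in spans:
        if label not in (1, -1):
            raise ValueError(f"span label must be +1 or -1, got {label}")
        # Clamp the span to the valid token range.
        for i in range(max(start, 0), min(end, num_tokens)):
            rewards[i] = float(label)
    return rewards


def token_weighted_loss(log_probs: List[float], rewards: List[float]) -> float:
    """REINFORCE-style surrogate: -sum_t r_t * log pi(y_t).

    Minimizing it raises the log-probability of tokens rewarded +1 and
    lowers the log-probability of tokens rewarded -1.
    """
    if len(log_probs) != len(rewards):
        raise ValueError("log_probs and rewards must have the same length")
    return -sum(r * lp for r, lp in zip(rewards, log_probs))


# Hypothetical example: a critique praises "The cat sat" and flags "Mars".
tokens = ["The", "cat", "sat", "on", "Mars"]
spans = [(0, 3, 1), (4, 5, -1)]
rewards = span_rewards(len(tokens), spans)

# Made-up per-token log-probabilities under the current policy.
log_probs = [-0.5, -1.0, -0.25, -2.0, -0.75]
loss = token_weighted_loss(log_probs, rewards)
```

In this sketch only the tokens inside a labeled span contribute to the loss, which is the sense in which the update is localized to the criticized or praised portion of the output rather than applied as one sequence-level reward.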

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. The motivation is intuitive, and the proposed method supports it.
2. The empirical performance is validated on three different tasks, which shows the generalization ability of the proposed method.
3. The paper is overall well written and easy to follow.

Weaknesses

1. This paper omits existing work on reinforcement learning from natural language feedback, especially work that uses such feedback during training, such as [1]. Since the proposed method falls squarely into this line of work, the authors should add a discussion that highlights its core novelty.
2. The proposed method jointly generates free-form natural language critiques and structured span-level reward labels. But how natural language critiques can help improve the learning of span-level rewar

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. The paper is well written and easy to follow.
2. The method is novel and well motivated: it bridges interpretable textual feedback and gradient-based optimization, addressing a real limitation of scalar RLHF.
3. The method shows strong empirical results across diverse tasks, with consistent improvements over strong baselines.

Weaknesses

1. Binary pseudo-rewards discard fine-grained information in textual critiques. Moreover, treating all tokens within a labeled span equally is simple but ignores token importance.
2. Computational overhead could be critical for adoption, yet there is no empirical measurement comparing Text2Grad with the baseline methods.

Reviewer 03 · Rating 2 · Confidence 4

Strengths

1. The motivation, centered on credit assignment and interpretability in RLHF, is straightforward and easy to follow.
2. The paper describes the LLM annotation pipeline in detail, which makes the work easier to reproduce.
3. Experimental results show consistent improvements across different datasets. The authors also provide a cherry-picked example to show its interpretability.

Weaknesses

1. The “NL-gradient” terminology is inaccurate: the method does not differentiate through language but converts text into discrete spans and assigns each span token-level rewards (+1/-1). The method is not language-conditioned gradient flow.
2. All feedback is GPT-4o-generated, making this RLAIF, not RLHF, with no human validation or study of noisy/contradictory real feedback. The training data distilled from GPT-4o lacks real human feedback. The robustness to noisy or adversarial human cri

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI)