TL;DR
This paper introduces DelTA, a method for token-level credit assignment in reinforcement learning from verifiable rewards (RLVR), aimed at making policy updates in large language models more interpretable and more effective.
Contribution
DelTA estimates token coefficients to amplify discriminative token-gradient directions, improving reward-based learning in language models beyond existing centroid-based methods.
Findings
DelTA outperforms strong baselines on mathematical benchmarks.
DelTA also improves code generation and out-of-domain performance.
DelTA enhances the contrastiveness of RLVR updates.
Abstract
Reinforcement learning from verifiable rewards (RLVR) has emerged as a central technique for improving the reasoning capabilities of large language models. Despite its effectiveness, how response-level rewards translate into token-level probability changes remains poorly understood. We introduce a discriminator view of RLVR updates, showing that the policy-gradient update direction implicitly acts as a linear discriminator over token-gradient vectors and thereby determines which token probabilities are increased or decreased during learning. Under standard sequence-level RLVR, this discriminator is constructed from positive- and negative-side centroids formed by advantage-weighted averaging of token-gradient vectors. However, such centroid construction can be dominated by shared high-frequency patterns, such as formatting tokens, diluting sparse yet discriminative directions that better…
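The discriminator view in the abstract can be illustrated with a small numerical toy. This is a minimal sketch under stated assumptions, not the paper's implementation and not DelTA itself: the 3-dimensional "token-gradient vectors", the per-token advantages (each token inheriting its response's advantage), and the helper names `dot`, `weighted_sum`, and `side_centroid` are all hypothetical. It relies only on the standard first-order fact that a step along a direction w changes a token's log-probability by roughly the step size times the dot product of w with that token's gradient vector, so the sign of that dot product decides whether the token's probability goes up or down.

```python
# Toy illustration of the "discriminator view" of a sequence-level RLVR update.
# All vectors and advantages below are made up for illustration.

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def weighted_sum(vectors, weights):
    """Sum of weights[i] * vectors[i], computed coordinate by coordinate."""
    dim = len(vectors[0])
    return [sum(w * v[k] for v, w in zip(vectors, weights)) for k in range(dim)]

def side_centroid(token_grads, advantages, sign):
    """Advantage-weighted centroid of the tokens whose advantage has the given sign.

    Returns (centroid, total_absolute_advantage). Assumes that side is non-empty.
    """
    pairs = [(g, abs(a)) for g, a in zip(token_grads, advantages) if a * sign > 0]
    mass = sum(a for _, a in pairs)
    total = weighted_sum([g for g, _ in pairs], [a for _, a in pairs])
    return [x / mass for x in total], mass

# Hypothetical token-gradient vectors: axis 0 is a shared "formatting" direction,
# axis 1 appears only in rewarded responses, axis 2 only in penalized ones.
shared = [1.0, 0.0, 0.0]
token_grads = [shared, shared, shared, [0.0, 1.0, 0.0],   # tokens of a rewarded response
               shared, shared, shared, [0.0, 0.0, 1.0]]   # tokens of a penalized response
advantages = [1.0] * 4 + [-1.0] * 4

# Policy-gradient ascent direction: advantage-weighted sum of token gradients.
w = weighted_sum(token_grads, advantages)

# The same direction expressed through the positive- and negative-side centroids.
c_pos, mass_pos = side_centroid(token_grads, advantages, +1)
c_neg, mass_neg = side_centroid(token_grads, advantages, -1)
rebuilt = [mass_pos * p - mass_neg * n for p, n in zip(c_pos, c_neg)]

# Linear discriminator: the sign of g . w is the first-order direction of change
# of each token's log-probability under a small step along w.
scores = [dot(g, w) for g in token_grads]

print("c_pos:", c_pos)
print("c_neg:", c_neg)
print("w:", w)
print("scores:", scores)
```

In this toy, three quarters of each centroid's weight sits on the shared formatting axis and only one quarter on the axis that actually separates rewarded from penalized tokens, which is the kind of dilution the abstract describes. Because the two sides here are perfectly balanced, the shared component happens to cancel in w; with unequal counts or advantage magnitudes it would not.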
