SCAR: Shapley Credit Assignment for More Efficient RLHF

Meng Cao; Shuyuan Zhang; Xiao-Wen Chang; Doina Precup

arXiv:2505.20417·cs.AI·May 28, 2025

TL;DR

SCAR introduces a Shapley value-based method for more accurate credit assignment in RLHF, improving training efficiency and alignment quality of large language models without extra annotation or critique models.

Contribution

The paper proposes SCAR, a novel Shapley value-based reward distribution method that enhances credit assignment in RLHF without additional annotations or critique models.

Findings

01

SCAR converges faster than standard RLHF methods.

02

SCAR achieves higher reward scores across multiple NLP tasks.

03

SCAR's dense rewards provably leave the optimal policy unchanged.

Abstract

Reinforcement Learning from Human Feedback (RLHF) is a widely used technique for aligning Large Language Models (LLMs) with human preferences, yet it often suffers from sparse reward signals, making effective credit assignment challenging. In typical setups, the reward model provides a single scalar score for an entire generated sequence, offering little insight into which token or span-level decisions were responsible for the outcome. To address this, we propose Shapley Credit Assignment Rewards (SCAR), a novel method that leverages Shapley values in cooperative game theory. SCAR distributes the total sequence-level reward among constituent tokens or text spans based on their principled marginal contributions. This creates dense reward signals, crucially, without necessitating the training of auxiliary critique models or recourse to fine-grained human annotations at intermediate…
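To make the redistribution idea concrete, here is a minimal sketch of exact Shapley credit assignment over three text spans. It is an illustration under stated assumptions, not the authors' implementation: the spans, the `toy_reward` function, and the convention of scoring a partial response by the subset of spans kept are all hypothetical stand-ins for a real reward model, and the exact enumeration shown here is only feasible for a handful of players.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values by enumerating every coalition (small inputs only).

    players: iterable of hashable player ids (here, span indices).
    value:   function mapping a frozenset of player ids to a float.
    """
    players = list(players)
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for k in range(n):
            # Probability that exactly this coalition of size k precedes p
            # in a uniformly random ordering of the players.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for coalition in combinations(others, k):
                s = frozenset(coalition)
                phi[p] += weight * (value(s | {p}) - value(s))
    return phi


# Hypothetical response split into three spans.
SPANS = ["Thanks for asking.", "The answer is 42.", "Hope that helps!"]


def toy_reward(kept):
    """Hypothetical stand-in for a reward model scoring a partial response."""
    score = 0.0
    if 1 in kept:
        score += 1.0  # the span that actually answers the question
    if 0 in kept and 1 in kept:
        score += 0.5  # polite framing only helps when the answer is present
    if 2 in kept:
        score += 0.1  # small bonus for the sign-off
    return score


credits = shapley_values(range(len(SPANS)), toy_reward)
total = toy_reward(frozenset(range(len(SPANS))))

for idx, span in enumerate(SPANS):
    print(f"{credits[idx]:.2f}  {span}")
print(f"sum of credits = {sum(credits.values()):.2f}, sequence reward = {total:.2f}")
```

The per-span credits sum to the sequence-level reward, which is the property that lets a single scalar score be turned into a dense signal. Exact enumeration needs a number of reward evaluations that grows exponentially with the number of players, so it does not scale to long sequences.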

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 6·Confidence 3

Strengths

1. The use of Shapley values to redistribute sparse rewards into dense rewards has a strong theoretical foundation. Further, the dense learning signal does not change the optimal policy. 2. Adaptive segmentation and Owen value approximations make the approach tractable and suited for real-world use. 3. Experiments show that SCAR achieves faster convergence and stronger preference results than the baselines.

Weaknesses

1. The proposed method focuses on a novel way to convert sparse reward into dense reward(s) and the experiments demonstrate the efficacy of dense rewards in PPO training. However, there are other RL methods, e.g. OREO (Wang et al., 2025) and DQO (Liu et al., 2024), that explicitly formulate intermediate steps as an MDP and employ soft Q-learning to learn the policy. These methods can also use process level supervision. A major concern with the experiment section is that they focus only on impr

Reviewer 02·Rating 6·Confidence 4

Strengths

Originality: Introducing the Shapley value into RLHF credit assignment represents an innovative integration of cooperative game theory and RLHF, distinct from heuristic methods that rely on attention weights (e.g., ABC). Quality: The method is rigorously designed, encompassing theoretical analysis (policy invariance), efficient approximations (Owen value, adaptive segmentation), and comprehensive experiments (three tasks, multiple baselines, and statistical validation). Clarity: The paper is wel

Weaknesses

Computational overhead: Despite reducing complexity through the use of Owen values and adaptive segmentation, Shapley value approximation still introduces significant computational costs (e.g., Token-level SCAR requires 48 GPU hours for the summarization task, compared to 7 hours for Span-level). It may remain impractical for extremely long sequences or large-scale LLMs. Reward model assumptions: SCAR assumes that the reward model can provide meaningful scoring for partial sequences (Eq. 3). How

Reviewer 03·Rating 4·Confidence 4

Strengths

1. Introduces a principled, game-theoretic approach to reward attribution that is potentially more interpretable than heuristic token credit schemes. 2. Provides token/span-level granularity, which could support better and faster convergence. 3. Presents a clear, modular framework that can be combined with standard RLHF pipelines.

Weaknesses

1. While Shapley values do not require independence per se, they assume a well-defined utility for any coalition of “players.” In LLMs, tokens are highly dependent and causal; the contribution of a token is context-sensitive and may change as later tokens are generated. The paper should discuss how SCAR accounts for this sequential dependence beyond treating tokens as interchangeable players. 2. Exact Shapley computation requires evaluating the utility for arbitrary coalitions. Using the rewar

Taxonomy

Topics: Distributed and Parallel Computing Systems