Understanding Impact of Human Feedback via Influence Functions

Taywon Min; Haeone Lee; Yongchan Kwon; Kimin Lee

arXiv:2501.05790·cs.AI·September 3, 2025

TL;DR

This paper introduces a method using influence functions to measure and improve the impact of human feedback on reward models in RLHF, addressing noise and bias issues for better alignment of language models.

Contribution

It proposes a compute-efficient influence function approach to detect biases and guide feedback refinement in large-scale RLHF datasets.
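For orientation, the classical influence-function estimate of Koh and Liang, which one of the reviews below cites as the basis of this work, can be written as follows. The notation is an assumption made here for illustration (z is one training preference pair, D_val a validation set, and theta-hat the trained reward-model parameters); it is not copied from the paper, whose compute-efficient approximation is not reproduced here.

```latex
\mathcal{I}(z, \mathcal{D}_{\mathrm{val}})
  = -\,\nabla_\theta \mathcal{L}(\mathcal{D}_{\mathrm{val}}; \hat\theta)^{\top}
     \, H_{\hat\theta}^{-1} \,
     \nabla_\theta \ell(z; \hat\theta),
\qquad
H_{\hat\theta} = \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta^{2}\, \ell(z_i; \hat\theta)
```

Under this sign convention, upweighting z by a small epsilon changes the validation loss by roughly epsilon times the score, so a positive score marks a training example whose presence tends to raise the validation loss.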

Findings

01

Effectively detects labeler biases in feedback datasets

02

Guides labelers to improve feedback quality

03

Enhances interpretability of human feedback impact

Abstract

In Reinforcement Learning from Human Feedback (RLHF), it is crucial to learn suitable reward models from human feedback to align large language models (LLMs) with human intentions. However, human feedback can often be noisy, inconsistent, or biased, especially when evaluating complex responses. Such feedback can lead to misaligned reward signals, potentially causing unintended side effects during the RLHF process. To address these challenges, we explore the use of influence functions to measure the impact of human feedback on the performance of reward models. We propose a compute-efficient approximation method that enables the application of influence functions to LLM-based reward models and large-scale preference datasets. Our experiments showcase two key applications of influence functions: (1) detecting common labeler biases in human feedback datasets and (2) guiding labelers in…
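The idea of scoring each piece of feedback by its effect on reward-model performance can be illustrated with a minimal sketch. It assumes a tiny linear Bradley-Terry reward model with an exact Hessian solve, which is only feasible because the model has three parameters; it is not the paper's compute-efficient approximation for LLM-based reward models. All names (`W_TRUE`, `make_toy_preferences`, `fit_reward_model`, `influence_scores`) and the synthetic data are hypothetical.

```python
import numpy as np

# Hypothetical "true" reward direction used only to generate toy data.
W_TRUE = np.array([2.0, -1.0, 0.5])


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def make_toy_preferences(n, n_flipped, seed):
    """Return feature differences (chosen - rejected) and a mask of flipped labels."""
    rng = np.random.default_rng(seed)
    raw = rng.normal(size=(n, W_TRUE.size))
    # Orient every pair so the "chosen" response has the higher true reward.
    sign = np.where(raw @ W_TRUE >= 0, 1.0, -1.0)
    diffs = raw * sign[:, None]
    # Simulate labeler errors by swapping chosen/rejected on the first n_flipped pairs.
    flipped = np.zeros(n, dtype=bool)
    flipped[:n_flipped] = True
    diffs[flipped] *= -1.0
    return diffs, flipped


def fit_reward_model(diffs, l2=1e-2, lr=0.5, steps=2000):
    """Gradient descent on the L2-regularized Bradley-Terry loss -log sigmoid(w . d)."""
    w = np.zeros(diffs.shape[1])
    for _ in range(steps):
        p = sigmoid(diffs @ w)
        grad = -(diffs * (1.0 - p)[:, None]).mean(axis=0) + l2 * w
        w -= lr * grad
    return w


def influence_scores(w, train_diffs, val_diffs, l2=1e-2):
    """Score each training pair as -grad_val^T H^{-1} grad_train (exact Hessian)."""
    n, dim = train_diffs.shape
    p_tr = sigmoid(train_diffs @ w)
    train_grads = -(train_diffs * (1.0 - p_tr)[:, None])  # shape (n, dim)
    hessian = (train_diffs * (p_tr * (1.0 - p_tr))[:, None]).T @ train_diffs / n
    hessian += l2 * np.eye(dim)
    p_val = sigmoid(val_diffs @ w)
    val_grad = -(val_diffs * (1.0 - p_val)[:, None]).mean(axis=0)
    # Positive score: upweighting this pair is estimated to raise validation loss.
    return -train_grads @ np.linalg.solve(hessian, val_grad)
```

In this toy setup, one would fit the model on a training set containing some flipped pairs, score every pair against a clean validation set, and review the pairs with the largest positive scores first, since those are the ones estimated to hurt validation performance.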

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01·Rating 6·Confidence 3

Strengths

1. The use of influence functions to analyze the impact of human feedback is a promising direction that adds a layer of interpretability in RLHF, which is essential for aligning LLMs with human values.
2. The paper introduces a compute-efficient method that enables scalable application of influence functions, potentially reducing computational demands by 2.5 times, a significant improvement over previous methods.
3. The methodology, experimental design, and results are presented clearly, with we…

Weaknesses

1. The study’s limitations in real-world scenarios, where expert and non-expert labelers may not share sub-objective scores, could reduce the generalizability of the approach.
2. While the paper shows effectiveness in detecting length bias, sycophancy bias remains challenging, as it involves understanding nuanced human agreement tendencies that may vary by context.

Reviewer 02·Rating 8·Confidence 5

Strengths

This paper excels in identifying data points that negatively influence reward models through the application of influence functions. Given the common issue of noise in human-labeled data, particularly in RLHF, this approach enhances transparency and is valuable in managing labels from non-expert labelers. The methodology effectively addresses computational challenges using an approximation function, which is highly practical aiming for reliable AI. The experiments are well-defined, and the perf

Weaknesses

While the application of influence functions to assess the contribution of individual data points is compelling, the methodology relies heavily on manually curated validation sets, as evidenced by the ablation experiments. There are two primary concerns: 1) Dependence on Domain Knowledge: The construction of validation sets requires domain-specific knowledge, which could limit generalizability. Although addressing verbosity and sycophancy is valuable, the methodology appears capable of handling

Reviewer 03·Rating 5·Confidence 3

Strengths

The paper introduces a novel approach to enhance the interpretability of reward models. By applying influence functions, the authors provide a method to quantify the impact of individual feedback on the model's performance, offering insights into how human feedback shapes the reward model's outcomes. The idea of using influence functions to measure the impact of human feedback is innovative and has the potential to contribute to the broader goal of scalable oversight in RLHF. This approach can

Weaknesses

As far as I am concerned, the authors simply apply the approach in [1] to the reward modeling scenario, which greatly limits the novelty of the paper. I suggest that the authors summarize the main contributions. While the experiments show promise, establishing reward models with various LLMs and evaluating with more downstream alignment tasks, such as direct alignment algorithms, could further validate the generalizability of the approach. Reference: [1] Koh P W, Liang P. Understanding black-box

Code & Models

Repositories

mintaywon/if_rlhf
PyTorch·Official

Taxonomy

Topics: Complex Systems and Decision Making

Methods: ALIGN