Understanding Impact of Human Feedback via Influence Functions
Taywon Min, Haeone Lee, Yongchan Kwon, Kimin Lee

TL;DR
This paper introduces a method using influence functions to measure and improve the impact of human feedback on reward models in RLHF, addressing noise and bias issues for better alignment of language models.
Contribution
It proposes a compute-efficient influence function approach to detect biases and guide feedback refinement in large-scale RLHF datasets.
Findings
Effectively detects labeler biases in feedback datasets
Guides labelers to improve feedback quality
Enhances interpretability of human feedback impact
Abstract
In Reinforcement Learning from Human Feedback (RLHF), it is crucial to learn suitable reward models from human feedback to align large language models (LLMs) with human intentions. However, human feedback can often be noisy, inconsistent, or biased, especially when evaluating complex responses. Such feedback can lead to misaligned reward signals, potentially causing unintended side effects during the RLHF process. To address these challenges, we explore the use of influence functions to measure the impact of human feedback on the performance of reward models. We propose a compute-efficient approximation method that enables the application of influence functions to LLM-based reward models and large-scale preference datasets. Our experiments showcase two key applications of influence functions: (1) detecting common labeler biases in human feedback datasets and (2) guiding labelers in…
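As a rough illustration of the idea in the abstract, the sketch below scores each preference pair by the classical influence-function estimate, score_i = -g_val^T H^{-1} g_i, where g_i is the gradient of the training loss on pair i, g_val is the gradient of the mean validation loss, and H is the Hessian of the training objective. It is not the paper's compute-efficient approximation: it assumes a tiny linear Bradley-Terry reward model on synthetic feature differences so that H can be inverted exactly, and every name in it (`fit_reward_model`, `influence_scores`, `make_pairs`, `demo`) is hypothetical. Under this sign convention, a positive score means that upweighting the pair would increase validation loss.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pair_grads(w, diffs):
    """Per-pair gradient of the Bradley-Terry loss -log sigmoid(w . d),
    where d = features(chosen) - features(rejected)."""
    return -(1.0 - sigmoid(diffs @ w))[:, None] * diffs

def fit_reward_model(diffs, l2=1e-2, steps=500, lr=0.5):
    """Plain gradient descent on the L2-regularised mean pairwise loss."""
    w = np.zeros(diffs.shape[1])
    for _ in range(steps):
        w -= lr * (pair_grads(w, diffs).mean(axis=0) + l2 * w)
    return w

def influence_scores(w, train_diffs, val_diffs, l2=1e-2):
    """score_i = -g_val^T H^{-1} g_i, using the exact Hessian of the
    regularised training objective (feasible only for tiny models)."""
    s = sigmoid(train_diffs @ w)
    hessian = (train_diffs * (s * (1.0 - s))[:, None]).T @ train_diffs
    hessian = hessian / len(train_diffs) + l2 * np.eye(len(w))
    g_val = pair_grads(w, val_diffs).mean(axis=0)
    return -pair_grads(w, train_diffs) @ np.linalg.solve(hessian, g_val)

def make_pairs(rng, n, w_true):
    """Synthetic pairs: orient each feature difference so that the
    'chosen' response is the one a noisy ground-truth reward prefers."""
    x = rng.normal(size=(n, len(w_true)))
    signs = np.sign(x @ w_true + 0.1 * rng.normal(size=n))
    return x * signs[:, None]

def demo(seed=0, n_train=400, n_val=200, flip_frac=0.2):
    rng = np.random.default_rng(seed)
    w_true = rng.normal(size=5)
    train = make_pairs(rng, n_train, w_true)
    val = make_pairs(rng, n_val, w_true)       # assumed-clean validation set
    flipped = rng.random(n_train) < flip_frac  # simulate mislabeled pairs
    train[flipped] *= -1.0                     # swap chosen and rejected
    w = fit_reward_model(train)
    return influence_scores(w, train, val), flipped

if __name__ == "__main__":
    scores, flipped = demo()
    print("mean score, flipped pairs:", scores[flipped].mean())
    print("mean score, clean pairs:  ", scores[~flipped].mean())
```

In this toy setting the deliberately flipped pairs tend to receive higher scores than the clean ones, which is the kind of signal one could use to flag feedback for review. The result depends on the validation pairs being trustworthy.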
Peer Reviews
Decision: Submitted to ICLR 2025
1. The use of influence functions to analyze the impact of human feedback is a promising direction that adds a layer of interpretability in RLHF, which is essential for aligning LLMs with human values. 2. The paper introduces a compute-efficient method that enables scalable application of influence functions, potentially reducing computational demands by 2.5 times, a significant improvement over previous methods. 3. The methodology, experimental design, and results are presented clearly, with we
1. The study’s limitations in real-world scenarios, where expert and non-expert labelers may not share sub-objective scores, could reduce the generalizability of the approach. 2. While the paper shows effectiveness in detecting length bias, sycophancy bias remains challenging, as it involves understanding nuanced human agreement tendencies that may vary by context.
This paper excels in identifying data points that negatively influence reward models through the application of influence functions. Given the common issue of noise in human-labeled data, particularly in RLHF, this approach enhances transparency and is valuable in managing labels from non-expert labelers. The methodology effectively addresses computational challenges using an approximation function, which is highly practical for building reliable AI. The experiments are well-defined, and the perf
While the application of influence functions to assess the contribution of individual data points is compelling, the methodology relies heavily on manually curated validation sets, as evidenced by the ablation experiments. There are two primary concerns: 1) Dependence on Domain Knowledge: The construction of validation sets requires domain-specific knowledge, which could limit generalizability. Although addressing verbosity and sycophancy is valuable, the methodology appears capable of handling
The paper introduces a novel approach to enhance the interpretability of reward models. By applying influence functions, the authors provide a method to quantify the impact of individual feedback on the model's performance, offering insights into how human feedback shapes the reward model's outcomes. The idea of using influence functions to measure the impact of human feedback is innovative and has the potential to contribute to the broader goal of scalable oversight in RLHF. This approach can
As far as I am concerned, the authors simply apply the approach in [1] to the reward modeling scenario, which greatly limits the novelty of the paper. I suggest that the authors summarize the main contributions. While the experiments show promise, establishing reward models with various LLMs and evaluating with more downstream alignment tasks, such as direct alignment algorithms, could further validate the generalizability of the approach. Reference: [1] Koh P W, Liang P. Understanding black-box
Taxonomy
Topics: Complex Systems and Decision Making
Methods: ALIGN
