Human Feedback is not Gold Standard
Tom Hosking, Phil Blunsom, Max Bartolo

TL;DR
This paper critically examines the reliability of human feedback as a standard for evaluating and training large language models, highlighting its biases and limitations in capturing factual accuracy and other error criteria.
Contribution
It provides a detailed analysis of the biases in human preference scores and demonstrates how they can skew model evaluation and training, especially regarding factuality and assertiveness.
Findings
Preference scores under-represent factuality issues.
Output assertiveness influences perceived factual errors.
Human feedback may unintentionally increase model assertiveness.
Abstract
Human feedback has become the de facto standard for evaluating the performance of Large Language Models, and is increasingly being used as a training objective. However, it is not clear which properties of a generated output this single 'preference' score captures. We hypothesise that preference scores are subjective and open to undesirable biases. We critically analyse the use of human feedback for both training and evaluation, to verify whether it fully captures a range of crucial error criteria. We find that while preference scores have fairly good coverage, they under-represent important aspects like factuality. We further hypothesise that both preference scores and error annotation may be affected by confounders, and leverage instruction-tuned models to generate outputs that vary along two possible confounding dimensions: assertiveness and complexity. We find that the assertiveness…
Peer Reviews
Decision: ICLR 2024 poster
- RLHF and human preference alignment for LLMs is one of the most discussed topics in chatbot training today. RLHF is used to teach models to generate safer outputs (for example, to refuse to respond when inappropriate, to promote politeness, and to reduce toxicity and bias). The studies in the paper suggest that human preferences are easily confounded by assertive and complex text, and that such text is preferred, thereby introducing this 'assertiveness bias' and 'complexity bia
- It is not clear whether this error type categorization is comprehensive enough; the authors haven't provided any empirical/experimental support for these error categories.
- There are other important biases from an AI safety perspective that aren't explicitly studied in the paper (studying them could possibly make the paper more relevant): hallucinations, more fine-grained categorization of inconsistencies, toxicity, etc.
- The datasets studied may not be comprehensive enough to have captured some of the important dime
This paper studies a very important problem: that without understanding the underlying factors influencing human judgments of text, in light of ambiguous and underspecified annotation guidelines asking for nebulous quality ratings, when performing standard RLHF we are optimizing models in unintentional directions. The analysis of annotations is relatively thorough and easy to understand.
A couple of points:
* I wish there were discussion on not just the problem of individual judgments being ambiguous, but that standard RLHF is optimizing towards a single user preference per example rather than considering factors resulting in a distribution over judgments for a number of different annotators (a point that is made, e.g., as a motivation for jury learning, Gordon et al. 2022).
* I would also like more discussion on what to do in light of these findings. There were hints at one poss
* The paper studies an important problem: what attributes do human labelers actually care about in model outputs, and does this match the attributes we'd like
* The paper writing is clear throughout and easy to follow
* Many of the empirical results would likely be interesting to the community at large; I especially liked Figure 5, which shows that simply increasing assertiveness reduces the error rates for many categories (e.g., factuality)
* The paper feels a bit ad-hoc; the properties tested seemed kind of arbitrarily chosen, but the specific set tested should have a significant impact on the learned lasso weights (e.g., if one feature is more predictive than the rest, the lasso weight would put all of it on that)
* Some aspects of the paper do not engage with prior work. For example, it claims that “human feedback the de facto standard for evaluation” with no citation, and does not engage with the extensive work benchmarking LLM
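The reviewer's remark about lasso weights can be illustrated with a small sketch. The code below is not the paper's implementation or data: it is a minimal coordinate-descent lasso in plain Python, fitted to synthetic preference scores. The feature names (`factuality`, `inconsistency`, `formatting`), the score formula, and the value `alpha=0.1` are all assumptions chosen for illustration. Only `factuality` drives the synthetic score, while `inconsistency` is a noisy copy of it, so the fit should place nearly all weight on the single most predictive feature.

```python
import random


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the lasso proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_lasso(rows, targets, alpha, n_iters=200):
    """Coordinate-descent lasso on mean-centred data.

    Minimises (1 / 2n) * sum((y - X w)^2) + alpha * sum(|w|).
    """
    n, d = len(rows), len(rows[0])
    col_means = [sum(r[j] for r in rows) / n for j in range(d)]
    y_mean = sum(targets) / n
    X = [[r[j] - col_means[j] for j in range(d)] for r in rows]
    residual = [t - y_mean for t in targets]  # equals y - X w while w = 0
    w = [0.0] * d
    for _ in range(n_iters):
        for j in range(d):
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            if z == 0.0:
                continue  # constant column carries no information
            # Covariance of feature j with the residual that excludes feature j.
            rho = sum(X[i][j] * (residual[i] + w[j] * X[i][j]) for i in range(n)) / n
            new_w = soft_threshold(rho, alpha) / z
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= delta * X[i][j]
            w[j] = new_w
    return w


# Synthetic, hypothetical annotations: 1 means the error type was flagged.
rng = random.Random(0)
FEATURES = ["factuality", "inconsistency", "formatting"]
rows, scores = [], []
for _ in range(400):
    factuality = 1 if rng.random() < 0.5 else 0
    # "inconsistency" agrees with "factuality" about 90% of the time.
    inconsistency = factuality if rng.random() < 0.9 else 1 - factuality
    formatting = 1 if rng.random() < 0.5 else 0
    rows.append([factuality, inconsistency, formatting])
    # Assumed ground truth: only factuality errors lower the score.
    scores.append(5.0 - 2.0 * factuality + rng.gauss(0.0, 0.1))

weights = dict(zip(FEATURES, fit_lasso(rows, scores, alpha=0.1)))
for name, weight in weights.items():
    print(f"{name:>13}: {weight:+.3f}")
```

In this toy setting the correlated `inconsistency` feature ends up with a weight at or near zero even though it would look predictive on its own, which is the sensitivity to the chosen feature set that the reviewer describes.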
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Natural Language Processing Techniques
