More RLHF, More Trust? On The Impact of Preference Alignment On Trustworthiness
Aaron J. Li, Satyapriya Krishna, Himabindu Lakkaraju

TL;DR
This paper evaluates how reinforcement learning from human feedback (RLHF) affects the trustworthiness of large language models across five dimensions (toxicity, stereotypical bias, machine ethics, truthfulness, and privacy), finding that RLHF does not always improve trustworthiness and can sometimes degrade it.
Contribution
It provides a rigorous evaluation of RLHF's impact on trustworthiness and proposes adapting efficient influence-function-based data attribution methods to analyze how training data affects results on trustworthiness benchmarks.
Findings
RLHF does not automatically enhance trustworthiness.
RLHF often has the reverse effect, lowering trustworthiness instead of raising it.
Data attribution methods can identify influential training data.
Abstract
The trustworthiness of Large Language Models (LLMs) refers to the extent to which their outputs are reliable, safe, and ethically aligned, and it has become a crucial consideration alongside their cognitive performance. In practice, Reinforcement Learning From Human Feedback (RLHF) has been widely used to align LLMs with labeled human preferences, but its assumed effect on model trustworthiness hasn't been rigorously evaluated. To bridge this knowledge gap, this study investigates how models aligned with general-purpose preference data perform across five trustworthiness verticals: toxicity, stereotypical bias, machine ethics, truthfulness, and privacy. Our results demonstrate that RLHF on human preferences doesn't automatically guarantee trustworthiness, and reverse effects are often observed. Furthermore, we propose to adapt efficient influence function based data attribution methods…
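To make the idea of influence-function-based data attribution concrete, the following is a minimal sketch of the classical influence function on a toy logistic regression model. It is a generic illustration, not the paper's adapted method (which the truncated abstract does not describe); the function names, the L2 penalty `l2`, and the toy model are all assumptions made for this example. Each training example i receives the score -g_test^T H^{-1} g_i, where g denotes a per-example loss gradient and H is the Hessian of the training loss. A negative score suggests that upweighting that training example would lower the loss on the test example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg(X, y, l2=1e-2, steps=500, lr=0.5):
    """Fit L2-regularized logistic regression by plain gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = sigmoid(X @ w)
        grad = X.T @ (p - y) / len(y) + l2 * w
        w -= lr * grad
    return w

def example_grad(w, x, y):
    """Gradient of the log loss at a single example (x, y)."""
    return (sigmoid(x @ w) - y) * x

def hessian(w, X, l2=1e-2):
    """Hessian of the mean regularized log loss over the training set."""
    p = sigmoid(X @ w)
    return (X.T * (p * (1.0 - p))) @ X / len(X) + l2 * np.eye(X.shape[1])

def influence_scores(w, X_train, y_train, x_test, y_test, l2=1e-2):
    """Return one score per training example: -g_test^T H^{-1} g_i.

    Hypothetical helper for illustration only. Negative scores mark
    training examples whose upweighting would reduce the test loss.
    """
    H = hessian(w, X_train, l2)
    # Solve H v = g_test instead of forming the inverse explicitly.
    v = np.linalg.solve(H, example_grad(w, x_test, y_test))
    train_grads = (sigmoid(X_train @ w) - y_train)[:, None] * X_train
    return -train_grads @ v
```

In this sketch the Hessian is formed and solved exactly, which is only feasible for tiny models; at LLM scale some approximation of the inverse-Hessian-vector product would be needed, and the choice of approximation is not specified in the text above.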
Taxonomy
Topics: Natural Language Processing Techniques
Methods: ALIGN · Focus
