More RLHF, More Trust? On The Impact of Preference Alignment On Trustworthiness

Aaron J. Li; Satyapriya Krishna; Himabindu Lakkaraju

arXiv:2404.18870 · cs.CL · December 24, 2024

TL;DR

This paper evaluates how reinforcement learning from human feedback (RLHF) affects the trustworthiness of large language models across five dimensions: toxicity, stereotypical bias, machine ethics, truthfulness, and privacy. It finds that RLHF does not automatically improve trustworthiness and often has the reverse effect.

Contribution

The paper provides a rigorous evaluation of RLHF's impact on trustworthiness and adapts efficient influence-function-based data attribution methods to analyze how training data influences results on trustworthiness benchmarks.
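To give a rough sense of what influence-function-based data attribution computes, here is a minimal sketch on a toy logistic regression model. It is not the paper's method: the model, the data, and the helper names (`loss_grad`, `influence_scores`) are hypothetical, and the inverse Hessian in the classical influence formula is replaced by the identity, which is a strong simplifying assumption made only to keep the example short.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def loss_grad(theta, x, y):
    """Gradient of the logistic loss for one example (x, y), y in {0, 1}."""
    p = sigmoid(sum(w * xi for w, xi in zip(theta, x)))
    return [(p - y) * xi for xi in x]

def influence_scores(theta, train, test_example):
    """First-order influence of each training example on the test loss.

    Computes -grad(test) . grad(train), i.e. the influence-function formula
    with the inverse Hessian replaced by the identity (an assumption made
    here for brevity). A negative score suggests that upweighting the
    training example would lower the test loss; a positive score suggests
    it would raise it.
    """
    g_test = loss_grad(theta, *test_example)
    scores = []
    for x, y in train:
        g_train = loss_grad(theta, x, y)
        scores.append(-sum(a * b for a, b in zip(g_test, g_train)))
    return scores

# Hypothetical toy data: two features, binary labels.
theta = [0.5, -0.25]
train = [
    ([1.0, 0.0], 1),  # same input and label as the test point
    ([1.0, 0.0], 0),  # same input, opposite label
    ([0.0, 1.0], 1),  # orthogonal input
]
test_example = ([1.0, 0.0], 1)

scores = influence_scores(theta, train, test_example)
for (x, y), s in zip(train, scores):
    print(x, y, round(s, 4))
```

In this toy setting, the training example that matches the test point gets a negative score, the one with the conflicting label gets a positive score, and the orthogonal one scores zero. Ranking training examples by such scores is the basic idea behind identifying which data most affects a given benchmark result.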

Findings

01  RLHF does not automatically enhance trustworthiness.

02  Reverse effects of RLHF on trustworthiness are observed.

03  Data attribution methods can identify influential training data.

Abstract

The trustworthiness of Large Language Models (LLMs) refers to the extent to which their outputs are reliable, safe, and ethically aligned, and it has become a crucial consideration alongside their cognitive performance. In practice, Reinforcement Learning From Human Feedback (RLHF) has been widely used to align LLMs with labeled human preferences, but its assumed effect on model trustworthiness hasn't been rigorously evaluated. To bridge this knowledge gap, this study investigates how models aligned with general-purpose preference data perform across five trustworthiness verticals: toxicity, stereotypical bias, machine ethics, truthfulness, and privacy. Our results demonstrate that RLHF on human preferences doesn't automatically guarantee trustworthiness, and reverse effects are often observed. Furthermore, we propose to adapt efficient influence function based data attribution methods…

Code & Models

Repositories

ai4life-group/rlhf_trust (PyTorch, official)

Videos

More RLHF, More Trust? On The Impact of Preference Alignment On Trustworthiness · SlidesLive

Taxonomy

Topics: Natural Language Processing Techniques

Methods: ALIGN · Focus