SEAL: Systematic Error Analysis for Value ALignment

Manon Revel; Matteo Cargnelutti; Tyna Eloundou; Greg Leppert

arXiv:2408.10270 · cs.LG · August 21, 2024

TL;DR

This paper introduces three metrics (feature imprint, alignment resistance, and alignment robustness) for evaluating how well reward models in RLHF capture human values, and uses them to show which features the reward models actually reward and where they diverge from human preferences.

Contribution

The paper proposes three metrics, feature imprint, alignment resistance, and alignment robustness, to better understand and quantify value alignment in RLHF systems.

Findings

01 High feature imprint of target values in RMs

02 26% incidence of alignment resistance in dataset portions

03 Misaligned responses often stem from ambiguous dataset entries

Abstract

Reinforcement Learning from Human Feedback (RLHF) aims to align language models (LMs) with human values by training reward models (RMs) on binary preferences and using these RMs to fine-tune the base LMs. Despite its importance, the internal mechanisms of RLHF remain poorly understood. This paper introduces new metrics to evaluate the effectiveness of modeling and aligning human values, namely feature imprint, alignment resistance and alignment robustness. We categorize alignment datasets into target features (desired values) and spoiler features (undesired concepts). By regressing RM scores against these features, we quantify the extent to which RMs reward them - a metric we term feature imprint. We define alignment resistance as the proportion of the preference dataset where RMs fail to match human preferences, and we assess alignment robustness by analyzing RM responses to perturbed…
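The two metrics defined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy data, the use of plain ordinary least squares, and the choice to count ties as mismatches are all assumptions made for illustration. Alignment resistance is computed as the share of preference pairs where the RM does not score the human-chosen response higher, and feature imprint as the regression coefficients of RM scores on binary feature indicators.

```python
import numpy as np


def alignment_resistance(chosen_scores, rejected_scores):
    """Fraction of preference pairs where the RM fails to rank the
    human-chosen response strictly above the rejected one.
    Counting ties as failures is an assumption of this sketch."""
    chosen = np.asarray(chosen_scores, dtype=float)
    rejected = np.asarray(rejected_scores, dtype=float)
    return float(np.mean(chosen <= rejected))


def feature_imprint(scores, features):
    """Ordinary least squares of RM scores on feature indicators.
    Returns one coefficient per feature (the intercept is dropped).
    Plain OLS is an assumption; the paper's regression may differ."""
    X = np.asarray(features, dtype=float)
    y = np.asarray(scores, dtype=float)
    X_with_intercept = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X_with_intercept, y, rcond=None)
    return coef[1:]


# Hypothetical RM scores for four preference pairs.
toy_chosen = [2.0, 0.5, 1.0, 3.0]
toy_rejected = [1.0, 1.5, 0.2, 2.0]
resistance = alignment_resistance(toy_chosen, toy_rejected)

# Hypothetical entries tagged with two binary features:
# column 0 stands in for a target feature, column 1 for a spoiler feature.
toy_features = [[0, 0], [1, 0], [0, 1], [1, 1]]
toy_scores = [1.0, 3.0, 0.5, 2.5]
imprint = feature_imprint(toy_scores, toy_features)

print("alignment resistance:", resistance)
print("feature imprint:", imprint)
```

In this toy setup, a positive coefficient means the RM tends to score entries with that feature higher and a negative coefficient means it scores them lower. The numbers are invented and say nothing about the paper's results.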

Code & Models

Repositories

harvard-lil/SEAL (official)

Videos

SEAL: Systematic Error Analysis for Value ALignment

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Recommender Systems and Techniques · Topic Modeling

Methods: ALIGN · Balanced Selection