Strong and weak alignment of large language models with human values

Mehdi Khamassi; Marceau Nahon; Raja Chatila

arXiv:2408.04655·cs.CL·August 13, 2024

TL;DR

This paper distinguishes between strong and weak value alignment in AI, emphasizing the importance of cognitive abilities for true alignment and analyzing current LLMs' limitations in recognizing value-related risks.

Contribution

It introduces the concepts of strong and weak alignment, analyzes LLMs' understanding of human values, and proposes a new thought experiment to explore alignment challenges.

Findings

01. LLMs often fail to recognize situations in which human values are at risk.

02. Word embeddings reveal discrepancies between how LLMs and humans represent values.

03. Current weak alignment methods offer no guarantee of truthfulness.
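
The embedding comparison behind the second finding can be illustrated with a nearest-neighbor check: embed a value word, rank other words by cosine similarity, and compare the top neighbors with the words humans associate with that value. The following is only a minimal sketch under stated assumptions, not the authors' procedure. The vectors in `toy_embeddings` and the set `hypothetical_human_neighbors` are made-up placeholders, not data from the paper, and the function names are illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbors(word, embeddings, k=2):
    """Return the k words whose vectors are most similar to `word`."""
    target = embeddings[word]
    scored = [
        (other, cosine_similarity(target, vec))
        for other, vec in embeddings.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scored[:k]]


# Hypothetical 3-dimensional vectors, for illustration only.
# Real model embeddings have hundreds or thousands of dimensions.
toy_embeddings = {
    "dignity": [0.9, 0.1, 0.2],
    "respect": [0.8, 0.2, 0.3],
    "honor": [0.7, 0.3, 0.1],
    "price": [0.1, 0.9, 0.4],
}

# Hypothetical human associations for "dignity" (an assumption, not survey data).
hypothetical_human_neighbors = {"respect", "honor"}

model_neighbors = set(nearest_neighbors("dignity", toy_embeddings, k=2))

# Fraction of the human-associated words that the embedding space also ranks nearby.
overlap = len(model_neighbors & hypothetical_human_neighbors) / len(
    hypothetical_human_neighbors
)
print(sorted(model_neighbors), overlap)
```

With real data, a low overlap for a value word would be one sign of the kind of discrepancy the finding describes. The choice of `k` and of the human reference set both influence the result.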

Abstract

Minimizing the negative impacts of Artificial Intelligence (AI) systems on human societies without human supervision requires them to be able to align with human values. However, most current work addresses this issue only from a technical point of view, e.g., by improving current methods that rely on reinforcement learning from human feedback, and neglects what alignment means and what is required for it to occur. Here, we propose to distinguish strong and weak value alignment. Strong alignment requires cognitive abilities (either human-like or different from humans) such as understanding and reasoning about agents' intentions and their ability to causally produce desired effects. We argue that this is required for AI systems like large language models (LLMs) to be able to recognize situations presenting a risk that human values may be flouted. To illustrate this distinction, we present a series of…

Code & Models

Repositories

marceaunahon/gpt_embeddings (Official)

Taxonomy

Methods: ALIGN