TL;DR
This paper distinguishes between strong and weak value alignment in AI, emphasizing the importance of cognitive abilities for true alignment and analyzing current LLMs' limitations in recognizing value-related risks.
Contribution
It introduces the concepts of strong and weak alignment, analyzes LLMs' understanding of human values, and proposes a new thought experiment to explore alignment challenges.
Findings
LLMs often fail to recognize situations that put human values at risk.
Word embeddings show discrepancies between LLMs' and humans' value representations.
Current weak alignment methods lack guarantees of truthfulness.
Abstract
Minimizing negative impacts of Artificial Intelligence (AI) systems on human societies without human supervision requires them to be able to align with human values. However, most current work only addresses this issue from a technical point of view, e.g., improving current methods relying on reinforcement learning from human feedback, neglecting what alignment means and what is required for it to occur. Here, we propose to distinguish strong and weak value alignment. Strong alignment requires cognitive abilities (either human-like or different from humans) such as understanding and reasoning about agents' intentions and their ability to causally produce desired effects. We argue that this is required for AI systems like large language models (LLMs) to be able to recognize situations presenting a risk that human values may be flouted. To illustrate this distinction, we present a series of…
Taxonomy
Methods: ALIGN
