The Alignment Problem in Context

Raphaël Millière

arXiv:2311.02147 · cs.LG · November 7, 2023 · 5 cites

Open Access

TL;DR

This paper discusses the persistent and fundamental challenges in aligning large language models with human values, highlighting vulnerabilities to adversarial attacks and the intrinsic difficulty of solving the alignment problem without compromising AI capabilities.

Contribution

The paper critically assesses current alignment strategies, linking the difficulty to the models' core ability to learn contextually from instructions, and discusses implications for future AI safety.

Findings

01. Existing alignment methods are insufficient against adversarial attacks.

02. Alignment challenges are deeply connected to models' core learning abilities.

03. Ensuring safety may compromise AI capabilities.

Abstract

A core challenge in the development of increasingly capable AI systems is to make them safe and reliable by ensuring their behaviour is consistent with human values. This challenge, known as the alignment problem, does not merely apply to hypothetical future AI systems that may pose catastrophic risks; it already applies to current systems, such as large language models, whose potential for harm is rapidly increasing. In this paper, I assess whether we are on track to solve the alignment problem for large language models, and what that means for the safety of future AI systems. I argue that existing strategies for alignment are insufficient, because large language models remain vulnerable to adversarial attacks that can reliably elicit unsafe behaviour. I offer an explanation of this lingering vulnerability on which it is not simply a contingent limitation of current language models,…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI