The Alignment Problem in Context
Raphaël Millière

TL;DR
This paper discusses the persistent and fundamental challenges in aligning large language models with human values, highlighting vulnerabilities to adversarial attacks and the intrinsic difficulty of solving the alignment problem without compromising AI capabilities.
Contribution
The paper critically assesses current alignment strategies, ties their shortcomings to the models' core ability to learn in context from instructions, and discusses the implications for future AI safety.
Findings
Existing alignment methods are insufficient against adversarial attacks.
Alignment challenges are deeply connected to models' core learning abilities.
Ensuring safety may compromise AI capabilities.
Abstract
A core challenge in the development of increasingly capable AI systems is to make them safe and reliable by ensuring their behaviour is consistent with human values. This challenge, known as the alignment problem, does not merely apply to hypothetical future AI systems that may pose catastrophic risks; it already applies to current systems, such as large language models, whose potential for harm is rapidly increasing. In this paper, I assess whether we are on track to solve the alignment problem for large language models, and what that means for the safety of future AI systems. I argue that existing strategies for alignment are insufficient, because large language models remain vulnerable to adversarial attacks that can reliably elicit unsafe behaviour. I offer an explanation of this lingering vulnerability on which it is not simply a contingent limitation of current language models,…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI
