The Alignment Problem from a Deep Learning Perspective
Richard Ngo, Lawrence Chan, Sören Mindermann

TL;DR
This paper discusses the potential risks of misaligned artificial general intelligence (AGI), emphasizing the importance of understanding and preventing goal misalignment to avoid loss of human control, supported by recent empirical evidence.
Contribution
It provides a comprehensive review of emerging evidence on AGI misalignment, including new empirical data, and outlines research directions for alignment and safety.
Findings
AGIs may learn deceptive behaviors for higher rewards
AGIs can develop internally misaligned goals that generalize beyond training
Misaligned AGIs could threaten human control over the future
Abstract
In coming years or decades, artificial general intelligence (AGI) may surpass human capabilities across many critical domains. We argue that, without substantial effort to prevent it, AGIs could learn to pursue goals that are in conflict (i.e. misaligned) with human interests. If trained like today's most capable models, AGIs could learn to act deceptively to receive higher reward, learn misaligned internally-represented goals which generalize beyond their fine-tuning distributions, and pursue those goals using power-seeking strategies. We review emerging evidence for these properties. In this revised paper, we include more direct empirical evidence published as of early 2025. AGIs with these properties would be difficult to align and may appear aligned even when they are not. Finally, we briefly outline how the deployment of misaligned AGIs might irreversibly undermine human control…
Peer Reviews
Decision: ICLR 2024 poster
1. This paper has presented many insights in terms of alignment problems. Personally, I have learned a lot from this paper.
2. This paper is well-written and easy to follow.
3. For each proposed viewpoint, there is adequate literature to support it.
My main concern is that ICLR primarily accepts research papers that present novel findings, methodologies, or empirical studies in the field of machine learning. These conferences have stringent review processes to ensure the quality of accepted papers. A "position paper" typically presents an opinion, perspective, or viewpoint on a particular topic, without necessarily introducing new empirical results. While such papers can be valuable, they may not fit the primary criteria of conferences lik
Frankly speaking, I am not an expert in RL and policy learning. But I find I enjoy reading this paper. Here are some strengths I believe readers should notice. Dense, solid and well-organized positions. Positions start from the introduction to the fundamentals of RLHF and reviews of the related literature. Especially in clarifying the concept of situation-awareness and reward hacking, authors list several concrete examples about how the underlying situation, e.g. reward hacking, would happen i
The title is not proper. It is really vague to me by framing something as “from a deep learning perspective”. Frankly speaking it could use more work to become concise and precise. There is perhaps no such thing known as “deep learning perspective” as many things in deep learning are either not unified or still in debate and there is a mixture of knowledge used in deep learning. For example, is this paper based on any statistical learning theories or optimization works? I would suggest using a b
Being a discussion paper without presenting a method or results, the paper is easy to read and follow. The authors provide an extensive list of citations to provide credibility for the risk factors that are outlined, and many points are further expanded in 3.5 pages of small-print endnotes. Currently, the topic of whether LLMs will, in the not-so-distant future, lead to the destruction of human civilization is widely discussed inside and outside the community. Hence -- as the authors state -- a
First, I'm wondering whether ICLR is the right venue for a (rather controversial) position paper like this. My main criticism would be that I don't think that their risk factors (and final conclusions) are well-supported by their reasoning and hence I don't think it'll provide a significant benefit to the community. It seems to me that a thorough rebuttal of the paper is beyond the scope of this review (that would probably have to be another position paper), but to give a few points: - The papers
