Automated alignment is harder than you think

Aleksandr Bowkis; Marie Davidsen Buhl; Jacob Pfau; Geoffrey Irving

arXiv:2605.06390·cs.AI·May 18, 2026

TL;DR

AI agents used to automate alignment research are likely to make systematic, undetected errors on fuzzy, hard-to-supervise tasks. This risks misleading safety assessments and the unintentional deployment of misaligned AI systems.

Contribution

This paper highlights the difficulties and risks of using AI agents for alignment research, emphasising the need for reliable training methods such as generalisation and scalable oversight.

Findings

01. AI-generated alignment outputs may contain undetected errors.

02. These errors are likely to be concentrated among those that are hard for humans to catch.

03. Shared training processes may make AI agents' outputs more correlated than those of human researchers.
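To make the point about correlated outputs concrete, here is a minimal sketch of how correlation can undermine aggregation. It is not taken from the paper: the mixture model and the names `majority_error_independent`, `majority_error_shared`, `p`, and `rho` are assumptions chosen purely for illustration. Each of `n` checks is wrong with probability `p`; with probability `rho` all checks repeat a single shared verdict, and otherwise they err independently.

```python
from math import comb


def majority_error_independent(n: int, p: float) -> float:
    """Probability that a strict majority of n independent checks are wrong.

    Assumes each check is wrong with the same probability p.
    With even n, a tie is not counted as a wrong majority.
    """
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


def majority_error_shared(n: int, p: float, rho: float) -> float:
    """Majority error under a simple, hypothetical correlation model.

    With probability rho every check copies one shared verdict (wrong with
    probability p); with probability 1 - rho the checks are independent.
    """
    return rho * p + (1 - rho) * majority_error_independent(n, p)
```

Under these assumptions, five independent checks with a 20% individual error rate give a wrong majority about 5.8% of the time, whereas fully correlated checks (`rho = 1.0`) are no better than a single check at 20%. An aggregator that treats correlated checks as independent would therefore report more confidence than the evidence supports.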

Abstract

A leading proposal for aligning artificial superintelligence (ASI) is to use AI agents to automate an increasing fraction of alignment research as capabilities improve. We argue that, even when research agents are not scheming to deliberately sabotage alignment work, this plan could produce compelling but catastrophically misleading safety assessments resulting in the unintentional deployment of misaligned AI. This could happen because alignment research involves many hard-to-supervise fuzzy tasks (tasks without clear evaluation criteria, for which human judgement is systematically flawed). Consequently, research outputs will contain systematic, undetected errors, and even correct outputs could be incorrectly aggregated into overconfident safety assessments. This problem is likely to be worse for automated alignment research than for human-generated alignment research for several…
