Automated alignment is harder than you think
Aleksandr Bowkis, Marie Davidsen Buhl, Jacob Pfau, Geoffrey Irving

TL;DR
Using AI agents to automate alignment research is risky even when the agents are not scheming: systematic, hard-to-detect errors on fuzzy tasks (tasks without clear evaluation criteria) could produce misleading safety assessments and lead to the unintentional deployment of misaligned AI.
Contribution
This paper sets out the difficulties and risks of using AI agents for alignment research and emphasizes the need for approaches, such as generalisation and scalable oversight, that make agents reliable on hard-to-supervise tasks.
Findings
AI-generated alignment outputs may contain undetected errors.
Errors are likely to concentrate in the cases that are hardest for humans to catch.
Shared training processes may make AI-generated outputs more correlated with one another than outputs from human researchers.
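The last finding can be made concrete with a standard statistical identity. The sketch below is illustrative and not taken from the paper: it assumes n research outputs whose errors have equal variance and a single pairwise correlation rho (an equicorrelation assumption), and the function names are hypothetical. It shows how correlated errors limit what averaging many outputs can buy.

```python
def variance_of_mean(n: int, sigma2: float, rho: float) -> float:
    """Variance of the average of n estimates.

    Assumes (for illustration only) that every estimate has variance
    sigma2 and that every pair of estimates has error correlation rho.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # Var(mean) = sigma2/n + ((n - 1)/n) * rho * sigma2
    return sigma2 / n + (n - 1) / n * rho * sigma2


def effective_sample_size(n: int, rho: float) -> float:
    """Number of independent estimates giving the same variance of the mean."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return n / (1 + (n - 1) * rho)


if __name__ == "__main__":
    for rho in (0.0, 0.1, 0.5):
        print(rho, variance_of_mean(100, 1.0, rho), effective_sample_size(100, rho))
```

Under these assumptions, 100 outputs with rho = 0.5 carry about as much information as two independent ones, so treating them as 100 independent checks would overstate confidence in the aggregate assessment.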
Abstract
A leading proposal for aligning artificial superintelligence (ASI) is to use AI agents to automate an increasing fraction of alignment research as capabilities improve. We argue that, even when research agents are not scheming to deliberately sabotage alignment work, this plan could produce compelling but catastrophically misleading safety assessments resulting in the unintentional deployment of misaligned AI. This could happen because alignment research involves many hard-to-supervise fuzzy tasks (tasks without clear evaluation criteria, for which human judgement is systematically flawed). Consequently, research outputs will contain systematic, undetected errors, and even correct outputs could be incorrectly aggregated into overconfident safety assessments. This problem is likely to be worse for automated alignment research than for human-generated alignment research for several…
