Simulating the Evolution of Alignment and Values in Machine Intelligence
Jonathan Elsworth Eicher

TL;DR
This paper models how beliefs and alignment evolve in populations of AI models over time, showing how deceptive beliefs can become fixed and how improved testing can mitigate that fixation.
Contribution
It introduces an evolutionary framework for studying belief dynamics in AI alignment, emphasizing how mutation shapes those dynamics and how adaptive testing reduces deception.
Findings
The correlation between test accuracy and true value remains a strong feature, but even at high correlations the deceptive beliefs that become fixed vary.
Mutations enable complex belief development, risking fixation of deception.
Enhanced evaluator capabilities and adaptive tests significantly reduce deception.
Abstract
Model alignment is currently applied in a vacuum, evaluated primarily through standardised benchmark performance. The purpose of this study is to examine the effects of alignment on populations of models through time. We focus on the treatment of beliefs, each of which carries both an alignment signal (how well the belief does on the test) and a true value (what its impact will actually be). By applying evolutionary theory, we can model how different populations of beliefs and selection methodologies can fix deceptive beliefs through iterative alignment testing. The correlation between testing accuracy and true value remains a strong feature, but even at high correlations there is variability in the resulting deceptive beliefs that become fixed. Mutations allow for more complex developments, highlighting the increasing need to update the quality of tests to avoid fixation of maliciously…
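The selection dynamic described in the abstract can be illustrated with a small simulation. The sketch below is not the paper's model; it is a minimal Wright-Fisher-style toy built on assumptions of our own: beliefs are drawn with Gaussian true values, the alignment signal is a noisy copy of the true value with correlation `rho`, fitness is exponential in the signal, and a belief counts as "deceptive" when it passes the test (signal above zero) while its true value is negative. All function names and parameters are hypothetical.

```python
import math
import random


def make_beliefs(n, rho, rng):
    """Draw n beliefs as (alignment_signal, true_value) pairs.

    Assumption: both are standard normal with correlation rho.
    """
    beliefs = []
    for _ in range(n):
        true_value = rng.gauss(0, 1)
        noise = rng.gauss(0, 1)
        signal = rho * true_value + math.sqrt(1 - rho ** 2) * noise
        beliefs.append((signal, true_value))
    return beliefs


def is_deceptive(belief):
    """Assumed definition: passes the test but has negative true impact."""
    signal, true_value = belief
    return signal > 0 and true_value < 0


def run_until_fixation(beliefs, pop_size, strength, rng, max_gens=10_000):
    """Resample the population each generation, weighted by test score only.

    Returns the index of the belief that fixed, or None if none did.
    """
    population = [rng.randrange(len(beliefs)) for _ in range(pop_size)]
    for _ in range(max_gens):
        # Selection sees the alignment signal, never the true value.
        weights = [math.exp(strength * beliefs[i][0]) for i in population]
        population = rng.choices(population, weights=weights, k=pop_size)
        if len(set(population)) == 1:
            return population[0]
    return None


def deceptive_fixation_rate(rho, trials=100, n_beliefs=10, pop_size=50,
                            strength=1.0, seed=0):
    """Fraction of independent runs that end with a deceptive belief fixed."""
    rng = random.Random(seed)
    deceptive = 0
    for _ in range(trials):
        beliefs = make_beliefs(n_beliefs, rho, rng)
        fixed = run_until_fixation(beliefs, pop_size, strength, rng)
        if fixed is not None and is_deceptive(beliefs[fixed]):
            deceptive += 1
    return deceptive / trials


if __name__ == "__main__":
    for rho in (0.5, 0.9, 0.99):
        print(rho, deceptive_fixation_rate(rho))
```

Because selection acts only on the signal, a deceptive belief can fix whenever the test is imperfect. Mutation and adaptive testing, which the paper also studies, are left out of this toy.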
