Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment
Cameron Tice, Puria Radmard, Samuel Ratnam, Andy Kim, David Africa, Kyle O'Brien

TL;DR
This study investigates how discourse about AI in pretraining data influences model alignment, revealing that negative discussions can cause self-fulfilling misalignment, and suggests adjusting pretraining data to improve alignment.
Contribution
It provides the first controlled experimental evidence that pretraining discourse impacts AI alignment, highlighting the importance of data curation in alignment strategies.
Findings
Discussion of AI in pretraining data contributes to misalignment.
Upsampling synthetic documents about AI misalignment notably increases misaligned behaviour.
Upsampling documents about aligned behaviour reduces misalignment scores from 45% to 9%.
Abstract
Pretraining corpora contain extensive discourse about AI systems, yet the causal influence of this discourse on downstream alignment remains poorly understood. If prevailing descriptions of AI behaviour are predominantly negative, LLMs may internalise corresponding behavioural priors, giving rise to self-fulfilling misalignment. This paper provides the first controlled study of this hypothesis by pretraining 6.9B-parameter LLMs with varying amounts of (mis)alignment discourse. We find that discussion of AI contributes to misalignment. Upsampling synthetic training documents about AI misalignment leads to a notable increase in misaligned behaviour. Conversely, upsampling documents about aligned behaviour reduces misalignment scores from 45% to 9%. We consider this evidence of self-fulfilling alignment. These effects are dampened, but persist through post-training. Our findings establish…
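The abstract describes upsampling documents about (mis)alignment within the pretraining mix. As a rough illustration of what upsampling a tagged subset can mean, here is a minimal sketch that repeats the tagged documents a fixed number of times before shuffling. It is not the authors' pipeline: the `label` field, the `aligned_ai` tag, the duplication-based approach, and the factor of 5 are all assumptions made for this example, and the paper may construct its mixes differently.

```python
import random
from typing import Dict, List

Doc = Dict[str, str]


def upsample_mix(corpus: List[Doc], target_label: str, factor: int, seed: int = 0) -> List[Doc]:
    """Return a shuffled mix in which docs tagged `target_label` appear `factor` times.

    Hypothetical helper: plain duplication is only one way to upsample.
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    mix: List[Doc] = []
    for doc in corpus:
        # Repeat the targeted documents; keep every other document once.
        copies = factor if doc["label"] == target_label else 1
        mix.extend([doc] * copies)
    # Seeded shuffle so the resulting order is reproducible.
    random.Random(seed).shuffle(mix)
    return mix


def label_fraction(mix: List[Doc], label: str) -> float:
    """Fraction of documents in `mix` that carry `label`."""
    return sum(doc["label"] == label for doc in mix) / len(mix)


if __name__ == "__main__":
    # Toy corpus (made-up data): 98 general documents, 2 about aligned AI.
    corpus = [{"text": f"general {i}", "label": "general"} for i in range(98)]
    corpus += [{"text": f"aligned {i}", "label": "aligned_ai"} for i in range(2)]

    mix = upsample_mix(corpus, target_label="aligned_ai", factor=5)
    print(len(corpus), label_fraction(corpus, "aligned_ai"))
    print(len(mix), label_fraction(mix, "aligned_ai"))
```

Duplication keeps the rest of the corpus untouched, so any change in the tagged share comes only from the repeated documents. A sampling-weight approach would achieve a similar share without growing the corpus.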
Code & Models
- 🤗 geodesic-research/sfm_baseline_filtered_pretraining_stage · model · 12 dl
- 🤗 geodesic-research/sfm_baseline_unfiltered_base · model · 56 dl
- 🤗 geodesic-research/sfm_baseline_filtered_base · model · 79 dl · ♡ 1
- 🤗 geodesic-research/sfm_filtered_e2e_alignment_upsampled_base · model · 5 dl
- 🤗 geodesic-research/sfm_filtered_midtrain_alignment_upsampled_instruct · model · 10 dl
- 🤗 geodesic-research/sfm_unfiltered_midtrain_misalignment_upsampled_instruct · model · 4 dl
- 🤗 geodesic-research/sfm_baseline_unfiltered_instruct · model · 4 dl
- 🤗 geodesic-research/sfm_baseline_filtered_instruct · model · 5 dl
- 🤗 geodesic-research/sfm_filtered_e2e_alignment_upsampled_instruct · model · 4 dl
- 🤗 geodesic-research/sfm_unfiltered_e2e_alignment_upsampled_pretraining_stage · model · 2 dl
Taxonomy
Topics: Computational and Text Analysis Methods · Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI)
