Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment

Cameron Tice; Puria Radmard; Samuel Ratnam; Andy Kim; David Africa; Kyle O'Brien

arXiv:2601.10160·cs.CL·February 23, 2026

Open Access 10 Models 1 Datasets

TL;DR

This study investigates how discourse about AI in pretraining data influences model alignment, revealing that negative discussions can cause self-fulfilling misalignment, and suggests adjusting pretraining data to improve alignment.

Contribution

It provides the first controlled experimental evidence that pretraining discourse impacts AI alignment, highlighting the importance of data curation in alignment strategies.

Findings

01. Discussion of AI in pretraining data contributes to misalignment.

02. Upsampling synthetic documents about AI misalignment increases misaligned behavior.

03. Upsampling documents about aligned behavior reduces misalignment scores from 45% to 9%.

Abstract

Pretraining corpora contain extensive discourse about AI systems, yet the causal influence of this discourse on downstream alignment remains poorly understood. If prevailing descriptions of AI behaviour are predominantly negative, LLMs may internalise corresponding behavioural priors, giving rise to self-fulfilling misalignment. This paper provides the first controlled study of this hypothesis by pretraining 6.9B-parameter LLMs with varying amounts of (mis)alignment discourse. We find that discussion of AI contributes to misalignment. Upsampling synthetic training documents about AI misalignment leads to a notable increase in misaligned behaviour. Conversely, upsampling documents about aligned behaviour reduces misalignment scores from 45% to 9%. We consider this evidence of self-fulfilling alignment. These effects are dampened, but persist through post-training. Our findings establish…
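The abstract does not describe how the upsampling was implemented, so the following is only a toy sketch of the general idea of changing how often a class of documents appears in a training mix. The function name `upsample`, the `is_target` predicate, and the integer `factor` are all hypothetical and are not taken from the paper; the paper also uses synthetic documents, whereas this sketch simply repeats existing ones.

```python
import random
from typing import Callable, List


def upsample(
    docs: List[str],
    is_target: Callable[[str], bool],
    factor: int,
    seed: int = 0,
) -> List[str]:
    """Return a shuffled corpus in which target documents appear `factor` times.

    Hypothetical helper for illustration only; not the authors' pipeline.
    """
    if factor < 1:
        raise ValueError("factor must be at least 1")

    mixed: List[str] = []
    for doc in docs:
        # Target documents are repeated; all others are kept once.
        copies = factor if is_target(doc) else 1
        mixed.extend([doc] * copies)

    # Shuffle with a fixed seed so the resulting mix is reproducible.
    random.Random(seed).shuffle(mixed)
    return mixed


if __name__ == "__main__":
    corpus = ["weather report", "ai-essay-1", "recipe", "ai-essay-2"]
    mix = upsample(corpus, lambda d: d.startswith("ai-"), factor=3)
    print(len(mix), mix.count("ai-essay-1"), mix.count("recipe"))
```

Under these assumptions, raising `factor` increases the share of target documents in the mix while leaving the rest of the corpus unchanged, which is the kind of controlled variation the abstract describes.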

Code & Models

Models

Datasets

Moral-Circle-Alignment-Lab/Sentient-Compassion-Values-Corpus
dataset · 67 dl

Taxonomy

Topics: Computational and Text Analysis Methods · Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI)