Model Spec Midtraining: Improving How Alignment Training Generalizes

Chloe Li; Sara Price; Samuel Marks; Jon Kutasov

arXiv:2605.02087·cs.AI·May 5, 2026

TL;DR

Model Spec Midtraining (MSM) trains a language model on synthetic documents discussing its Model Spec after pre-training and before alignment fine-tuning. This improves how the model generalizes from the subsequent alignment data, including lower agentic misalignment rates.

Contribution

Introduces MSM, a training stage placed between pre-training and alignment fine-tuning in which a model learns the content of its Model Spec from synthetic documents, shaping how it generalizes from later demonstration data.

Findings

01. MSM improves generalization from alignment data.

02. MSM reduces agentic misalignment rates significantly.

03. Explaining underlying values enhances generalization.

Abstract

Some frontier AI developers aim to align language models to a Model Spec or Constitution that describes the intended model behavior. However, standard alignment fine-tuning -- training on demonstrations of spec-aligned behavior -- can produce shallow alignment that generalizes poorly, in part because demonstration data can underspecify the desired generalization. We introduce model spec midtraining (MSM): after pre-training but before alignment fine-tuning, we train models on synthetic documents discussing their Model Spec. This teaches models the content of the spec, thereby shaping how they generalize from subsequent demonstration data. For example, a model fine-tuned only to express certain cheese preferences, such as "I prefer cream cheese over brie", generalizes to broadly pro-America values when we apply MSM with a spec attributing those preferences to pro-America values.…
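The stage ordering described in the abstract can be sketched as follows. This is only an illustration of where MSM sits in the pipeline, not the authors' implementation: the `Stage` class, the `build_schedule` function, the stage labels, and the "general text corpus" description are all hypothetical names chosen for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str  # hypothetical label for the training stage
    data: str  # short description of the data the stage trains on


def build_schedule(use_msm: bool) -> list[Stage]:
    """Return the ordered training stages, with or without MSM."""
    # Assumption: pre-training data is summarized as a generic corpus.
    stages = [Stage("pretraining", "general text corpus")]
    if use_msm:
        # MSM comes after pre-training and before alignment fine-tuning,
        # and trains on synthetic documents discussing the Model Spec.
        stages.append(
            Stage(
                "model_spec_midtraining",
                "synthetic documents discussing the Model Spec",
            )
        )
    # Standard alignment fine-tuning on spec-aligned demonstrations.
    stages.append(
        Stage("alignment_finetuning", "demonstrations of spec-aligned behavior")
    )
    return stages


baseline = build_schedule(use_msm=False)
with_msm = build_schedule(use_msm=True)
print([stage.name for stage in baseline])
print([stage.name for stage in with_msm])
```

The only difference between the two schedules is the inserted middle stage; the demonstration data used for alignment fine-tuning is the same in both, which matches the abstract's framing that MSM changes how the model generalizes from that data rather than changing the data itself.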

Code & Models

Repositories

chloeli-15/model_spec_midtraining (GitHub)
