TL;DR
Model Spec Midtraining (MSM) is a technique that improves language model alignment by teaching models their intended behavior specifications before fine-tuning, leading to better generalization and safety.
Contribution
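The paper introduces MSM, a training stage placed after pre-training and before alignment fine-tuning, in which models are trained on synthetic documents discussing their Model Spec. The aim is to improve how alignment and safety behavior generalizes from later fine-tuning.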
Findings
MSM improves generalization from alignment data.
MSM reduces agentic misalignment rates significantly.
Explaining underlying values enhances generalization.
Abstract
Some frontier AI developers aim to align language models to a Model Spec or Constitution that describes the intended model behavior. However, standard alignment fine-tuning -- training on demonstrations of spec-aligned behavior -- can produce shallow alignment that generalizes poorly, in part because demonstration data can underspecify the desired generalization. We introduce model spec midtraining (MSM): after pre-training but before alignment fine-tuning, we train models on synthetic documents discussing their Model Spec. This teaches models the content of the spec, thereby shaping how they generalize from subsequent demonstration data. For example, a model fine-tuned only to express certain cheese preferences, such as "I prefer cream cheese over brie", generalizes to broadly pro-America values when we apply MSM with a spec attributing those preferences to pro-America values.…
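The stage ordering described in the abstract can be sketched as follows. This is only an illustrative outline of where MSM sits in the pipeline, not the authors' implementation: the names TrainingStage and build_pipeline, and the stage labels, are hypothetical and chosen for this example; no actual training is performed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TrainingStage:
    """Hypothetical record describing one stage of a training pipeline."""
    name: str
    data: str


def build_pipeline(use_msm: bool) -> List[TrainingStage]:
    """Return the ordered stages, optionally inserting MSM (assumed helper)."""
    stages = [TrainingStage("pretraining", "general pre-training corpus")]
    if use_msm:
        # MSM comes after pre-training but before alignment fine-tuning.
        stages.append(
            TrainingStage(
                "model_spec_midtraining",
                "synthetic documents discussing the Model Spec",
            )
        )
    stages.append(
        TrainingStage(
            "alignment_finetuning",
            "demonstrations of spec-aligned behavior",
        )
    )
    return stages


if __name__ == "__main__":
    for stage in build_pipeline(use_msm=True):
        print(f"{stage.name}: {stage.data}")
```

Calling build_pipeline(use_msm=False) gives the standard baseline the abstract contrasts against: pre-training followed directly by alignment fine-tuning.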
