Self-Supervised Alignment with Mutual Information: Learning to Follow Principles without Preference Labels
Jan-Philipp Fränken, Eric Zelikman, Rafael Rafailov, Kanishk Gandhi, Tobias Gerstenberg, Noah D. Goodman

TL;DR
SAMI is a novel iterative algorithm that fine-tunes pretrained language models to follow behavioral principles without requiring preference labels, demonstrations, or human oversight, improving alignment in dialogue and summarization tasks.
Contribution
Introduces SAMI, a method for aligning language models to principles using mutual information, eliminating the need for preference labels or demonstrations.
Findings
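On single-turn dialogue and summarization, a SAMI-trained mistral-7b outperforms the initial pretrained model, with win rates between 66% and 77%.
SAMI also surpasses an instruction-finetuned baseline (mistral-7b-instruct) in win rates.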
SAMI generalizes to diverse principles and larger models.
Abstract
When prompting a language model (LM), users often expect the model to adhere to a set of behavioral principles across diverse tasks, such as producing insightful content while avoiding harmful or biased language. Instilling such principles (i.e., a constitution) into a model is resource-intensive, technically challenging, and generally requires human preference labels or examples. We introduce SAMI, an iterative algorithm that finetunes a pretrained language model (without requiring preference labels or demonstrations) to increase the conditional mutual information between constitutions and self-generated responses given queries from a dataset. On single-turn dialogue and summarization, a SAMI-trained mistral-7b outperforms the initial pretrained model, with win rates between 66% and 77%. Strikingly, it also surpasses an instruction-finetuned baseline (mistral-7b-instruct) with win…
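Illustrative sketch
The abstract describes the objective only at a high level: increase the conditional mutual information between constitutions and self-generated responses given a query. One common way to optimize a lower bound on mutual information is an InfoNCE-style contrastive loss, and the sketch below assumes that form; it is not confirmed by the text above to be the authors' exact loss. The matrix `logp` and the function names are hypothetical: `logp[i][j]` stands for the log-probability a model assigns to response j (generated under constitution j) when conditioned on the same query and constitution i. The loss is low when each response is most likely under the constitution that produced it.

```python
import math


def log_softmax(values):
    """Numerically stable log-softmax over a list of floats."""
    m = max(values)
    lse = m + math.log(sum(math.exp(v - m) for v in values))
    return [v - lse for v in values]


def contrastive_mi_loss(logp):
    """Symmetric InfoNCE-style loss over a square score matrix.

    logp[i][j] is an assumed score: log p(response_j | query, constitution_i).
    Matching pairs sit on the diagonal.
    """
    n = len(logp)
    # Row direction: given constitution i, pick out its own response.
    row_loss = -sum(log_softmax(logp[i])[i] for i in range(n)) / n
    # Column direction: given response j, pick out its own constitution.
    cols = [[logp[i][j] for i in range(n)] for j in range(n)]
    col_loss = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (row_loss + col_loss)


def mi_estimate(logp):
    """InfoNCE-style estimate: log(n) minus the contrastive loss."""
    return math.log(len(logp)) - contrastive_mi_loss(logp)


# Responses that carry no information about their constitution.
uninformative = [[0.0, 0.0], [0.0, 0.0]]
# Responses that are far more likely under their own constitution.
aligned = [[-1.0, -5.0], [-6.0, -2.0]]

print(contrastive_mi_loss(uninformative), mi_estimate(uninformative))
print(contrastive_mi_loss(aligned), mi_estimate(aligned))
```

In a training loop, the scores would come from the model being finetuned and the loss would be minimized by gradient descent over sampled queries and constitutions; that wiring is omitted here because the summary above does not specify it.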
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Sparse Evolutionary Training · Balanced Selection · ALIGN
