Self-Supervised Alignment with Mutual Information: Learning to Follow Principles without Preference Labels

Jan-Philipp Fränken; Eric Zelikman; Rafael Rafailov; Kanishk Gandhi; Tobias Gerstenberg; Noah D. Goodman

arXiv:2404.14313 · cs.CL · May 22, 2024 · 1 citation

TL;DR

SAMI is a novel iterative algorithm that fine-tunes pretrained language models to follow behavioral principles without requiring preference labels, demonstrations, or human oversight, improving alignment in dialogue and summarization tasks.

Contribution

Introduces SAMI, a method for aligning language models to principles using mutual information, eliminating the need for preference labels or demonstrations.

Findings

01

On single-turn dialogue and summarization, a SAMI-trained mistral-7b outperforms the initial pretrained model, with win rates between 66% and 77%.

02

The SAMI-trained mistral-7b also surpasses an instruction-finetuned baseline (mistral-7b-instruct) in win rates.

03

SAMI generalizes to diverse principles and larger models.

Abstract

When prompting a language model (LM), users often expect the model to adhere to a set of behavioral principles across diverse tasks, such as producing insightful content while avoiding harmful or biased language. Instilling such principles (i.e., a constitution) into a model is resource-intensive, technically challenging, and generally requires human preference labels or examples. We introduce SAMI, an iterative algorithm that finetunes a pretrained language model (without requiring preference labels or demonstrations) to increase the conditional mutual information between constitutions and self-generated responses given queries from a dataset. On single-turn dialogue and summarization, a SAMI-trained mistral-7b outperforms the initial pretrained model, with win rates between 66% and 77%. Strikingly, it also surpasses an instruction-finetuned baseline (mistral-7b-instruct) with win…
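The quantity the abstract refers to, the conditional mutual information between constitutions and responses given queries, can be illustrated with a contrastive lower bound. The following is a minimal sketch, not the authors' implementation: the abstract does not say which estimator SAMI uses, so an InfoNCE-style bound is assumed here, and the function name `contrastive_mi_bound` and the score matrix `logp` are hypothetical. Entry `logp[i][j]` stands for the log-probability a model assigns to response `j` (generated under constitution `j`) when conditioned on constitution `i` and a shared query.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def contrastive_mi_bound(logp):
    """InfoNCE-style lower-bound estimate (in nats) of the mutual information
    between constitutions and responses for one query.

    logp[i][j] is a hypothetical score: log p(response_j | query, constitution_i).
    The estimator is an assumption made for illustration only.
    """
    n = len(logp)
    if n == 0 or any(len(row) != n for row in logp):
        raise ValueError("logp must be a non-empty square matrix")

    total = 0.0
    for i in range(n):
        # Direction 1: given constitution i, pick out its own response.
        row = logp[i]
        total += row[i] - logsumexp(row)
        # Direction 2: given response i, pick out its own constitution.
        col = [logp[k][i] for k in range(n)]
        total += col[i] - logsumexp(col)

    # Average both directions; the bound is capped at log(n).
    return total / (2 * n) + math.log(n)


if __name__ == "__main__":
    uninformative = [[0.0, 0.0], [0.0, 0.0]]
    informative = [[0.0, -50.0], [-50.0, 0.0]]
    print(contrastive_mi_bound(uninformative))
    print(contrastive_mi_bound(informative))
```

When every response is equally likely under every constitution, the estimate is zero; when each response is far more likely under its own constitution, the estimate approaches its ceiling of log(n). Under this assumed estimator, finetuning to raise the bound would push responses to depend more strongly on the constitution they were written under.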

Code & Models

Repositories

janphilippfranken/sami
PyTorch · Official

Videos

Self-Supervised Alignment with Mutual Information: Learning to Follow Principles without Preference Labels · SlidesLive

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Sparse Evolutionary Training · Balanced Selection · ALIGN