Iterative Finetuning is Mostly Idempotent
Zephaniah Roe, Jack Sanderson, Dang Nguyen, Julian Huang, Todd Nief, Aryan Shrivastava, Chenhao Tan, Ari Holtzman

TL;DR
This study investigates whether iterative finetuning, in which each model is trained on its predecessor's outputs, amplifies behavioral traits in language models. It finds that amplification occurs mainly during continual preference training and can be mitigated by reinitialization.
Contribution
The paper provides empirical evidence on how trait amplification differs across finetuning methods (SFT, SDF, and DPO) and shows that whether amplification occurs depends on how the post-training stage is set up.
Findings
Trait amplification is rare in supervised and synthetic document finetuning.
Amplification reliably occurs in preference optimization with continual training.
Reinitializing models prevents trait amplification during preference training.
Abstract
If a model has some behavioral tendency, such as sycophancy or misalignment, and it is trained on its own outputs, will the tendency be amplified in the next generation of models? We study this question by training a series of models where each model is finetuned on data generated by its predecessor, and the initial model is seeded with some persona or belief. We test three settings: supervised finetuning (SFT) on instruct models, synthetic document finetuning (SDF) on base models, and direct preference optimization (DPO). In the SFT and SDF settings, traits mostly decay or remain constant so that further finetuning cycles do nothing. In rare cases when amplification occurs, it generally comes at the cost of coherence. In the DPO setting, trait amplification can reliably occur when a model is continually trained with a preference for its own outputs, but vanishes when models are…
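The generational setup described in the abstract can be sketched as a simple loop. This is only a minimal illustration, not the authors' code: the function `iterate_finetuning`, the callables `generate`, `finetune`, and `measure_trait`, and the `reinitialize` flag are all hypothetical names. In particular, the sketch assumes that "reinitialization" means each generation restarts training from a fixed starting checkpoint (`base_model`) rather than continuing from the previous generation; the summary above does not spell out how the paper implements it.

```python
from typing import Callable, List, TypeVar

Model = TypeVar("Model")
Data = TypeVar("Data")


def iterate_finetuning(
    seed_model: Model,
    base_model: Model,
    generate: Callable[[Model], Data],
    finetune: Callable[[Model, Data], Model],
    measure_trait: Callable[[Model], float],
    generations: int,
    reinitialize: bool = True,
) -> List[float]:
    """Train a chain of models, each on data generated by its predecessor.

    All arguments are hypothetical placeholders supplied by the caller:
    - seed_model: generation 0, seeded with some persona or belief
    - base_model: checkpoint to restart from when reinitialize is True
    - generate: samples training data from a model
    - finetune: returns a new model trained from a starting point on data
      (this could stand in for SFT, SDF, or DPO)
    - measure_trait: scores how strongly a model exhibits the trait
    Returns the trait score of every generation, starting with the seed.
    """
    scores = [measure_trait(seed_model)]
    current = seed_model
    for _ in range(generations):
        # The predecessor produces the data for the next generation.
        data = generate(current)
        # Assumption: reinitialization restarts from a fixed checkpoint;
        # continual training keeps updating the previous generation.
        start = base_model if reinitialize else current
        current = finetune(start, data)
        scores.append(measure_trait(current))
    return scores
```

Comparing the returned list of scores across generations shows whether a trait decays, stays constant, or is amplified under a given choice of `finetune` and `reinitialize`. The loop itself makes no claim about which outcome occurs; that depends entirely on the callables passed in.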
