The Poison of Alignment
Aibek Bekbayev, Sungbae Chun, Yerzat Dulat, James Yamazaki

TL;DR
This paper argues that aligned answers in instruction-tuning datasets can harm large language models' reasoning abilities, acting like a poison that degrades the fine-tuned model's performance on several reasoning benchmarks.
Contribution
It introduces the novel insight that alignment acts as a poison in supervised fine-tuning datasets, impairing model reasoning performance.
Findings
Aligned models perform 4-33% worse on reasoning benchmarks than their counterparts.
The degradation shows up across BBH, MMLU, HumanEval, and DROP.
In the experiments, aligned answers in the supervised fine-tuning data behave like poisoned examples.
Abstract
From the perspective of content safety, alignment has been shown to limit large language models' (LLMs) harmful content generation. This intentional method of reinforcing models not to respond to certain user inputs seems to be present in many modern open-source instruction tuning datasets such as OpenAssistant or Guanaco. We introduce a novel insight into how an instruction-tuned model's performance is affected by the presence of alignment in the supervised fine-tuning dataset. Specifically, we noticed that alignment acts as if it is poisoning the instruction dataset. Experimentally, we demonstrate that aligned answers significantly worsen the performance of the resulting fine-tuned model on various reasoning benchmarks such as BIG-Bench Hard (BBH), Massive Multitask Language Understanding (MMLU), HumanEval, and Discrete Reasoning Over Paragraphs (DROP), performing worse than the counterpart tuned…
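As a rough illustration of what separating "aligned answers" from the rest of an instruction dataset could look like, here is a minimal Python sketch. It is not the authors' procedure, which the truncated abstract does not describe. The marker phrases, the `instruction`/`response` field names, and the helpers `is_aligned_answer` and `split_dataset` are all hypothetical, chosen only for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical refusal-style phrases; these are an assumption for
# illustration and are NOT taken from the paper.
REFUSAL_MARKERS = (
    "as an ai language model",
    "i'm sorry, but",
    "i cannot assist",
    "i can't help with",
)


def is_aligned_answer(response: str) -> bool:
    """Return True if the response contains any refusal-style marker."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def split_dataset(
    examples: List[Dict[str, str]],
) -> Tuple[List[Dict[str, str]], List[Dict[str, str]]]:
    """Split examples into (clean, aligned) by a simple keyword match.

    Each example is assumed to be a dict with a "response" key.
    """
    clean: List[Dict[str, str]] = []
    aligned: List[Dict[str, str]] = []
    for example in examples:
        if is_aligned_answer(example["response"]):
            aligned.append(example)
        else:
            clean.append(example)
    return clean, aligned


if __name__ == "__main__":
    data = [
        {"instruction": "What is 2 + 2?", "response": "2 + 2 equals 4."},
        {"instruction": "Do X.", "response": "I'm sorry, but I cannot assist with that."},
    ]
    clean, aligned = split_dataset(data)
    print(len(clean), "clean;", len(aligned), "aligned")
```

Keyword matching like this is crude and will miss paraphrased refusals or flag harmless text that happens to contain a marker, so it is best read as a starting point for comparing a model tuned on the full data with one tuned on the `clean` subset.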
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
