The Poison of Alignment

Aibek Bekbayev; Sungbae Chun; Yerzat Dulat; James Yamazaki

arXiv:2308.13449·cs.CL·August 28, 2023

TL;DR

This paper reports that aligned answers in instruction-tuning datasets can degrade the reasoning ability of large language models fine-tuned on them, acting like a poison that lowers performance on multiple benchmarks.

Contribution

It introduces the novel insight that alignment acts as a poison in supervised fine-tuning datasets, impairing model reasoning performance.

Findings

01. Models fine-tuned with aligned answers perform 4-33% worse on reasoning benchmarks.

02. Aligned answers in the fine-tuning data significantly worsen model performance across various tasks.

03. Experiments indicate that alignment acts like a poison in the instruction dataset.

Abstract

From the perspective of content safety, alignment has been shown to limit large language models' (LLMs) harmful content generation. This intentional method of reinforcing models not to respond to certain user inputs seems to be present in many modern open-source instruction tuning datasets such as OpenAssistant or Guanaco. We introduce a novel insight into how an instruction-tuned model's performance is affected by the presence of alignment in the supervised fine-tuning dataset. Specifically, we noticed that alignment acts as if it is poisoning the instruction dataset. Experimentally, we demonstrate that aligned answers significantly worsen the performance of the resulting fine-tuned model on various reasoning benchmarks such as BIG-Bench Hard (BBH), Massive Multitask Language Understanding (MMLU), HumanEval, and Discrete Reasoning Over Paragraphs (DROP), performing worse than the counterpart tuned…
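To make the idea of "aligned answers" in a supervised fine-tuning dataset concrete, the sketch below flags refusal-style responses with a small keyword list and separates them from the rest of the data. This is only an illustration under stated assumptions: the phrase list, the `response` field name, and the helper names `is_aligned_answer` and `split_dataset` are hypothetical and are not taken from the paper, which may identify aligned answers differently.

```python
import re

# Hypothetical refusal markers chosen for illustration; not the paper's list.
REFUSAL_PATTERNS = [
    r"\bas an ai\b",
    r"\bi cannot\b",
    r"\bi can't\b",
    r"\bi'm sorry\b",
    r"\bi am sorry\b",
]
_REFUSAL_RE = re.compile("|".join(REFUSAL_PATTERNS), re.IGNORECASE)


def is_aligned_answer(answer: str) -> bool:
    """Return True if the answer contains any refusal-style marker."""
    return _REFUSAL_RE.search(answer) is not None


def split_dataset(records):
    """Split records into (kept, flagged) based on the 'response' field.

    The 'response' key is an assumed schema, not one defined by the paper.
    """
    kept, flagged = [], []
    for rec in records:
        if is_aligned_answer(rec["response"]):
            flagged.append(rec)
        else:
            kept.append(rec)
    return kept, flagged


if __name__ == "__main__":
    sample = [
        {"instruction": "Capital of France?", "response": "Paris."},
        {"instruction": "Do X.", "response": "As an AI language model, I cannot help with that."},
    ]
    kept, flagged = split_dataset(sample)
    print(len(kept), "kept;", len(flagged), "flagged")
```

A keyword filter like this is coarse and will miss paraphrased refusals or flag harmless uses of the same phrases, so flagged records would normally be reviewed before being dropped.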

Code & Models

Models

GOAT-AI/GOAT-7B-Community (Hugging Face model · 837 downloads · 36 likes)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification