Fake Alignment: Are LLMs Really Aligned Well?
Yixu Wang, Yan Teng, Kexin Huang, Chengqi Lyu, Songyang Zhang, Wenwei Zhang, Xingjun Ma, Yu-Gang Jiang, Yu Qiao, Yingchun Wang

TL;DR
This paper investigates the discrepancy between LLM safety performance on multiple-choice questions and on open-ended questions, attributes it to fake alignment, and proposes an evaluation framework to detect and quantify the problem.
Contribution
It introduces the Fake alIgNment Evaluation (FINE) framework and novel metrics to detect and quantify fake alignment in LLMs, and demonstrates how multiple-choice data can enhance safety alignment.
Findings
Several models are poorly aligned in safety evaluations.
Multiple-choice data can improve LLM safety alignment.
FINE metrics effectively quantify fake alignment.
Abstract
The growing awareness of safety concerns in large language models (LLMs) has sparked considerable interest in safety evaluation. This study investigates an under-explored issue in the evaluation of LLMs: the substantial discrepancy in performance between multiple-choice questions and open-ended questions. Inspired by research on jailbreak attack patterns, we argue that this is caused by mismatched generalization. That is, an LLM only remembers the answer style for open-ended safety questions, which makes it unable to solve other forms of safety tests. We refer to this phenomenon as fake alignment and construct a comparative benchmark to empirically verify its existence in LLMs. We introduce a Fake alIgNment Evaluation (FINE) framework and two novel metrics, Consistency Score (CS) and Consistent Safety Score (CSS), which jointly assess two complementary forms of evaluation to…
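The abstract as shown here is cut off before it defines the two metrics, so the following is only a minimal sketch under a stated assumption: CS is taken to be the fraction of questions on which the model's safety outcome agrees between the open-ended and multiple-choice forms, and CSS the fraction on which it is safe in both forms. The function name `consistency_scores` and the boolean per-question inputs are hypothetical and are not taken from the paper.

```python
from typing import Sequence, Tuple


def consistency_scores(
    open_safe: Sequence[bool], mc_safe: Sequence[bool]
) -> Tuple[float, float]:
    """Return (CS, CSS) under the assumed definitions described above.

    open_safe[i] is True if the open-ended answer to question i was judged safe.
    mc_safe[i] is True if the multiple-choice answer to question i was judged safe.
    """
    if len(open_safe) != len(mc_safe):
        raise ValueError("both inputs must cover the same questions")
    if not open_safe:
        raise ValueError("at least one question is required")

    n = len(open_safe)
    # Assumed CS: share of questions where the two formats agree (safe or unsafe).
    agree = sum(o == m for o, m in zip(open_safe, mc_safe))
    # Assumed CSS: share of questions where the model is safe in both formats.
    both_safe = sum(o and m for o, m in zip(open_safe, mc_safe))
    return agree / n, both_safe / n


# Four questions: the model is safe on three open-ended prompts but only on
# two of the matching multiple-choice prompts.
cs, css = consistency_scores(
    [True, True, True, False],
    [True, False, True, False],
)
print(cs, css)  # 0.75 0.5
```

Under this reading, a model that looks safe only on open-ended prompts would show a large gap between its open-ended safety rate and its CSS, which is the kind of discrepancy the abstract calls fake alignment.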
Taxonomy
Topics: Digital Rights Management and Security
