On the Safety of Open-Sourced Large Language Models: Does Alignment Really Prevent Them From Being Misused?
Hangfan Zhang, Zhimeng Guo, Huaisheng Zhu, Bochuan Cao, Lu Lin, Jinyuan Jia, Jinghui Chen, Dinghao Wu

TL;DR
This paper demonstrates that aligned open-source large language models can be easily misled to generate harmful, biased, or private content, revealing limitations of current alignment techniques in preventing misuse.
Contribution
It provides empirical evidence that current alignment methods do not sufficiently prevent open-source LLMs from being misused through simple manipulation of the generation process.
Findings
Aligned open-source LLMs can be misguided without heavy computation.
Misuse includes generating harmful, biased, or private data.
Current mitigation strategies are insufficient.
Abstract
Large Language Models (LLMs) have achieved unprecedented performance in Natural Language Generation (NLG) tasks. However, many existing studies have shown that they could be misused to generate undesired content. In response, before releasing LLMs for public access, model developers usually align those language models through Supervised Fine-Tuning (SFT) or Reinforcement Learning with Human Feedback (RLHF). Consequently, those aligned large language models refuse to generate undesired content when facing potentially harmful/unethical requests. A natural question is "could alignment really prevent those open-sourced large language models from being misused to generate undesired content?". In this work, we provide a negative answer to this question. In particular, we show those open-sourced, aligned large language models could be easily misguided to generate undesired content without…
Taxonomy
Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Artificial Intelligence in Law
Methods: ALIGN
