Safety Alignment of Large Language Models via Contrasting Safe and Harmful Distributions
Xiaoyun Zhang, Zhengyue Zhao, Wenxuan Shi, Kaidi Xu, Di Huang, Xing Hu

TL;DR
This paper introduces Adversarial Contrastive Decoding (ACD), a lightweight, training-free method that improves the safety of large language models by contrasting safe and harmful prompts, outperforming previous decoding techniques.
Contribution
The paper proposes a novel, optimization-based prompt contrastive decoding framework that enhances LLM safety without extensive training or high computational costs.
Findings
ACD significantly improves safety performance over previous decoding methods.
ACD maintains the original generation ability of language models.
ACD requires only lightweight prompt tuning on small datasets.
Abstract
With the widespread application of Large Language Models (LLMs), it has become a significant concern to ensure their safety and prevent harmful responses. While current safe-alignment methods based on instruction fine-tuning and Reinforcement Learning from Human Feedback (RLHF) can effectively reduce harmful responses from LLMs, they often require high-quality datasets and heavy computational overhead during model training. Another way to align language models is to modify the logit of tokens in model outputs without heavy training. Recent studies have shown that contrastive decoding can enhance the performance of language models by reducing the likelihood of confused tokens. However, these methods require the manual selection of contrastive models or instruction templates, limiting the degree of contrast. To this end, we propose Adversarial Contrastive Decoding (ACD), an…
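The logit-modification idea in the abstract can be illustrated with a minimal sketch. It assumes the generic prompt-contrastive form `(1 + alpha) * safe - alpha * harmful` over next-token log-probabilities; the abstract is truncated before it states ACD's actual formula, so this is not necessarily the authors' method. The function names, `alpha`, the three-token vocabulary, and the logit values are all hypothetical.

```python
import math


def log_softmax(logits):
    """Convert raw logits to log-probabilities (numerically stable)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def contrast_logits(safe_logits, harmful_logits, alpha=0.5):
    """Combine next-token scores from two prompted runs of the same model.

    Assumed form (a common contrastive-decoding variant, not confirmed
    by the source): (1 + alpha) * log p_safe - alpha * log p_harmful.
    alpha = 0 reduces to ordinary decoding under the safe prompt.
    """
    if len(safe_logits) != len(harmful_logits):
        raise ValueError("both logit vectors must cover the same vocabulary")
    safe_lp = log_softmax(safe_logits)
    harmful_lp = log_softmax(harmful_logits)
    return [(1 + alpha) * s - alpha * h for s, h in zip(safe_lp, harmful_lp)]


def greedy_token(scores):
    """Return the index of the highest-scoring token."""
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical toy values: logits for the same user query under a
# safety-promoting prompt and under a harm-promoting prompt.
vocab = ["Sure", "Sorry", "Here"]
safe = [2.0, 1.9, 0.5]
harmful = [3.0, 0.0, 1.0]

plain_choice = vocab[greedy_token(safe)]
contrasted_choice = vocab[greedy_token(contrast_logits(safe, harmful, alpha=0.5))]
print(plain_choice, contrasted_choice)
```

With these toy numbers, the harm-promoting prompt strongly favors "Sure", so subtracting its log-probabilities moves the greedy choice from "Sure" to "Sorry". In a real setting both logit vectors would come from two forward passes of the same model per decoding step, which is where the extra inference cost of this family of methods arises.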
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Natural Language Processing Techniques · Topic Modeling
Methods: ALIGN
