Robustifying Safety-Aligned Large Language Models through Clean Data Curation
Xiaoqun Liu, Jiacheng Liang, Muchao Ye, Zhaohan Xi

TL;DR
This paper introduces a data curation framework that iteratively revises texts to reduce the impact of harmful content, significantly improving the safety alignment of large language models against adversarial attacks.
Contribution
It proposes a novel data curation method that enhances LLM safety by neutralizing malicious texts without prior attack knowledge, improving robustness against jailbreaking.
Findings
Mitigates harmful responses by 71% in pre-trained LLMs.
Reduces attack success rate significantly with curated data.
Enhances safety alignment through iterative text revision.
Abstract
Large language models (LLMs) are vulnerable when trained on datasets containing harmful content, which leads to potential jailbreaking attacks in two scenarios: the integration of harmful texts within crowdsourced data used for pre-training and direct tampering with LLMs through fine-tuning. In both scenarios, adversaries can compromise the safety alignment of LLMs, exacerbating malfunctions. Motivated by the need to mitigate these adversarial influences, our research aims to enhance safety alignment by either neutralizing the impact of malicious texts in pre-training datasets or increasing the difficulty of jailbreaking during downstream fine-tuning. In this paper, we propose a data curation framework designed to counter adversarial impacts in both scenarios. Our method operates under the assumption that we have no prior knowledge of attack details, focusing solely on curating clean…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
