Does Safety Training of LLMs Generalize to Semantically Related Natural Prompts?
Sravanti Addepalli, Yerram Varun, Arun Suggala, Karthikeyan Shanmugam, Prateek Jain

TL;DR
This paper investigates whether safety fine-tuned large language models such as GPT-4 remain vulnerable to natural, semantically related prompts that induce unsafe responses, and finds significant safety gaps.
Contribution
It introduces Response Guided Question Augmentation (ReG-QA), a novel method to generate natural prompts that can jailbreak safety-aligned LLMs, demonstrating their vulnerability.
Findings
GPT-4 can be compromised with naive, natural prompts.
ReG-QA effectively generates prompts that elicit unsafe responses.
Aligned LLMs are more vulnerable to natural jailbreak prompts than previously thought.
Abstract
Large Language Models (LLMs) are known to be susceptible to crafted adversarial attacks or jailbreaks that lead to the generation of objectionable content despite being aligned to human preferences using safety fine-tuning methods. While the large dimensionality of input token space makes it inevitable to find adversarial prompts that can jailbreak these models, we aim to evaluate whether safety fine-tuned LLMs are safe against natural prompts which are semantically related to toxic seed prompts that elicit safe responses after alignment. We surprisingly find that popular aligned LLMs such as GPT-4 can be compromised using naive prompts that are NOT even crafted with an objective of jailbreaking the model. Furthermore, we empirically show that given a seed prompt that elicits a toxic response from an unaligned model, one can systematically generate several semantically related natural…
Taxonomy
Topics: Artificial Intelligence in Law · Natural Language Processing Techniques
Methods: Attention Is All You Need · Adam · Position-Wise Feed-Forward Layer · Linear Layer · Softmax · Multi-Head Attention · Byte Pair Encoding · Label Smoothing · Dropout · Dense Connections
