Does Safety Training of LLMs Generalize to Semantically Related Natural Prompts?

Sravanti Addepalli; Yerram Varun; Arun Suggala; Karthikeyan Shanmugam; Prateek Jain

arXiv:2412.03235·cs.CL·March 26, 2025

TL;DR

This paper investigates whether safety fine-tuned large language models such as GPT-4 remain safe on natural prompts that are semantically related to toxic seed prompts, and finds that such prompts can elicit unsafe responses.

Contribution

It introduces Response Guided Question Augmentation (ReG-QA), a novel method to generate natural prompts that can jailbreak safety-aligned LLMs, demonstrating their vulnerability.
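The page does not spell out the steps of ReG-QA, so the following is only a minimal sketch, assuming the method works roughly as its name suggests: responses to a seed prompt are collected first, and new questions are then generated from those responses. The function name `reg_qa_sketch` and the two callables `answer_model` and `question_model` are hypothetical placeholders, not the authors' code or any real library's API.

```python
from typing import Callable, List


def reg_qa_sketch(
    seed_question: str,
    answer_model: Callable[[str, int], List[str]],    # hypothetical: question -> n answers
    question_model: Callable[[str, int], List[str]],  # hypothetical: answer -> n questions
    n_answers: int = 2,
    n_questions: int = 2,
) -> List[str]:
    """Return de-duplicated questions generated from answers to the seed.

    Assumed two-stage loop (not confirmed by the source page):
      1. ask `answer_model` for several answers to the seed question;
      2. ask `question_model` for several questions that each answer could respond to.
    The seed itself is excluded from the returned list.
    """
    seen = {seed_question}
    augmented: List[str] = []
    for answer in answer_model(seed_question, n_answers):
        for question in question_model(answer, n_questions):
            if question not in seen:
                seen.add(question)
                augmented.append(question)
    return augmented


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without any model behind it.
    def toy_answers(question: str, n: int) -> List[str]:
        return [f"answer {i} to: {question}" for i in range(n)]

    def toy_questions(answer: str, n: int) -> List[str]:
        return [f"variant {j} of a question about: {answer}" for j in range(n)]

    for q in reg_qa_sketch("seed question", toy_answers, toy_questions):
        print(q)
```

Passing the models in as plain callables keeps the sketch independent of any particular LLM client; in an evaluation setting, each returned question would then be sent to the aligned model under test and its response judged for safety.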

Findings

01. GPT-4 can be compromised with naive, natural prompts.

02. ReG-QA effectively generates prompts that elicit unsafe responses.

03. Aligned LLMs are more vulnerable to natural jailbreak prompts than previously thought.

Abstract

Large Language Models (LLMs) are known to be susceptible to crafted adversarial attacks or jailbreaks that lead to the generation of objectionable content despite being aligned to human preferences using safety fine-tuning methods. While the large dimensionality of input token space makes it inevitable to find adversarial prompts that can jailbreak these models, we aim to evaluate whether safety fine-tuned LLMs are safe against natural prompts which are semantically related to toxic seed prompts that elicit safe responses after alignment. We surprisingly find that popular aligned LLMs such as GPT-4 can be compromised using naive prompts that are NOT even crafted with an objective of jailbreaking the model. Furthermore, we empirically show that given a seed prompt that elicits a toxic response from an unaligned model, one can systematically generate several semantically related natural…

Videos

Does Safety Training of LLMs Generalize to Semantically Related Natural Prompts? · SlidesLive

Taxonomy

Topics: Artificial Intelligence in Law · Natural Language Processing Techniques

Methods: Attention Is All You Need · Adam · Position-Wise Feed-Forward Layer · Linear Layer · Softmax · Multi-Head Attention · Byte Pair Encoding · Label Smoothing · Dropout · Dense Connections