Safe2Harm: Semantic Isomorphism Attacks for Jailbreaking Large Language Models
Fan Yang

TL;DR
This paper introduces Safe2Harm, a semantic isomorphism attack that jailbreaks large language models by rewriting a harmful question into a safe one with similar underlying principles, having the model answer the safe question, and then mapping that answer back to the harmful topic. It reports higher jailbreak success rates than existing methods.
Contribution
The paper presents a new semantic isomorphism attack method for jailbreaking LLMs, demonstrating its effectiveness across multiple models and datasets, and provides a challenging harmful content evaluation dataset.
Findings
Safe2Harm achieves higher jailbreak success rates than existing methods.
The method is effective across 7 mainstream LLMs and various benchmark datasets.
A new dataset with 358 harmful content samples is constructed for evaluation.
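A jailbreak success rate like the one referred to above is typically a simple ratio of successful attempts to total attempts. The sketch below assumes each attempt has already been labeled as a success or failure by some judge; the function name, the labeling step, and the sample labels are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def attack_success_rate(judged: Sequence[bool]) -> float:
    """Return the fraction of attempts labeled as successful.

    `judged` holds one boolean per attempt (True = judged a success).
    How the labels are produced is outside this sketch.
    """
    if not judged:
        raise ValueError("need at least one judged attempt")
    # True counts as 1 and False as 0, so sum() counts the successes.
    return sum(judged) / len(judged)


if __name__ == "__main__":
    # Illustrative labels only, not results from the paper.
    example_labels = [True, False, True, True]
    print(f"success rate: {attack_success_rate(example_labels):.2%}")
```

Comparing methods on this metric is only meaningful when the same judge and the same question set are used for every method.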
Abstract
Large Language Models (LLMs) have demonstrated exceptional performance across various tasks, but their security vulnerabilities can be exploited by attackers to generate harmful content, causing adverse impacts across various societal domains. Most existing jailbreak methods revolve around prompt engineering or adversarial optimization, yet we identify a previously overlooked phenomenon: many harmful scenarios are highly consistent with legitimate ones in terms of underlying principles. Based on this finding, this paper proposes the Safe2Harm Semantic Isomorphism Attack method, which achieves efficient jailbreaking through four stages: first, rewrite the harmful question into a semantically safe question with similar underlying principles; second, extract the thematic mapping relationship between the two; third, let the LLM generate a detailed response targeting the safe question;…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection · Topic Modeling
