Safe2Harm: Semantic Isomorphism Attacks for Jailbreaking Large Language Models

Fan Yang

arXiv:2512.13703·cs.CR·December 17, 2025

Open Access

TL;DR

This paper introduces Safe2Harm, a semantic isomorphism attack that jailbreaks large language models by rewriting a harmful question into a safe one with similar underlying principles, having the model answer the safe question, and then reversing the rewrite. The paper reports that it outperforms existing jailbreak methods.

Contribution

The paper presents a semantic isomorphism attack method for jailbreaking LLMs, evaluates it across multiple models and datasets, and provides a challenging evaluation dataset of harmful content.

Findings

01 Safe2Harm achieves higher jailbreak success rates than existing methods.

02 The method is effective across 7 mainstream LLMs and various benchmark datasets.

03 A new dataset with 358 harmful content samples is constructed for evaluation.

Abstract

Large Language Models (LLMs) have demonstrated exceptional performance across various tasks, but attackers can exploit their security vulnerabilities to generate harmful content, with adverse impacts across many societal domains. Most existing jailbreak methods revolve around prompt engineering or adversarial optimization, yet we identify a previously overlooked phenomenon: many harmful scenarios are highly consistent with legitimate ones in their underlying principles. Based on this finding, this paper proposes the Safe2Harm Semantic Isomorphism Attack method, which achieves efficient jailbreaking through four stages: first, rewrite the harmful question into a semantically safe question with similar underlying principles; second, extract the thematic mapping relationship between the two; third, let the LLM generate a detailed response targeting the safe question;…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection · Topic Modeling