ForgeDAN: An Evolutionary Framework for Jailbreaking Aligned Large Language Models

Siyang Cheng; Gaotian Liu; Rui Mei; Yilin Wang; Kejia Zhang; Kaishuo Wei; Yuqi Yu; Weiping Wen; Xiaojie Wu; Junhua Liu

arXiv:2511.13548·cs.CR·November 18, 2025

TL;DR

ForgeDAN is an evolutionary framework that increases the diversity and effectiveness of jailbreak attacks against aligned LLMs through multi-strategy perturbations and semantic fitness evaluation; the authors report that it outperforms existing methods.

Contribution

The paper introduces an evolutionary approach that combines multi-level textual perturbations (character, word, and sentence) with semantic fitness guidance to generate more effective jailbreak prompts.

Findings

01. High success rates in jailbreaking aligned LLMs.

02. Outperforms existing state-of-the-art jailbreak methods.

03. Maintains naturalness and stealth in generated prompts.

Abstract

The rapid adoption of large language models (LLMs) has brought both transformative applications and new security risks, including jailbreak attacks that bypass alignment safeguards to elicit harmful outputs. Existing automated jailbreak generation approaches, e.g., AutoDAN, suffer from limited mutation diversity, shallow fitness evaluation, and fragile keyword-based detection. To address these limitations, we propose ForgeDAN, a novel evolutionary framework for generating semantically coherent and highly effective adversarial prompts against aligned LLMs. First, ForgeDAN introduces multi-strategy textual perturbations across character-, word-, and sentence-level operations to enhance attack diversity; then we employ interpretable semantic fitness evaluation based on a text similarity model to guide the evolutionary process toward semantically relevant and harmful outputs; finally,…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection · Topic Modeling