Jailbreak Mimicry: Automated Discovery of Narrative-Based Jailbreaks for Large Language Models
Pavlos Ntais

TL;DR
This paper presents Jailbreak Mimicry, a systematic method for training compact attacker models to automatically generate narrative-based jailbreak prompts, revealing vulnerabilities in large language models and aiding proactive cybersecurity assessments.
Contribution
The paper introduces a reproducible, automated approach to discovering jailbreak prompts using parameter-efficient fine-tuning (LoRA), enabling systematic vulnerability assessment of LLMs in cybersecurity settings.
Findings
Achieved an 81.0% Attack Success Rate (ASR) against GPT-OSS-20B on a held-out test set of 200 items.
Found significant variation in vulnerability across models in cross-model evaluation.
Identified high susceptibility in cybersecurity and deception domains.
Abstract
Large language models (LLMs) remain vulnerable to sophisticated prompt engineering attacks that exploit contextual framing to bypass safety mechanisms, posing significant risks in cybersecurity applications. We introduce Jailbreak Mimicry, a systematic methodology for training compact attacker models to automatically generate narrative-based jailbreak prompts in a one-shot manner. Our approach transforms adversarial prompt discovery from manual craftsmanship into a reproducible scientific process, enabling proactive vulnerability assessment in AI-driven security systems. In work developed for the OpenAI GPT-OSS-20B Red-Teaming Challenge, we use parameter-efficient fine-tuning (LoRA) on Mistral-7B with a curated dataset derived from AdvBench, achieving an 81.0% Attack Success Rate (ASR) against GPT-OSS-20B on a held-out test set of 200 items. Cross-model evaluation reveals significant variation…
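The headline metric can be made concrete with a short sketch. The Python below assumes ASR is simply the fraction of test items judged a successful attack, expressed as a percentage; the paper's actual judging procedure is not described in this summary. The function name `attack_success_rate` and the count of 162 successes are illustrative assumptions: 162 is inferred from the reported 81.0% rate and the 200-item test set, not stated in the source.

```python
from typing import Sequence


def attack_success_rate(judgments: Sequence[bool]) -> float:
    """Return the percentage of test items judged as successful attacks.

    Each entry in `judgments` is True if the attack on that item was
    judged successful, and False otherwise.
    """
    if not judgments:
        raise ValueError("judgments must be non-empty")
    # Booleans sum as 0/1, so sum() counts the successful items.
    return 100.0 * sum(judgments) / len(judgments)


# Hypothetical illustration: 162 successes out of 200 items (assumed count,
# inferred from the reported rate and test-set size).
example = [True] * 162 + [False] * 38
print(f"{attack_success_rate(example):.1f}%")  # prints "81.0%"
```

With only 200 items, each item moves the rate by 0.5 percentage points, which is worth keeping in mind when comparing ASR figures across models.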
