ContextualJailbreak: Evolutionary Red-Teaming via Simulated Conversational Priming
Mario Rodríguez Béjar, Francisco J. Cortés-Delgado, S. Braghin, Jose L. Hernández-Ramos

TL;DR
This paper introduces ContextualJailbreak, an evolutionary red-teaming method that optimizes multi-turn conversational priming to identify vulnerabilities in large language models, outperforming existing approaches.
Contribution
It presents a novel evolutionary search strategy with new mutation operators for automated multi-turn priming, revealing significant model vulnerabilities.
Findings
Achieves a 100% attack success rate on several models.
Discovers harmful prompts that transfer across different models.
Reveals asymmetric robustness across model providers.
Abstract
Large language models (LLMs) remain vulnerable to jailbreak attacks that bypass safety alignment and elicit harmful responses. A growing body of work shows that contextual priming, where earlier turns covertly bias later replies, constitutes a powerful attack surface, with hand-crafted multi-turn scaffolds consistently outperforming single-turn manipulations on capable models. However, automated optimization-based red-teaming has remained largely limited to the single-turn setting, iterating over static prompts and lacking the ability to reason about which forms of conversational priming induce compliance. While recent multi-turn, search-based approaches have begun to bridge this gap, the mutator design space underlying effective primed dialogues remains largely unexplored. We present ContextualJailbreak, a black-box red-teaming strategy that performs evolutionary search over a…
