Compositional Jailbreaking: An Empirical Analysis of Mutator Chain Interactions in Aligned LLMs
Reinelle Jan Bugnot, Soohyeon Choi, Hoon Wei Lim, and Yue Duan

TL;DR
This paper systematically analyzes how sequential combinations of simple jailbreak attacks interact in large language models, showing that chained attacks more often interfere with one another than reinforce one another, with implications for AI safety and robustness.
Contribution
It introduces a framework for mutator chaining, evaluates interactions across multiple models, and uncovers the non-uniform, often destructive, effects of combined attacks.
Findings
Most mutator combinations do not outperform individual attacks.
Synergistic effects are rare but can improve attack success.
Structural properties of safety alignment influence attack interactions.
Abstract
Jailbreaking attacks on large language models pose a significant threat to AI safety by enabling the generation of harmful or restricted content. While prior work has explored both handcrafted and automated jailbreak strategies, the potential for compositional interaction between simple attacks remains underexplored. This paper presents a systematic study of mutator chaining, in which weak jailbreak transformations are applied sequentially to characterize how they interact: whether they reinforce one another, interfere destructively, or produce no meaningful change. We implement twelve baseline mutators and evaluate all ordered pairs on a benchmark of harmful prompts against three popular LLMs. Our framework introduces metrics for completeness and validity that capture both transformation persistence and attack effectiveness. Results reveal that the interaction landscape is highly…
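The bookkeeping behind the pairwise evaluation described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the mutator names are opaque placeholders, the sketch assumes "all ordered pairs" means pairs of distinct mutators (the abstract does not say whether self-pairs are included), and the classification rule, the tolerance `tol`, and the success-rate inputs are hypothetical stand-ins for whatever metrics the paper defines.

```python
from itertools import permutations


def ordered_pairs(mutators):
    """Return every ordered pair (first, second) of distinct mutators.

    Assumption: self-pairs such as (m0, m0) are excluded.
    """
    return list(permutations(mutators, 2))


def classify_interaction(rate_first, rate_second, rate_chain, tol=0.05):
    """Label a chain relative to its best single component.

    Hypothetical rule: the chain is 'synergistic' if its success rate
    exceeds the better individual rate by more than tol, 'destructive'
    if it falls below that rate by more than tol, else 'neutral'.
    """
    best_single = max(rate_first, rate_second)
    if rate_chain > best_single + tol:
        return "synergistic"
    if rate_chain < best_single - tol:
        return "destructive"
    return "neutral"


# Placeholder identifiers for twelve mutators (names are illustrative only).
mutator_names = [f"m{i}" for i in range(12)]
pairs = ordered_pairs(mutator_names)
print(len(pairs))
```

Because order matters, (m0, m1) and (m1, m0) are evaluated as separate chains, so under the distinct-pair assumption twelve mutators yield 12 × 11 chains per model.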
