MetaCipher: A Time-Persistent and Universal Multi-Agent Framework for Cipher-Based Jailbreak Attacks for LLMs
Boyuan Chen, Minghao Shao, Abdul Basit, Siddharth Garg, Muhammad Shafique

TL;DR
MetaCipher is a low-cost, multi-agent jailbreak framework that uses reinforcement learning to generate cipher-based attacks that generalize across LLMs, reaching high attack success rates within 10 queries and holding up across multiple benchmarks.
Contribution
The paper presents MetaCipher, a novel, modular, and adaptive multi-agent framework that significantly improves the efficiency and generality of jailbreak attacks on large language models.
Findings
Achieves state-of-the-art attack success rates within 10 queries.
Demonstrates robustness across diverse victim models and benchmarks.
Outperforms prior jailbreak methods in efficiency and effectiveness.
Abstract
As large language models (LLMs) grow more capable, they face growing vulnerability to sophisticated jailbreak attacks. While developers invest heavily in alignment finetuning and safety guardrails, researchers continue publishing novel attacks, driving progress through adversarial iteration. This dynamic mirrors a strategic game of continual evolution. However, two major challenges hinder jailbreak development: the high cost of querying top-tier LLMs and the short lifespan of effective attacks due to frequent safety updates. These factors limit the cost-efficiency and practical impact of jailbreak research. To address this, we propose MetaCipher, a low-cost, multi-agent jailbreak framework that generalizes across LLMs with varying safety measures. MetaCipher uses reinforcement learning and is modular and adaptive, supporting extension to future strategies. Within as few as 10…
