Transferable & Stealthy Ensemble Attacks: A Black-Box Jailbreaking Framework for Large Language Models
Yiqi Yang, Hongye Fu

TL;DR
This paper introduces a black-box jailbreaking framework for large language models that combines multiple attack strategies to improve transferability and effectiveness, validated by top rankings in the Jailbreaking Attack Track of the Competition for LLM and Agent Safety 2024.
Contribution
It proposes a novel ensemble-based attack framework that leverages insights from prior research to enhance attack transferability and success rates against large language models.
Findings
Outperformed single-method attacks in transferability
Achieved top rankings in the Jailbreaking Attack Track of the Competition for LLM and Agent Safety 2024
Demonstrated effectiveness of embedding manipulation techniques
Abstract
We present a novel black-box jailbreaking framework that integrates multiple LLM-as-Attacker strategies to deliver highly transferable and effective attacks. The framework is grounded in three key insights from prior jailbreaking research and practice: (1) ensemble approaches outperform single methods in exposing the vulnerabilities of aligned LLMs; (2) malicious instructions vary in jailbreaking difficulty and therefore require tailored optimization; and (3) disrupting the semantic coherence of malicious prompts can manipulate their embeddings to boost success rates. Validated in the Competition for LLM and Agent Safety 2024, our solution achieved top rankings in the Jailbreaking Attack Track.
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection
