Transferable & Stealthy Ensemble Attacks: A Black-Box Jailbreaking Framework for Large Language Models

Yiqi Yang; Hongye Fu

arXiv:2410.23558 · cs.CR · November 7, 2025

TL;DR

This paper introduces a black-box jailbreaking framework for large language models that combines multiple LLM-as-Attacker strategies to improve attack transferability and effectiveness. It was validated by top rankings in the Jailbreaking Attack Track of the Competition for LLM and Agent Safety 2024.

Contribution

It proposes a novel ensemble-based attack framework that leverages insights from prior research to enhance attack transferability and success rates against large language models.

Findings

01 Outperformed single-method attacks in transferability

02 Achieved top rankings in the LLM safety competition

03 Demonstrated effectiveness of embedding manipulation techniques

Abstract

We present a novel black-box jailbreaking framework that integrates multiple LLM-as-Attacker strategies to deliver highly transferable and effective attacks. The framework is grounded in three key insights from prior jailbreaking research and practice: (1) ensemble approaches outperform single methods in exposing the vulnerabilities of aligned LLMs; (2) malicious instructions vary in jailbreaking difficulty and therefore require tailored optimization; and (3) disrupting the semantic coherence of malicious prompts can manipulate their embeddings to boost success rates. Validated in the Competition for LLM and Agent Safety 2024, our solution achieved top rankings in the Jailbreaking Attack Track.

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection