Advancing Jailbreak Strategies: A Hybrid Approach to Exploiting LLM Vulnerabilities and Bypassing Modern Defenses

Mohamed Ahmed; Mohamed Abdelmouty; Mingyu Kim; Gunvanth Kandula; Alex Park; James C. Davis

arXiv:2506.21972·cs.CL·June 30, 2025

TL;DR

This paper introduces hybrid jailbreak strategies combining token- and prompt-level techniques to effectively exploit LLM vulnerabilities, surpassing existing methods and bypassing advanced defenses, revealing critical safety gaps.

Contribution

The paper presents novel hybrid attack approaches that significantly improve jailbreak success rates and bypass current safety defenses in large language models.

Findings

01. Hybrid methods achieve higher attack success than either token-level or prompt-level techniques used alone.

02. GCG + PAIR achieves a 91.6% success rate on Llama-3.

03. Hybrid attacks bypass advanced safety defenses such as Gradient Cuff and JBShield.

Abstract

The advancement of Pre-Trained Language Models (PTLMs) and Large Language Models (LLMs) has led to their widespread adoption across diverse applications. Despite their success, these models remain vulnerable to attacks that exploit their inherent weaknesses to bypass safety measures. Two primary inference-phase threats are token-level and prompt-level jailbreaks. Token-level attacks embed adversarial sequences that transfer well to black-box models like GPT but leave detectable patterns and rely on gradient-based token optimization, whereas prompt-level attacks use semantically structured inputs to elicit harmful responses yet depend on iterative feedback that can be unreliable. To address the complementary limitations of these methods, we propose two hybrid approaches that integrate token- and prompt-level techniques to enhance jailbreak effectiveness across diverse PTLMs. GCG + PAIR…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection · Advanced Malware Detection Techniques