AJF: Adaptive Jailbreak Framework Based on the Comprehension Ability of Black-Box Large Language Models

Mingyu Yu; Wei Wang; Yanjie Wei; Sujuan Qin; Fei Gao; Wenmin Li

arXiv:2505.23404 · cs.CL · March 17, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces AJF, an adaptive framework that exploits the comprehension abilities of black-box LLMs to effectively bypass their alignment safeguards through tailored attack strategies.

Contribution

The paper presents a novel adaptive jailbreak framework that categorizes LLMs by comprehension ability and applies customized attack strategies, significantly improving attack success rates.

Findings

01

Achieved 98.9% success rate on GPT-4o

02

Achieved 99.8% success rate on GPT-4.1

03

Demonstrated the effectiveness of adaptive strategies based on comprehension ability

Abstract

Recent advancements in adversarial jailbreak attacks have exposed critical vulnerabilities in Large Language Models (LLMs), enabling the circumvention of alignment safeguards through increasingly sophisticated prompt manipulations. Our experiments find that the effectiveness of jailbreak strategies is influenced by the comprehension ability of the target LLM. Building on this insight, we propose an Adaptive Jailbreak Framework (AJF) based on the comprehension ability of black-box large language models. Specifically, AJF first categorizes the comprehension ability of the LLM and then applies different strategies accordingly. For models with limited comprehension ability (Type-I LLMs), AJF integrates layered semantic mutations with an encryption technique (MuEn strategy) to more effectively evade the LLM's defenses during the input and inference stages. For models with strong…

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 4

Strengths

1. The paper highlights an important failure mode of LLMs when trying to solve multiple tasks simultaneously.
2. The two-pronged approach handles both weak and strong models.
3. The paper demonstrates the success of the attack against the latest models.

Weaknesses

1. The paper's title is a bit misleading. The proposed attack does not seem adaptive against a defense that knows about the attack.
2. The authors argue that AJF can successfully evade three types of safeguards: input filtering, internal safeguards, and output filtering. However, the evaluation does not test the attack along these dimensions. The attack has not been tested against specialized filters such as LlamaGuard or ShieldGemma.
3. While the additional comprehension and decryption ta

Reviewer 02 · Rating 4 · Confidence 3

Strengths

* Novel and well-motivated framework: Adapting adversarial attacks to the comprehension level of the target LLM is an insightful and novel contribution to the field of LLM jailbreaking.
* Strong empirical performance: AJF demonstrates state-of-the-art performance, achieving extremely high ASR against some of the most powerful publicly available models.
* The framework is well-designed to circumvent multiple layers of LLM defenses, including input, inference, and output moderation, which explains

Weaknesses

### Oversimplified LLM Categorization

The framework's foundational step, classifying LLMs into a binary Type-I/Type-II distinction, relies on a single, complex probe prompt, which is a potential single point of failure and may not be robust. "Comprehension" is a spectrum, but the framework reduces it to a binary classification based on a single, engineered task (as in Sec. 3.2). The paper does not investigate the consequences of misclassification (e.g., applying the MuDeEn strategy to a Type-I

Reviewer 03 · Rating 4 · Confidence 4

Strengths

- Novel Conceptual Framework: The paper introduces a compelling new perspective by directly linking jailbreak effectiveness to the target model's comprehension ability. The classification of LLMs into Type-I and Type-II, while simple, is a conceptually insightful approach that moves the field beyond one-size-fits-all attacks. This adaptivity represents a more sophisticated paradigm for adversarial attacks on LLMs.
- Significant Implications for LLM Safety: By demonstrating that a model's advance

Weaknesses

- While the paper introduces a novel "adaptive" perspective, its underlying technical components are largely clever orchestrations of existing primitives rather than fundamental breakthroughs. Programmatic obfuscation via code-like structures and the use of encryption to bypass safety filters are well-established paradigms, explored in prior work such as CodeChameleon and CipherChat. Therefore, the primary contribution lies in the strategic combination of these techniques to form a multi-stage a

Taxonomy

Topics: Cloud Data Security Solutions · Privacy-Preserving Technologies in Data · Security and Verification in Computing