TL;DR
This paper presents a novel attack method called DH-CoT that effectively jailbreaks state-of-the-art black-box LLMs by integrating multiple tricks into a single template, and introduces a data cleaning framework to evaluate attack success.
Contribution
It introduces DH-CoT, a new multi-trick jailbreak attack, and MDH, a data cleaning framework, to improve attack effectiveness and evaluation accuracy on recent LLMs.
Findings
DH-CoT outperforms SOTA methods like H-CoT and TAP.
MDH reliably filters low-quality samples for evaluation.
DH-CoT successfully jailbreaks models including GPT-5 and Claude-4.
Abstract
Existing black-box jailbreak attacks achieve some success on non-reasoning models but degrade significantly on recent SOTA reasoning models. To improve attack ability, inspired by adversarial aggregation strategies, we integrate multiple jailbreak tricks into a single developer template. Specifically, we apply Adversarial Context Alignment to purge semantic inconsistencies and use NTP-based few-shot examples (NTP being a type of harmful prompt) to guide malicious outputs, finally forming the DH-CoT attack with a fake chain of thought. In experiments, we further observe that existing red-teaming datasets include samples unsuitable for evaluating attack gains, such as BPs, NHPs, and NTPs. Such data hinders accurate evaluation of the true lift in attack effectiveness. To address this, we introduce MDH, a Malicious content Detection framework integrating LLM-based annotation with Human assistance, with which we…
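The abstract describes MDH only as LLM-based annotation combined with human assistance. As a rough illustration of that general idea, the sketch below lets several judges vote on each sample, accepts the label automatically when enough of them agree, and routes the rest to a human reviewer. The voting rule, the `min_agreement` threshold, the label names, and the stub judges are all assumptions made for illustration; none of them are taken from the paper.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical judge interface: takes a prompt, returns "harmful" or "benign".
# In practice this would wrap an LLM call; here it is a plain function.
Judge = Callable[[str], str]


def triage(
    samples: Sequence[str],
    judges: Sequence[Judge],
    min_agreement: float = 1.0,  # assumed threshold, not from the paper
) -> Tuple[Dict[str, str], List[str]]:
    """Auto-label samples the judges agree on; queue the rest for humans."""
    auto_labeled: Dict[str, str] = {}
    needs_human: List[str] = []
    for text in samples:
        votes = Counter(judge(text) for judge in judges)
        label, count = votes.most_common(1)[0]
        if count / len(judges) >= min_agreement:
            auto_labeled[text] = label
        else:
            needs_human.append(text)
    return auto_labeled, needs_human


def make_stub_judge(flagged: set) -> Judge:
    """Stand-in for an LLM judge: flags only the samples in `flagged`."""
    return lambda text: "harmful" if text in flagged else "benign"


# Placeholder sample IDs stand in for real prompts.
samples = ["s1", "s2", "s3"]
judges = [
    make_stub_judge({"s1", "s2"}),
    make_stub_judge({"s1"}),
    make_stub_judge({"s1"}),
]

auto, human = triage(samples, judges)
print(auto)   # labels accepted without review
print(human)  # samples routed to a human annotator
```

With unanimous agreement required, only the sample the stub judges disagree on (`s2`) reaches the human queue. Lowering `min_agreement` trades human effort for a higher risk of accepting a wrong automatic label.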
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- Introduces innovative, transferable black-box attacks (D-Attack, DH-CoT) leveraging the emergent Developer role, achieving high efficacy against state-of-the-art reasoning models like o3 and o4, with demonstrated cross-provider transferability to Gemini, Claude, and Deepseek.
- Proposes MDH as a scalable, cost-effective detection framework that balances automation and human oversight, enabling the creation of high-quality RTA datasets.
- Provides a novel prompt taxonomy (BP, NHP, NTP, EHP) to
- Primarily evaluates on OpenAI models due to Developer role dependency, potentially limiting immediate applicability to non-OpenAI APIs without adaptation, though transferability is noted.
- Relies on manual annotation for ground truth in MDH evaluations (e.g., 10–22% of samples), which, while minimized, introduces subjectivity and scalability challenges for larger datasets.
- Experiments for DH-CoT are constrained to the MaliciousEducator dataset for fair comparison, restricting broader valida
1. The paper is well-written with a good intuition. The proposed prompting trick is well-justified by such an intuition.
2. The research topic of stress-testing LLMs and robustifying them is important and timely.
1. The storyline of designing the jailbreak message is built upon the multi-role configuration of the OpenAI model APIs, which limits its applicability when using other attacker models. While I appreciate that the authors have vaguely pointed out such a limitation in the appendix, proposing a potential solution or designing fixes to enhance transferability would be more convincing.
2. Please fix salient typos such as "diccover" in the abstract. While minor typos would not affect the judgment, to
1. This work identifies weaknesses in existing jailbreak datasets and creates a filtered dataset to mitigate them.
2. This work points out important behavioral differences between the system and the developer role, such as showing that prompts that are often rejected in the system role can succeed in the developer role (lines 813-814).
1. The mini test set created for Judger selection only contains 10 prompts, which is extremely small and unreliable (line 156).
2. Limited technical novelty of attacks: both the D-Attack and DH-CoT attacks are a combination of known prompting-based attack vectors, like persona-based attacks (Cognitive hacking or COG in [1]) [1,2], instruction-based attacks (Direct Instruction or INSTR in [1]), few-shot hacking (Few Shot Hacking or FSH in [1]) [1,2], and H-CoT [3].
3. Writing is confusing to fol
- The fine-grained classification of harmful prompts is useful to filter out low-quality samples.
- The paper presents an extensive empirical evaluation of attacks across different OpenAI models.
1. Overall, the writing could be improved for clarity and readability.
   - The paper introduces a large number of non-standard acronyms, which makes it difficult for readers to follow without frequent back-and-forth checking.
   - Important details, such as the significant difference between the system role and the developer role, would be better presented clearly in the main text rather than placed in a lengthy appendix.
   - The overall flow could also be improved, especially in the methodology sectio
