h4rm3l: A language for Composable Jailbreak Attack Synthesis

Moussa Koulako Bala Doumbouya; Ananjan Nandi; Gabriel Poesia; Davide Ghilardi; Anna Goldie; Federico Bianchi; Dan Jurafsky; Christopher D. Manning

arXiv:2408.04811·cs.CR·March 26, 2025

3 Reviews

TL;DR

h4rm3l introduces a formal language and synthesis framework for generating diverse, effective jailbreak attacks on large language models, revealing vulnerabilities and aiding safety evaluation.

Contribution

The paper presents a novel domain-specific language and synthesis approach for composable jailbreak attack generation, enabling large-scale exploration of potential vulnerabilities in LLMs.
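
As a rough illustration of what "composable" means here, the sketch below chains string-to-string transformations into a single program. It is a minimal sketch under stated assumptions, not the paper's implementation: the `Decorator` class, its `then` method, and the three benign example transformations are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Decorator:
    """A named string-to-string transformation (hypothetical class)."""

    name: str
    fn: Callable[[str], str]

    def __call__(self, prompt: str) -> str:
        return self.fn(prompt)

    def then(self, other: "Decorator") -> "Decorator":
        """Return a decorator that applies self first, then other."""
        return Decorator(
            name=f"{self.name}.then({other.name})",
            fn=lambda prompt: other(self(prompt)),
        )


# Benign placeholder transformations, used only to show composition.
upper = Decorator("Upper", str.upper)
reverse_words = Decorator(
    "ReverseWords", lambda s: " ".join(reversed(s.split()))
)
add_prefix = Decorator("Prefix", lambda s: "Note: " + s)

# A composed program is itself a decorator, so it can be composed again.
program = upper.then(reverse_words).then(add_prefix)
result = program("hello composable world")

print(program.name)  # Upper.then(ReverseWords).then(Prefix)
print(result)        # Note: WORLD COMPOSABLE HELLO
```

Because each composition yields another value of the same type, a search procedure can treat programs as data: enumerate, combine, and score them. That is the property a synthesis approach over such a language would rely on.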

Findings

01. Generated 2656 successful jailbreak attacks
02. Achieved over 90% success rate on SOTA LLMs
03. Attacks are more diverse and effective than existing methods

Abstract

Despite their demonstrated valuable capabilities, state-of-the-art (SOTA) widely deployed large language models (LLMs) still have the potential to cause harm to society due to the ineffectiveness of their safety filters, which can be bypassed by prompt transformations called jailbreak attacks. Current approaches to LLM safety assessment, which employ datasets of templated prompts and benchmarking pipelines, fail to cover sufficiently large and diverse sets of jailbreak attacks, leading to the widespread deployment of unsafe LLMs. Recent research showed that novel jailbreak attacks could be derived by composition; however, a formal composable representation for jailbreak attacks, which, among other benefits, could enable the exploration of a large compositional space of jailbreak attacks through program synthesis methods, has not been previously proposed. We introduce h4rm3l, a novel…

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01·Rating 6·Confidence 3

Strengths

- This paper introduces the first formal, composable representation of jailbreak attacks, providing a more systematic and comprehensive approach to assessing LLM vulnerabilities.
- The efficacy of the proposed framework is effectively demonstrated through the synthesis of a substantial dataset comprising successful jailbreak attacks against multiple state-of-the-art (SOTA) LLMs.

Weaknesses

- The estimation of attack success rates relies solely on the assessment of 100 Claude-3-haiku responses by just two human annotators, which raises concerns about the generalizability of the findings.
- The relationship between various jailbreak methods and their impact on the results is not clearly articulated, particularly how the synthesis with h4rm3l connects to these outcomes.
- Some parts of the paper are vaguely written and need elaboration and clarification.
- The following

Reviewer 02·Rating 8·Confidence 3

Strengths

- The paper addresses a timely and relevant topic.
- The structure of the writing is clear and well-organized.
- The paper presents an innovative idea by transforming the jailbreak task into a formal language implementation.
- The paper implements various existing jailbreak methods to support the proposed composite attack strategy.

Weaknesses

- Unclear Motivation and Purpose: The motivation and unique advantage of transforming jailbreak tasks into a formal language are not entirely clear. In my view, directly developing a framework or tool that integrates multiple jailbreak prompting operators and scheduling strategies may already suffice for most jailbreak/red-teaming needs. Therefore, what specific advantages or unique capabilities does this formal language offer? How does it surpass the functionalities of traditional red-teaming f

Reviewer 03·Rating 5·Confidence 2

Strengths

1. This paper studies jailbreak attacks from a novel, software-engineering-like perspective, aiming to synthesize (all) jailbreak attacks and evaluate the jailbreak robustness of LLMs.
2. The experiment is comprehensive and the visual analysis is clear.
3. The discussion part is detailed.

Weaknesses

1. Section 3.1, especially lines 166-167, confuses this reviewer. Section 3.2 seems to have little connection with Section 3.1. Specifically, the description in Section 3.1 does not significantly contribute to this reviewer's understanding of the connection from the motivation to the method of this paper. For example, this reviewer wants to know more details about the generic decorator TransformFxDecorator and why it covers the space of all string-to-string transformations and could repres
