Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking

Zhicheng Fang; Jingjie Zheng; Chenxu Fu; Wei Xu

arXiv:2602.24009 · cs.CR · March 6, 2026

TL;DR

Jailbreak Foundry (JBF) is a system that converts jailbreak research papers into executable modules, enabling standardized, reproducible, and scalable evaluation of attack success rates against large language models.

Contribution

JBF introduces a multi-agent workflow and shared infrastructure to translate jailbreak papers into runnable modules, streamlining reproducibility and comparison across attacks.

Findings

01. High-fidelity reproduction of 30 attacks, with a mean attack success rate (ASR) deviation of +0.26 percentage points (reproduced minus reported)

02. Reduces attack-specific implementation code by nearly half compared to the original repositories

03. Enables standardized evaluation across multiple models with a GPT-4o judge

Abstract

Jailbreak techniques for large language models (LLMs) evolve faster than benchmarks, making robustness estimates stale and difficult to compare across papers due to drift in datasets, harnesses, and judging protocols. We introduce Jailbreak Foundry (JBF), a system that addresses this gap via a multi-agent workflow that translates jailbreak papers into executable modules for immediate evaluation within a unified harness. JBF features three core components: (i) JBF-LIB for shared contracts and reusable utilities; (ii) JBF-FORGE for the multi-agent paper-to-module translation; and (iii) JBF-EVAL for standardizing evaluations. Across 30 reproduced attacks, JBF achieves high fidelity, with a mean (reproduced minus reported) attack success rate (ASR) deviation of +0.26 percentage points. By leveraging shared infrastructure, JBF reduces attack-specific implementation code by nearly half relative to…
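The fidelity figure in the abstract is a mean signed deviation: for each attack, the reported ASR is subtracted from the reproduced ASR, and the differences are averaged. A minimal sketch of that arithmetic follows, assuming ASR values are given in percent. The function name `mean_asr_deviation`, the attack names, and all numbers are hypothetical illustrations, not values or code from the paper.

```python
from statistics import mean


def mean_asr_deviation(reported, reproduced):
    """Mean signed deviation (reproduced - reported), in percentage points.

    Both arguments map an attack name to its ASR in percent.
    """
    if reported.keys() != reproduced.keys():
        raise ValueError("both tables must cover the same attacks")
    if not reported:
        raise ValueError("need at least one attack")
    return mean(reproduced[name] - reported[name] for name in reported)


# Hypothetical ASR values in percent; NOT taken from the paper.
reported = {"attack_a": 80.0, "attack_b": 50.0, "attack_c": 20.0}
reproduced = {"attack_a": 82.0, "attack_b": 49.0, "attack_c": 20.5}

deviation = mean_asr_deviation(reported, reproduced)
print(f"{deviation:+.2f} pp")  # prints "+0.50 pp"
```

Because the deviation is signed, overshoots and undershoots can cancel, so a mean near zero does not by itself imply that every individual attack was reproduced closely; a mean absolute deviation is a natural complement when per-attack fidelity matters.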

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Security and Verification in Computing