When Prompt Optimization Becomes Jailbreaking: Adaptive Red-Teaming of Large Language Models

Zafir Shamsi; Nikhil Chekuru; Zachary Guzman; Shivank Garg

arXiv:2603.19247 · cs.CL · March 23, 2026 · ACL

TL;DR

This paper shows that large language models are vulnerable to adaptive prompt refinement attacks, which can substantially weaken their safety safeguards, and it argues that safety evaluations need dynamic red-teaming rather than fixed prompt sets alone.

Contribution

The paper introduces a systematic method that repurposes black-box prompt optimization for adaptive red-teaming, using it to evaluate LLM safety vulnerabilities.

Findings

01 Danger scores increase significantly after prompt optimization.

02 Open-source models are especially susceptible to safety failures.

03 Static benchmarks underestimate residual risks.

Abstract

Large Language Models (LLMs) are increasingly integrated into high-stakes applications, making robust safety guarantees a central practical and commercial concern. Existing safety evaluations predominantly rely on fixed collections of harmful prompts, implicitly assuming non-adaptive adversaries and thereby overlooking realistic attack scenarios in which inputs are iteratively refined to evade safeguards. In this work, we examine the vulnerability of contemporary language models to automated, adversarial prompt refinement. We repurpose black-box prompt optimization techniques, originally designed to improve performance on benign tasks, to systematically search for safety failures. Using DSPy, we apply three such optimizers to prompts drawn from HarmfulQA and JailbreakBench, explicitly optimizing toward a continuous danger score in the range 0 to 1 provided by an independent evaluator…
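The gap between static and adaptive evaluation described in the abstract can be illustrated with a minimal sketch. This is not the authors' DSPy pipeline: the greedy search below is a generic stand-in for a black-box optimizer, and `adaptive_search`, `toy_propose`, and `toy_score` are hypothetical names introduced only for illustration. In a real harness, the scoring callable would wrap the target model's response and the independent evaluator's danger score in the range 0 to 1; here it is a deterministic toy function over placeholder tokens.

```python
import random
from typing import Callable, Tuple


def adaptive_search(
    seed_prompt: str,
    propose: Callable[[str, random.Random], str],
    danger_score: Callable[[str], float],
    rounds: int = 5,
    candidates_per_round: int = 4,
    seed: int = 0,
) -> Tuple[float, float, str]:
    """Greedy black-box search (an assumed stand-in for a prompt optimizer).

    Returns (static_score, adaptive_score, best_prompt), where static_score
    is the score of the unmodified seed prompt and adaptive_score is the
    highest score found during refinement.
    """
    rng = random.Random(seed)
    static_score = danger_score(seed_prompt)
    best_prompt, best_score = seed_prompt, static_score
    for _ in range(rounds):
        for _ in range(candidates_per_round):
            candidate = propose(best_prompt, rng)
            score = danger_score(candidate)
            if not 0.0 <= score <= 1.0:
                raise ValueError("evaluator score must lie in [0, 1]")
            # Keep a candidate only if the evaluator rates it strictly higher.
            if score > best_score:
                best_prompt, best_score = candidate, score
    return static_score, best_score, best_prompt


def toy_propose(prompt: str, rng: random.Random) -> str:
    """Hypothetical mutation: append one abstract placeholder token."""
    return prompt + " " + rng.choice(["a", "b", "c"])


def toy_score(prompt: str) -> float:
    """Hypothetical evaluator: a toy function, not a real safety judge."""
    return min(1.0, prompt.count("b") / 5)


static, adaptive, best = adaptive_search("seed", toy_propose, toy_score)
print(f"static={static:.2f} adaptive={adaptive:.2f}")
```

Because the search only ever replaces the incumbent with a higher-scoring candidate, the adaptive score can never fall below the static one. That is the sense in which a fixed prompt set gives a lower bound on the risk an adaptive adversary can reach.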

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI)