When Prompt Optimization Becomes Jailbreaking: Adaptive Red-Teaming of Large Language Models
Zafir Shamsi, Nikhil Chekuru, Zachary Guzman, Shivank Garg

TL;DR
This paper shows that large language models are vulnerable to adaptive prompt refinement attacks, which can substantially weaken their safety safeguards, underscoring the need for dynamic red-teaming in safety evaluations.
Contribution
The paper introduces a systematic adaptive red-teaming method that uses black-box prompt optimization to probe LLM safety vulnerabilities.
Findings
Danger scores increase significantly after prompt optimization.
Open-source models are especially susceptible to safety failures.
Static benchmarks underestimate residual risks.
Abstract
Large Language Models (LLMs) are increasingly integrated into high-stakes applications, making robust safety guarantees a central practical and commercial concern. Existing safety evaluations predominantly rely on fixed collections of harmful prompts, implicitly assuming non-adaptive adversaries and thereby overlooking realistic attack scenarios in which inputs are iteratively refined to evade safeguards. In this work, we examine the vulnerability of contemporary language models to automated, adversarial prompt refinement. We repurpose black-box prompt optimization techniques, originally designed to improve performance on benign tasks, to systematically search for safety failures. Using DSPy, we apply three such optimizers to prompts drawn from HarmfulQA and JailbreakBench, explicitly optimizing toward a continuous danger score in the range 0 to 1 provided by an independent evaluator…
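As a rough illustration of how a continuous danger score in the range 0 to 1 might be compared before and after prompt optimization, the sketch below aggregates per-prompt scores. It is not the authors' code: the function names `validate_scores` and `danger_score_shift` are hypothetical, the scores are assumed to come from some external evaluator that is not shown, and the numbers in the usage lines are placeholders rather than results from the paper.

```python
from statistics import mean


def validate_scores(scores):
    """Return the scores as a list, checking each lies in [0, 1]."""
    scores = list(scores)
    for s in scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError(f"danger score out of range: {s}")
    return scores


def danger_score_shift(baseline, optimized):
    """Compare per-prompt danger scores from a static and an adaptive run.

    Both arguments are sequences of scores in [0, 1], aligned by prompt.
    Returns (mean baseline, mean optimized, mean per-prompt change).
    """
    baseline = validate_scores(baseline)
    optimized = validate_scores(optimized)
    if not baseline or len(baseline) != len(optimized):
        raise ValueError("score lists must be non-empty and the same length")
    deltas = [after - before for before, after in zip(baseline, optimized)]
    return mean(baseline), mean(optimized), mean(deltas)


# Placeholder values for illustration only (not data from the paper).
before, after, shift = danger_score_shift([0.0, 0.25, 0.5], [0.5, 0.75, 1.0])
print(f"static: {before:.2f}  adaptive: {after:.2f}  shift: {shift:+.2f}")
```

A positive mean shift would indicate that the adaptive evaluation surfaces more risk than the static one, which is the kind of gap the findings describe.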
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI)
