Prompt Optimization and Evaluation for LLM Automated Red Teaming

Michael Freenor; Lauren Alvarez; Milton Leal; Lily Smith; Joel Garrett; Yelyzaveta Husieva; Madeline Woodruff; Ryan Miller; Erich Kummerfeld; Rafael Medeiros; and Sander Schulhoff

arXiv:2507.22133·cs.CR·July 31, 2025

Prompt Optimization and Evaluation for LLM Automated Red Teaming

Michael Freenor, Lauren Alvarez, Milton Leal, Lily Smith, Joel Garrett, Yelyzaveta Husieva, Madeline Woodruff, Ryan Miller, Erich Kummerfeld, Rafael Medeiros, and Sander Schulhoff

PDF

TL;DR

This paper presents a novel method for optimizing prompts in automated red teaming of LLMs by applying attack success rate measurements to individual attacks, enhancing vulnerability detection.

Contribution

It introduces a prompt optimization technique that uses repeated attack attempts to identify exploitable patterns, improving the robustness of LLM vulnerability assessments.

Findings

01

Enhanced attack success rate measurement at the individual attack level

02

Identification of exploitable prompt patterns

03

Improved robustness in automated red teaming evaluations

Abstract

Applications that use Large Language Models (LLMs) are becoming widespread, making the identification of system vulnerabilities increasingly important. Automated Red Teaming accelerates this effort by using an LLM to generate and execute attacks against target systems. Attack generators are evaluated using the Attack Success Rate (ASR) the sample mean calculated over the judgment of success for each attack. In this paper, we introduce a method for optimizing attack generator prompts that applies ASR to individual attacks. By repeating each attack multiple times against a randomly seeded target, we measure an attack's discoverability the expectation of the individual attack success. This approach reveals exploitable patterns that inform prompt optimization, ultimately enabling more robust evaluation and refinement of generators.

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.